Compare commits

...

2370 Commits

Author SHA1 Message Date
97ff6cfd9c [Release only] Release 2.3 start using triton package from pypi (#123580) 2024-04-08 16:27:33 -04:00
fb38ab7881 Fix for MPS regression in #122016 and #123178 (#123385)
Fixes #122016 and #123178. This regression is related to an OS-side change that requires a slight adjustment on the PyTorch side to restore the previous behavior. Additionally, we cleared out pre-macOS 13 workarounds.

Before the fix on MacOS 14.4:

```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 3., 3.], device='mps:0')
```

After the fix:
```
python -c "import torch;x=torch.zeros(3, device='mps');x[1] = 1; x[2] = 3; print(x)"
tensor([0., 1., 3.], device='mps:0')
```

This also fixes complex number initialization and, as such, makes `nn.functional.rms_norm` pass on macOS 14+.

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123234
Approved by: https://github.com/malfet, https://github.com/kulinseth

(cherry picked from commit 05289a278c3eaca271061649982f38c435b50674)

Co-authored-by: Joona Havukainen <jhavukainen@apple.com>
2024-04-05 18:46:31 -04:00
23961cef85 [Release/2.3] Set py3.x build-environment name consistently (#123446)
https://github.com/pytorch/pytorch/pull/122157 checks for the Python version using `"$BUILD_ENVIRONMENT" != *py3.8*`, but some build environments use a different style with `py3_8` instead, causing NumPy 2.x to be wrongly installed there, i.e. 03b987fe3f. The mismatch is illustrated below.
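A rough illustration of the pattern mismatch (the environment names here are made up, not the actual CI job names):

```python
# A glob that looks for "py3.8" never matches environments written with an underscore.
import fnmatch

envs = ["linux-focal-py3.8-gcc7", "linux-focal-py3_8-clang10"]
print([e for e in envs if fnmatch.fnmatch(e, "*py3.8*")])  # only the dotted name matches
```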
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122247
Approved by: https://github.com/malfet

(cherry picked from commit 6fefc52a2b4f814c5bc85f4087a92ad7f6ee3abe)

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-04-05 09:01:19 -07:00
634cf5069a [Wheel] Change libtorch_cpu OpenMP search path (#123417) (#123442)
This prevents delocate from double-packing it, which would make Torch wheels
unusable with torch.compile out of the box.

Fixes https://github.com/pytorch/pytorch/issues/122705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123417
Approved by: https://github.com/atalman

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2024-04-05 10:22:39 -04:00
12d0e693d0 update submodule onnx==1.16.0 (#123387)
Fixes #121258

CC @malfet @atalman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123125
Approved by: https://github.com/malfet

(cherry picked from commit 19c2ed15c099c7ed9f96074584af6ab9da206f92)

Co-authored-by: pbialecki <piotr.bialecki@hotmail.de>
2024-04-04 20:47:38 -04:00
38acd812ab [MPS] Fwd-fix for clamp regression (#122148) (#123383)
Forward fix for regressions introduced by https://github.com/pytorch/pytorch/pull/121381 as we failed to run MPS CI twice on it

- Do not call `minimumWithNaNPropagationWithPrimaryTensor` for integral tensors as it will crash with
  ```
    /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSKernelDAG.mm:805: failed assertion `Error getting visible function: (null) Function isNaN_i16_i8 was not found in the library'
   ```
- Change the order of the max and min calls, as it's apparently important for
  consistency: `min(max(a, b), c)` might not equal `max(min(a, c), b)` if `c` is not always less than or equal to `b` (see the worked example below)
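A plain-Python worked example (values made up) of the inconsistency described in the second bullet:

```python
# clamp is computed as min(max(x, lo), hi); when hi < lo the two orders disagree.
x, lo, hi = 0, 3, 1
print(min(max(x, lo), hi))  # 1
print(max(min(x, hi), lo))  # 3 -- the other order gives a different answer
```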

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122148
Approved by: https://github.com/huydhn

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2024-04-04 16:29:42 -07:00
b197f540bc Use numpy 2.0.0rc1 in CI (#123356)
Bump numpy version to 2.0.0rc1 in CI

Related to: https://github.com/pytorch/pytorch/issues/107302
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123286
Approved by: https://github.com/huydhn, https://github.com/kit1980, https://github.com/ZainRizvi

(cherry picked from commit 26b4ccf9d171a4abb3b25d9f88fc594ea5aca1ce)

Co-authored-by: atalman <atalman@fb.com>
2024-04-04 19:02:49 -04:00
dc81d19aac [CI] Test that NumPy-2.X builds are backward compatible with 1.X (#123354)
By compiling PyTorch against the 2.x RC, but running all the tests with NumPy 1.X.

This has no effect on binary builds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122157
Approved by: https://github.com/atalman

(cherry picked from commit 03b987fe3fa93f398c0af5b40e512950c39a7cb6)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-04-04 19:00:35 -04:00
108305e47b Upgrade submodule pybind to 2.12.0 (#123355)
To fix https://github.com/pytorch/pytorch/issues/122056

Building with NumPy 2.0 allows me to run locally with both NumPy 2.0 and 1.26.
Any other tests we should run, @rgommers?

FYI @Skylion007 @atalman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122899
Approved by: https://github.com/Skylion007

(cherry picked from commit 6c2f36c9845f310db8ece23c0d2e4ad6f702bc57)

Co-authored-by: albanD <desmaison.alban@gmail.com>
2024-04-04 18:07:42 -04:00
a8b009185d Make PyTorch compilable against upcoming Numpy-2.0 (#121880) (#123380)
Test plan:
```
% python -c "import torch;import numpy;print(numpy.__version__, torch.tensor(numpy.arange(3, 10)))"
2.1.0.dev0+git20240312.9de8a80 tensor([3, 4, 5, 6, 7, 8, 9])
% python -c "import torch;print(torch.rand(3, 3).numpy())"
[[0.0931946  0.44874293 0.8480404 ]
 [0.93877375 0.10188377 0.67375803]
 [0.02520031 0.89019287 0.5691561 ]]

```
Fixes https://github.com/pytorch/pytorch/issues/121798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121880
Approved by: https://github.com/albanD

(cherry picked from commit 38d9bb5abcc31ba97927a5399b88afe2cf60bf64)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-04-04 14:22:26 -07:00
b67b277268 Fix torch.clamp in MPS to handle NaN correctly (#121381) (#122785)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri
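A minimal sketch of the NaN-propagating behavior the fix targets (the MPS line is commented out since it needs an Apple GPU; exact output may vary):

```python
import torch

x = torch.tensor([float("nan"), 0.5, 2.0])
print(torch.clamp(x, 0.0, 1.0))  # tensor([nan, 0.5000, 1.0000]) -- NaN is propagated on CPU
# After the fix, the MPS result should match the CPU result:
# print(torch.clamp(x.to("mps"), 0.0, 1.0).cpu())
```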

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet

(cherry picked from commit 40acc84aafa82f00a5b3966302638f344bef07bd)

Co-authored-by: Roger Lam <mrlamroger@gmail.com>
2024-04-04 13:26:29 -07:00
a8f93a5c71 [ONNX] beartype to emit warning instead of error by default (#123363)
Makes the exporter more "robust" to changes in the beartype tool.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123205
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2024-04-04 16:13:58 -04:00
fa07dc5132 [MPS] Fix naive matmul for BFloat16 (#123289)
This will only work on macOS 14 or newer, so the shader is compiled with `MTLLanguageVersion_3_1` when appropriate.

Fixes https://github.com/pytorch/pytorch/issues/121583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121731
Approved by: https://github.com/albanD

(cherry picked from commit 5498804ec2ac9aa62ba3bbf20149118142567d9b)

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2024-04-04 16:04:46 -04:00
2a82d31f78 fix breaking changes for ONNX Runtime Training (#123271)
Fixes breaking changes for ONNX Runtime Training.

PR https://github.com/pytorch/pytorch/pull/121102 introduced an incompatibility with ORT training because of a change in parameter type. This PR adds back the previous parameter types and has been verified to work with ORT training.

Error with current scenario:

```
site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor*’ to ‘DLManagedTensor*’ [-fpermissive]
at::Tensor tensor = at::fromDLPack(dlpack);

site-packages/torch/include/ATen/DLConvertor.h:15:46: note:   initializing argument 1 of ‘at::Tensor at::fromDLPack(DLManagedTensor*)’
TORCH_API Tensor fromDLPack(DLManagedTensor* src);
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122000
Approved by: https://github.com/malfet

(cherry picked from commit 765c3fc138fda4b49978403ee1394040221957cc)

Co-authored-by: Abhishek Jindal <abjindal@microsoft.com>
2024-04-03 18:52:06 -04:00
4bb5cb51e6 Fix swap_tensors path in _apply for modules that inherit from RNNBase (RNN, GRU, LSTM) (#122800) (#123116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122800
Approved by: https://github.com/albanD

(cherry picked from commit cc12668053ad847ff4a430e99eeebf99c136f3cd)
2024-04-02 16:16:37 -07:00
ef38d0572e nn.Module: use swap_tensors for Tensor subclasses (#122755) (#123106)
This fixes a bug when casting a module that has DTensor parameters. The old behavior swaps the `.data` field of the Tensor subclass, which is incorrect when dealing with tensor subclasses that may have multiple child tensors.

This uses the `swap_tensors` method to swap all of the tensors, not just the `.data` field.
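A hedged sketch contrasting the two behaviors with plain tensors (the real change targets subclasses such as DTensor inside `nn.Module._apply`; `torch.utils.swap_tensors` is assumed available, as in recent PyTorch):

```python
import torch

a, b = torch.ones(3), torch.zeros(3)
a.data = b  # old path: only the storage behind .data is replaced; other child tensors
            # held by a subclass would be left pointing at the old data

t1, t2 = torch.ones(2), torch.zeros(2)
torch.utils.swap_tensors(t1, t2)  # new path: the whole tensor objects are exchanged
print(t1)  # tensor([0., 0.])
```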

Test plan:

```
pytest test/distributed/_tensor/test_api.py -k 'test_distribute_module_casting'
python test/distributed/fsdp/test_wrap.py -k test_auto_wrap_smoke_test_cuda_init_mode1_cpu_offload0_use_device_id_True
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122755
Approved by: https://github.com/wanchaol, https://github.com/mikaylagawarecki

(cherry picked from commit e6ee8322d767ab241ce1651e7c178f539e8e3199)

Co-authored-by: Tristan Rice <rice@fn.lc>
2024-04-02 16:16:16 -07:00
5a53185e65 Remove cuda dependencies when building AOTriton (#122982) (#123179)
Downloading CUDA sometimes fails and breaks the build process, but
AOTriton does not need these packages. This commit comments out the
related downloading scripts.
2024-04-02 19:08:22 -04:00
bc9e23abb5 Fix performance regression and memory storage handling of Flash Attention on ROCM (#122857) (#122967)
This PR fixes the two major issues that were discovered after the initial merge of PR #121561:
1. The Flash Attention support added by that PR has severe performance regressions on regular shapes (power-of-two head dimensions and sequence lengths) compared with PR #115981. Its performance is worse than the math backend and it only has numerical-stability advantages. This PR fixes this problem.
2. There is a flaw in the memory storage handling in PR #121561 that does not copy the gradients back to the designated output tensor. This PR removes the deprecated `TensorStorageSanitizer` class, which is unnecessary thanks to the more flexible backward kernel shipped by PR #121561.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122857
Approved by: https://github.com/jeffdaily, https://github.com/drisspg
2024-04-02 18:53:19 -04:00
8194fae625 Pin protobuf to 3.20.2 on macOS (#123197)
The newer protobuf 5.26.0, released on March 13th, is causing failures with `test_hparams_*` from `test_tensorboard`, in which the stringified metadata is wrong when escaping double quotes. For example, 3bc2bb6781.  This looks like an upstream issue from TensorBoard, which doesn't yet work with this brand-new protobuf version: https://github.com/tensorflow/tensorboard/blob/master/tensorboard/pip_package/requirements.txt#L29

The package has been pinned on Docker https://github.com/pytorch/pytorch/blob/main/.ci/docker/requirements-ci.txt#L155, so it should be pinned on macOS too.  We want to eventually just have one requirements.txt file.

Fixes https://github.com/pytorch/pytorch/issues/122008
Fixes https://github.com/pytorch/pytorch/issues/121927
Fixes https://github.com/pytorch/pytorch/issues/121946
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121918
Approved by: https://github.com/kit1980
2024-04-02 15:08:09 -04:00
12acd4c9b3 [Cherrypick][DeviceMesh] Cache and reuse sliced result (#122975) (#123073)
Fixes #118849

Add a map for parent_to_child_mappings in _mesh_resources so that we can cache and reuse submesh slicing results. This avoids recreating the submesh and the underlying sub pg repeatedly, which could lead to funky behaviors.

We will follow up with reusing pg from the parent_mesh during submesh creation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122975
Approved by: https://github.com/wanchaol
2024-04-02 15:05:07 -04:00
857797d148 [CherryPick] Inductor cpp wrapper: fix dtype of ShapeAsConstantBuffer (#122297) (#123064)
For `at::scalar_tensor` the default dtype will be `float` ([link to scalar_tensor](0d8e960f74/aten/src/ATen/native/TensorFactories.cpp (L856)), [link to default dtype](0d8e960f74/c10/core/TensorOptions.h (L551))) if we don't set the `dtype` value. However, the input scalar value is not necessarily a `float` value. With `torch::tensor(x)`, the dtype of the tensor will be decided according to the dtype of the scalar.
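A Python-side illustration of the dtype difference described above:

```python
import torch

print(torch.scalar_tensor(3).dtype)  # torch.float32 -- falls back to the default dtype
print(torch.tensor(3).dtype)         # torch.int64   -- inferred from the Python int
```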

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122297
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-04-02 15:03:25 -04:00
233dfe4d6a Proper view support for jagged layout NestedTensor (#122854)
* Proper view support for jagged layout NestedTensor (#113279)

This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
    * `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
    * This ops is implemented on the Python side using torch.library so we can return a subclass instance
    * `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
    * The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
    * `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
    * `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
    * Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)

With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.

Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang

(cherry picked from commit cd6bfc7965fc5ae20720bae0994e332e56f819c0)

* Update executorch.txt

* Update executorch.txt

* Fix linter error

---------

Co-authored-by: Joel Schlosser <jbschlosser@meta.com>
Co-authored-by: Guang Yang <42389959+guangy10@users.noreply.github.com>
2024-04-02 11:46:53 -07:00
e22b534b10 Upgrade submodule oneDNN to v3.3.6 for release/2.3 (#122164) (#122930)
As the title says. This includes issue fixes for aarch64:
- https://github.com/oneapi-src/oneDNN/pull/1831
- https://github.com/oneapi-src/oneDNN/pull/1834

---

## Validation results
(on Intel CPU + Linux)
**Static quantization with Inductor on CV models**

Quant method | Geomean throughput ratio (v3.3.6/baseline)
-- | --
ptq | 0.982937
ptq (cpp wrapper) | 0.978384
qat | 0.978828

**Torchbench cpu userbenchmark with Inductor**

Items | Perf Geomean Ratio (v3.3.6/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 1.01x
jit_llga_throughtput_fp32 | 1.00x
eager_throughtput_fx_int8 | 1.00x
eager_throughtput_bf16_train | 1.46x
eager_throughtput_fp32_train | 1.41x

**Dynamo benchmarks tests**
Precision | Shape | Wrapper | Thread | Eager old/new GEOMEAN | Inductor old/new GEOMEAN
-- | -- | -- | -- | -- | --
Float32 | Static | Default | Multiple | 1.003836812 | 1.003425
Float32 | Static | Default | Single | 1.000181451 | 0.999611
Float32 | Dynamic | Default | Multiple | 1.003980183 | 1.006563
Float32 | Dynamic | Default | Single | 1.000076939 | 0.999969
AMP | Static | Default | Multiple | 0.996824772 | 0.998715
AMP | Static | Default | Single | 0.996402574 | 1.001483
AMP | Dynamic | Default | Multiple | 0.994919866 | 1.000467
AMP | Dynamic | Default | Single | 0.9962054 | 1.000767

(on Aarch64)
https://github.com/pytorch/pytorch/pull/122164#issuecomment-2007912919

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122164
Approved by: https://github.com/snadampal, https://github.com/malfet, https://github.com/atalman
2024-04-02 12:57:11 -04:00
8602990e3f [CherryPick] Back out "[DeviceMesh] Add support for nD slicing (#119752)" (#121763) (#122495)
Summary:
Original commit changeset: e52b8809c8d8

Original Phabricator Diff: D54778906

We have to back out this diff.
D54778906 seems to be causing test failures for APF, blocking trunk health and hence the release. Just starting to look at the issue. T182209248

Test Plan: Sandcastle

Reviewed By: satgera

Differential Revision: D54825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121763
Approved by: https://github.com/osalpekar

(cherry picked from commit e99fa0042cd3dcd2eded24585d59c53f2da9d9f5)
2024-03-28 14:25:08 -07:00
685cc955df [ROCm] Update triton rocm branch to release/2.3.x (#122493)
* Update triton rocm branch to release/2.3.x

* Remove ROCM_TRITION_VERSION and update to 2.3.0

* Remove unnecessary ROCm conditionalisation

* Skip failing UT
2024-03-28 14:18:37 -07:00
b1c2430fbd remove torchao dependency (#122635)
* remove torchao dependency (#122524)

Test Plan:
CI

```
buck2 run mode/dev-nosan mode/inplace executorch/examples/models/llama2:export_llama -- -c ~/llama/ultra_new_checkpoint.pt -p ~/llama/params.json -kv -E 8,8 -d fp32 --pt2e_quantize "xnnpack_dynamic" -2
```

```
buck run //executorch/backends/xnnpack/test:test_xnnpack_ops -- executorch.backends.xnnpack.test.ops.linear.TestLinear.test_qd8_fp32_per_token_weight_per_channel_group_int4
```

Differential Revision: D55263008

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122524
Approved by: https://github.com/jerryzh168

(cherry picked from commit c677221798d8ce87c97aac1bd9ae34af0767c383)

* Update executorch.txt

* Update _decomposed.py

* Update executorch.txt

* Update executorch.txt

* Update executorch.txt

* Update executorch.txt

* Update executorch.txt

---------

Co-authored-by: Guang Yang <guangyang@meta.com>
Co-authored-by: Guang Yang <42389959+guangy10@users.noreply.github.com>
2024-03-28 12:25:12 -07:00
3002eb2556 [export] hack skip index_put_ in dce (#122683) (#122721)
Summary: Ideally we should do what's in the TODO. Just doing this for now to unblock the llama capture.

Test Plan: capturing llama and using pt2e to quantize it

Differential Revision: D55354487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122683
Approved by: https://github.com/kimishpatel

(cherry picked from commit 41d24df08f72e059c4eebdde4315e63a9918406f)

Co-authored-by: Jacob Szwejbka <jakeszwe@meta.com>
2024-03-27 21:29:53 -07:00
e1a846d6b8 Fix auto_functionalize (#121990) (#122654)
Differential Revision: D54964130

When we re-export, the auto_functionalize HOP will be in the graph. Therefore, we need to implement a proper functionalization rule for it. Since the content inside auto_functionalize is guaranteed to be functional, it is OK to just fall through it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121990
Approved by: https://github.com/ydwu4, https://github.com/zou3519

(cherry picked from commit 0d845f7b0781f091452a5fd31de14e1c2117f3d4)

Co-authored-by: Tugsbayasgalan (Tugsuu) Manlaibaatar <tmanlaibaatar@meta.com>
2024-03-27 21:28:56 -07:00
4a9a8c606d [export] add pass to remove auto functionalized hop (#122246) (#122655)
Summary: Adds a pass that blindly removes the functionalize hop without considering whether it is safe. Useful for ExecuTorch today and other use cases that have additional logic that can reason about when this pass is safe to use.

Test Plan: added unit test

Differential Revision: D55103867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122246
Approved by: https://github.com/angelayi

(cherry picked from commit c84f81b395fff969bbd2f784efad8ab1a8aa52de)

Co-authored-by: Jacob Szwejbka <jakeszwe@meta.com>
2024-03-27 21:05:15 -07:00
d3201f48b1 Revert "Revert "CI: Specify libc and libstdcxx versions in conda environments"" (#122523)
This reverts commit 74832f12fae2e1bc51bf1f9971dcd12c90a971f5.
2024-03-22 17:41:42 -04:00
74832f12fa Revert "CI: Specify libc and libstdcxx versions in conda environments" (#122497)
This reverts commit b4f90aae1b375bfe06d3c4a099240e06f93c81c4.
2024-03-22 11:27:50 -04:00
02cdb400d7 Use temporary name for triton package, fix lint (#122438)
* Use temporary name for triton package

* Fix lint
2024-03-21 17:30:38 -04:00
37257774c6 Triton wheel build using 2.3.x branch (#122403)
* Triton build 2.3.x

* Revert "[Release Only] Build triton using pinned version rather branch (#121765)"

This reverts commit d69c4219127e2cf5d9637b0daacc0a24e65f8133.

* Triton wheel change

* release
2024-03-21 12:52:21 -04:00
c4e5434423 necessary change to make torch2.3 work with triton2.2 (#122139) 2024-03-21 08:24:53 -04:00
b4f90aae1b CI: Specify libc and libstdcxx versions in conda environments (#121929)
Without this we get mismatches between the GLIBC and GLIBCXX ABI used
by conda packages vs pytorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121556
Approved by: https://github.com/isuruf, https://github.com/malfet

(cherry picked from commit 7a53dedb07ed72b85d1e083ce38c43c7810fc5f1)

Co-authored-by: Peter Bell <peterbell10@live.co.uk>
2024-03-14 17:56:46 -04:00
94d6463255 [RELEASE ONLY CHANGES] Increase timeout for linux binary jobs, fix workflow lint (#121851)
* [release only] Increase timeout job for linux binary builds by 30min

* fix lint
2024-03-13 19:50:57 -04:00
6a89a753b1 [RELEASE ONLY CHANGES] Apply release only changes Release 2.3 (#121813)
* [Release only changes] Release only changes #2

* common+lint
2024-03-13 11:03:48 -04:00
d69c421912 [Release Only] Build triton using pinned version rather branch (#121765) 2024-03-12 19:05:23 -04:00
6725db07ae [RELEASE ONLY CHANGES] Apply release only changes Release 2.3 (#121726)
* Apply release only changes

* temp changes

* tweak

* fix

* Revert "tweak"

This reverts commit 38edcac21448829ac114c73423c84614628e2598.
2024-03-12 18:14:35 -04:00
86a2d67bb9 Simplify guards using info from previous guards (#121463)
Let me see what CI thinks about this one. Will add tests tomorrow.

Fixes https://github.com/pytorch/pytorch/issues/119917
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121463
Approved by: https://github.com/ezyang
2024-03-12 04:22:20 +00:00
703e83e336 Fix AARCH64 builds (#121700)
After https://github.com/pytorch/pytorch/pull/119992 was landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121700
Approved by: https://github.com/janeyx99, https://github.com/huydhn
2024-03-12 04:17:47 +00:00
159f30331f [quant][pt2e] Call sub-quantizers' transform_for_annotation in ComposableQuantizer (#121548)
Test Plan:
```
buck run caffe2/test:quantization_pt2e
```

Differential Revision: D54454707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121548
Approved by: https://github.com/jerryzh168
2024-03-12 02:59:12 +00:00
7fc497711d Also test predispatch serialization (#121652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121652
Approved by: https://github.com/zhxchen17, https://github.com/angelayi
2024-03-12 02:37:59 +00:00
6ca9ae4f86 Express y grid > 2^16 in terms of z grid (#121554)
CUDA has a max y_grid of 65535. If we're computing larger than that we can compose it in terms of z grid, which is currently unused in inductor codegen.
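A hedged sketch of the grid-splitting arithmetic (not Inductor's actual codegen): fold an oversized logical y extent into (y, z) so each launch dimension stays within the CUDA limit.

```python
CUDA_MAX_Y_GRID = 65535

def split_y_grid(y_total):
    z = (y_total + CUDA_MAX_Y_GRID - 1) // CUDA_MAX_Y_GRID  # number of z slices needed
    y = (y_total + z - 1) // z                              # y per slice, now <= 65535
    return y, z

print(split_y_grid(200_000))  # (50000, 4); a block's logical y index becomes pid_y + pid_z * y
```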

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121554
Approved by: https://github.com/aakhundov
2024-03-12 02:36:19 +00:00
fb1d7935bb [optim][BE] move complex_2d (last of complex tests) to OptimInfo (#120618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120618
Approved by: https://github.com/albanD
2024-03-12 02:33:21 +00:00
a37e22de70 Add Flash Attention support on ROCM (#121561)
This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton)

- [x] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
    * MI300X is supported. More architectures will be added once Triton supports them.
- [x] Only supports power-of-two sequence lengths.
    * Now it supports arbitrary sequence lengths.
- [ ] No support for varlen APIs.
    * varlen APIs will be supported in the next release of AOTriton.
- [x] Only supports head dimensions 16, 32, 64, 128.
    * Now it supports arbitrary head dimensions <= 256.
- [x] Performance is still being optimized.
    * The kernel is selected according to autotune information from Triton.

Other improvements from AOTriton include
* Allow more flexible Tensor storage layout
* More flexible API

This is a more extensive fix to #112997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/malfet, https://github.com/atalman
2024-03-12 01:16:53 +00:00
3a5f48d55f Port remove_split_ops to PT2 pre-grad passes (#121674)
Summary: For OEMAE, this contributes 14% of the total DPER pass perf gain.

Test Plan:
Run test cases

Run the oemae lower benchmark with and without this fix. FLOP/s 29 -> 34.

Reviewed By: frank-wei

Differential Revision: D54711064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121674
Approved by: https://github.com/frank-wei
2024-03-12 01:15:19 +00:00
5b5d423c2e Benchmark templates (#118880)
Adding support for benchmarking templates in `benchmark_fusion`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118880
Approved by: https://github.com/shunting314
2024-03-11 23:55:13 +00:00
7676433012 [AOTInductor] Reuse generated kernels between constant graph and main graph (#121564)
Summary: We copy the src_to_kernel from the constant graph to the main graph so that we can avoid generating duplicate kernels, and pass through the name counter such that no duplicated names will be generated.

Test Plan: Included in commit

Differential Revision: D54706767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121564
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2024-03-11 22:44:38 +00:00
272cf29e4d [FSDP2][BE] Refactored check_1d_sharded_parity to use mesh (#121357)
Eventually, we should just have one unified way to check for parity between a `DTensor`-sharded model and a replicated model. This PR is a small refactor to work toward that. One current gap to use this `check_sharded_parity` function for 2D is that FSDP's `(Shard(0), Shard(0))` layout differs from that of the `DTensor` APIs since FSDP shards on dim-0 after TP shards on dim-0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121357
Approved by: https://github.com/weifengpy
ghstack dependencies: #121360
2024-03-11 22:34:42 +00:00
cd1dc5e484 Delete requirements-flake8.txt (#121657)
The file seems to be unused and also has a different flake8 version than .lintrunner.toml, creating confusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121657
Approved by: https://github.com/huydhn, https://github.com/seemethere, https://github.com/malfet
2024-03-11 22:29:25 +00:00
fd0dbcd891 Revert "Batch Norm Consolidation (#116092)"
This reverts commit 7b4f70eda519ccd7f28de17689edd43c52743bc9.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))
2024-03-11 22:22:41 +00:00
498a94a7f5 Don't install torchfix for python<3.9 (#121655)
Fixes https://github.com/pytorch/pytorch/issues/121591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121655
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-03-11 22:18:42 +00:00
b2f09c1859 Revert "[compiled autograd] support custom ops backed by c++ autograd::Function (#120681)"
This reverts commit d27509c384c9847cd2ac1f5d63ec143704b50591.

Reverted https://github.com/pytorch/pytorch/pull/120681 on behalf of https://github.com/xmfan due to breaking internal builds, see D54707287 ([comment](https://github.com/pytorch/pytorch/pull/120681#issuecomment-1989542344))
2024-03-11 22:18:36 +00:00
d1f45a93af Check for releasing GIL at compiletime (#116695)
Introduce `conditional_gil_scoped_release` and use it in `wrap_pybind_function*` to avoid a runtime branch, making the code cleaner and faster.

@albanD This is the GIL change extracted from #112607 as discussed.

Also fixes a potential use of a moved-from object introduced in #116560:
- `f` is captured by value in a lambda that may be called multiple times
- After `std::move(f)` the lambda is not safe to call anymore

CC @cyyever for that change
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116695
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-03-11 22:04:56 +00:00
fd13a56f61 Refactor some testing helpers for FX graph cache testing (#121520)
Summary: I plan to enable the FX graph cache for more inductor unit tests. This PR does some refactoring to prepare by moving the `TestCase` base class to `torch._inductor.test_case` (which mirrors the existing `torch._dynamo.test_case`). In a subsequent diff, I'll modify tests importing `torch._dynamo.test_case.TestCase` to use `torch._inductor.test_case.TestCase` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121520
Approved by: https://github.com/eellison
2024-03-11 21:46:27 +00:00
e01b07e1e8 [ROCm] Autocast RNN Support (#121539)
Fixes #116361

Implements Autocast wrapper for miopen rnn's

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121539
Approved by: https://github.com/albanD, https://github.com/jeffdaily
2024-03-11 21:14:43 +00:00
fc712311ce port fuse_parallel_linear (without changing weights) to PT2 pre-grad (#121617)
Summary: Does not change the weights structure, so it stays compatible with const folding and realtime weight updates.

Test Plan: run added test cases

Reviewed By: frank-wei

Differential Revision: D53843428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121617
Approved by: https://github.com/frank-wei
2024-03-11 20:51:11 +00:00
3461404869 [pt2 export]fix name collision on constant name (#121145)
Summary: Taking the rightmost part of the FQN causes name conflicts when there are multiple instances of the same class. Changed to replace "." in the FQN with "_" to avoid invalid syntax in input args, as illustrated below.
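A hedged illustration of the renaming (the FQNs are made up): keeping only the rightmost part collides across instances, while replacing "." with "_" stays unique and is a valid identifier.

```python
fqns = ["blocks.0.attn.scale", "blocks.1.attn.scale"]
print([fqn.split(".")[-1] for fqn in fqns])     # ['scale', 'scale'] -- name collision
print([fqn.replace(".", "_") for fqn in fqns])  # ['blocks_0_attn_scale', 'blocks_1_attn_scale']
```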

Test Plan: added test case

Differential Revision: D54435230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121145
Approved by: https://github.com/zhxchen17
2024-03-11 20:40:59 +00:00
b091a32909 Add a section on release wiki about pytorchbot cherry-pick command (#121648)
I added a section about the new `pytorchbot cherry-pick` command to the release wiki so that more people know about it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121648
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-03-11 20:09:58 +00:00
dd2062c737 fix CMake FindCUDA module for cross-compiling (#121590)
Fix two cross-compiling issues in `FindCUDA.cmake` (xref: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/224).

1. `setup.py` reads the cached `CUDA_TOOLKIT_ROOT_DIR`, so it must be cached.
41286f1505/setup.py (L593)

I also submitted it to the upstream CMake: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/9323.

2. [SBSA toolkit](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=arm64-sbsa&Compilation=Cross&Distribution=Ubuntu&target_version=20.04&target_type=deb_network_cross) is in `sbsa-linux` directory. See also https://gitlab.kitware.com/cmake/cmake/-/issues/24192

I also submitted it to the upstream CMake: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/9324
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121590
Approved by: https://github.com/malfet
2024-03-11 20:09:52 +00:00
5fd7f5c4e3 Include torch warn in each error in cudnn/Conv_v8.cpp (#120719)
Fixes #120702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120719
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-03-11 20:05:42 +00:00
9aa3fedb75 Slightly faster FX graph iterator (#121611)
Before:
```
iterating over 100000000 FX nodes took 5.9s (16830686 nodes/s)
```

After:
```
iterating over 100000000 FX nodes took 5.0s (19937698 nodes/s)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121611
Approved by: https://github.com/oulgen
2024-03-11 20:00:19 +00:00
ae22bdaefe Update torchbench commit pin, add sam_fast benchmark (#121420)
After this, the sam_fast benchmark can now be run in the pytorch repo:
```
SEGMENT_ANYTHING_FAST_USE_FLASH_4=0 benchmarks/dynamo/torchbench.py --inference --amp --performance --backend=inductor --explain --only sam_fast
```

sam_fast is designed for inference only, with cuda and amp on. The code adds these restrictions to the benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121420
Approved by: https://github.com/oulgen, https://github.com/msaroufim
2024-03-11 19:48:53 +00:00
dccc1ca839 [torch] Use __prepare_scriptable__ for closures (#121553)
Summary:
This fixes a case left incomplete by https://github.com/pytorch/pytorch/pull/106229
The object is using __prepare_scriptable__ correctly inside of torch.jit.script()
but the clousre that is obtained below is using the non-prepared version.
This causes issues when the prepared and non-prepared versions are in different python modules.

Test Plan:
```
buck2 run mode/opt caffe2/test:jit -- -r test_decorator
```

Differential Revision: D54308741

Re-exporting, as #120806 #121307 were not properly merged.

Co-authored-by: Daniel Herrera <dherrera@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121553
Approved by: https://github.com/huydhn, https://github.com/seemethere
2024-03-11 19:14:19 +00:00
b4160fd9c7 Clean up macOS x86 binaries build jobs (#116726)
This will stop building binaries for macOS x86 on PyTorch, including nightly and all future releases.  If we want this for 2.2, it can be cherry-picked there.

* [x] https://github.com/pytorch/pytorch/pull/116725
* [ ] https://github.com/pytorch/pytorch/pull/116726

Fixes https://github.com/pytorch/pytorch/issues/114602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116726
Approved by: https://github.com/atalman
2024-03-11 19:09:39 +00:00
8d03c59d59 Bring torch_xla pin to the latest torch_xla commit (03/08/2024). (#121529)
Update the torch_xla pin to a more recent one (03/08/2024). We need to make sure the torch_xla pin stays up-to-date so that PyTorch can test against an up-to-date version of torch_xla.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121529
Approved by: https://github.com/atalman
2024-03-11 18:25:42 +00:00
39ed038f41 [TEST] Prepare test_cumulative_trapezoid for SciPy 1.12 (#121541)
Follow up on #119326 with addressed comment: https://github.com/pytorch/pytorch/pull/119326#issuecomment-1939428705:
> I'd like to propose a slightly different approach. We could check if scipy is version `1.12.0`. If so, overload `scipy_cumulative_trapezoid` with a function that specifically checks `t.shape[axis] == 0`, and in that case return an array of the same shape as `t`, which is the expected behavior as far as I understand. That way, we're not just skipping the test cases

I would like to add that the version check is not necessary as in any case the outcome is the same.
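A hypothetical shim mirroring the suggestion quoted above (not the actual test helper; the function name is made up):

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid

def reference_cumulative_trapezoid(t, *args, axis=-1, **kwargs):
    if t.shape[axis] == 0:
        # Expected behavior per the discussion: an array of the same shape as t.
        return np.empty_like(t)
    return cumulative_trapezoid(t, *args, axis=axis, **kwargs)
```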

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121541
Approved by: https://github.com/nWEIdia, https://github.com/albanD
2024-03-11 17:48:29 +00:00
6801595349 Fix round robin sharding (#121022)
Fix round robin sharding when there are no test times and sort_by_time=False

Adds more tests to test_test_selections for sort_by_time=False
Adds more checks to test_split_shards_random for serial/parallel ordering + ordering of tests
Refactoring of dup code

Tested locally by running `python test/run_test.py --shard 3 5` with no test times downloaded and checked that it wasn't an empty list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121022
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-03-11 17:30:12 +00:00
e2ac2dc13a Update NCCL submodule to v2.20.5 (#121635)
Updates the NCCL submodule to 2.20.5. Includes a lot of bugfixes for reduction and connection issues. Should also improve performance. We have been running 2.20.5 internally for a few weeks; the binary pip wheels have finally been published, so we can update main.

Release notes here: https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-20-5.html#rel_2-20-5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121635
Approved by: https://github.com/malfet
2024-03-11 17:23:59 +00:00
89add71168 fix synchronization behavior for copies with type change (#121341)
Fixes #121320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121341
Approved by: https://github.com/albanD
2024-03-11 17:09:45 +00:00
03717430cc Fix lower precision check for MKLDNN on Windows (#121618)
Fixes #120788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121618
Approved by: https://github.com/xuhancn, https://github.com/jgong5, https://github.com/mingfeima, https://github.com/seemethere
2024-03-11 16:09:20 +00:00
e29004615f Add NEON accelerated torch.mv kernel (#119992)
This reduces `torch.mv` time for a 256x768 matrix by a 256-element vector from 209 usec to 16 usec for the non-transposed case, and from 104 to 18 usec if transposed.

Also, add an fp16-accumulation flavor to the same ops (controlled by the private `torch._C._set_cpu_allow_fp16_reduced_precision_reduction`, which yields slightly better numbers), summarized in the following table:

| op | original | F32+NEON | F16+NEON|
| ---| -------- | ---------- | ----- |
| torch.mv(m, v) | 209.53 usec | 16.25 usec | 14.68 usec |
| torch.mv(m.t(), v) |  104.80 usec | 28.68 usec | 24.82 usec |
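A hedged micro-benchmark sketch for the shapes quoted above; absolute numbers depend entirely on the machine:

```python
import timeit
import torch

m, v, vt = torch.rand(256, 768), torch.rand(768), torch.rand(256)
t1 = timeit.timeit(lambda: torch.mv(m, v), number=10_000) / 10_000
t2 = timeit.timeit(lambda: torch.mv(m.t(), vt), number=10_000) / 10_000
print(f"torch.mv(m, v):     {t1 * 1e6:.1f} usec")
print(f"torch.mv(m.t(), v): {t2 * 1e6:.1f} usec")
```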

Test plan: CI on macOS for both CPU and MPS tests fp32<->fp16 matmul consistency (for example, "test_output_grad_match_nn_functional_linear_cpu_float16" passes if fp32 reductions are performed, but fails if fp16 accumulation is used).

To investigate:
 - why replacing `sum0Vec = vaddq_f32(sum0Vec, vmulq_f32(a0Vec, xVec));` with `sum0Vec = vfmaq_f32(sum0Vec, a0Vec, xVec);` slows down gemv from 16.2 to 18.2 usec

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119992
Approved by: https://github.com/mikekgfb
2024-03-11 16:00:01 +00:00
fac06a12c8 CI sanity check test for env vars (#120519)
Make a test that fails on purpose to trigger retries.  Check the opposite of success (that env vars exist).

It's a bit hacky because I want it to fail in the normal flow in order to trigger reruns, but I don't want to expose the failures to users since it's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
2024-03-11 15:35:45 +00:00
6c11d3ce0c Add support to save safetensors checkpoint directly into onnx (#121001)
Currently, when `torch.onnx.dynamo_export` is called within `torch.onnx.enable_fake_mode`, all the external PyTorch checkpoint files used to initialize the model are automatically used by `torch.onnx.ONNXProgram.save` to recreate the initializers for the newly exported ONNX model.

This API extends the mechanism for HuggingFace models that use safetensors weights. This PR detects safetensors state files and converts them to PyTorch format using mmap on a temporary file, which is deleted after conversion is finished.

Without this PR, the user would have to convert the safetensors files to pytorch format manually and feed it to `torch.onnx.ONNXProgram.save` manually.
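A hedged sketch of the manual conversion this PR automates (file names are placeholders; requires the `safetensors` package):

```python
import torch
from safetensors.torch import load_file

state_dict = load_file("model.safetensors")  # safetensors checkpoint -> dict of torch.Tensors
torch.save(state_dict, "pytorch_model.bin")  # PyTorch-format file that ONNXProgram.save can consume
```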
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121001
Approved by: https://github.com/BowenBao, https://github.com/malfet
2024-03-11 15:21:59 +00:00
485f8ebc07 add __repr__ function to FunctionSchema for Python (#121484)
Fixes #118566

Unlike **OpOverload** or **OpOverloadPacket**, there is a lot of complex information in the schema, so keeping it as-is is probably a good choice; in theory, though, the **\_\_repr__** function should show the class name as well as some other key information.

If you have any suggestions, please share them, thank you.
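A hedged example of what the new `__repr__` is useful for (the exact output format may differ):

```python
import torch

schema = torch.ops.aten.add.Tensor._schema  # a FunctionSchema object
print(repr(schema))
# e.g. aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
```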

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121484
Approved by: https://github.com/Skylion007
2024-03-11 15:16:50 +00:00
d1510e01fa Upgrade submodule onednn to v3.3.5 (#120767)
This upgrade contains the fixes to the known issues brought by oneDNN v3.3.2, including issues https://github.com/pytorch/pytorch/issues/115346, https://github.com/pytorch/pytorch/issues/120211 and https://github.com/pytorch/pytorch/issues/120406 and those listed in PR #112700.

Issue https://github.com/pytorch/pytorch/issues/115346 (perf regression) was fixed by oneDNN v3.3.4. No new regression was found with v3.3.5. The detailed results of v3.3.4 are given below and compared with v3.1.1 (the oneDNN version in PyTorch before it was updated to v3.3.2).
1. A performance regression with 5.8% perf drop from `pytorch_stargan-train` (see https://github.com/pytorch/benchmark/issues/2076#issuecomment-1847545843)
Validation results with this patch: Latency increased by 0.60%
```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)
oneDNN v3.1.1
metrics-1484287.json
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
    },
    "metrics": {
        "latency": 418.851717
    }
}
oneDNN v3.3.4
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
    },
    "metrics": {
        "latency": 421.381313
    }
}
```

2. Performance regression of FP32 rexnet_100 with Inductor, dynamic shape, multi-threads (see https://github.com/pytorch/pytorch/issues/115346#issue-2030859592)
Validation results with this patch: Latency reduced by 3.23%
```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)
oneDNN v3.1.1
(inductor speedup over eager mode) 2.876x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,2.875904,113.314765,18.455283,0.990437,1302.636134,1315.212902,351,1,0,0

oneDNN v3.3.4
(inductor speedup over eager mode) 3.003x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,3.003012,109.653012,91.547260,0.990048,1302.532506,1315.625370,351,1,0,0
```

3. Performance regression of AMP hf_T5_generate and tinynet_a with Inductor, static shape, multi-threads (see https://github.com/pytorch/pytorch/issues/115346#issuecomment-1856029962)
Validation results with this patch: Latency reduced by 0.85%
```
Tested on an AWS spr metal instance
oneDNN v3.1.1
(inductor speedup over eager mode) 1.120x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.120018,1197.807729,205.905466,0.442803,125.179904,282.698957,10550,48,8,4

oneDNN v3.3.4
(inductor speedup over eager mode) 1.134x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.133594,1187.701514,205.855527,0.422012,128.405094,304.268493,10550,48,8,4
```

The following issues about functionality are fixed by this upgrade. Test cases are also added for these issues.
- https://github.com/pytorch/pytorch/issues/120211
- https://github.com/pytorch/pytorch/issues/120406
- https://github.com/pytorch/pytorch/issues/120547

-----

Below are detailed data of torchbench CPU userbenchmark test and Inductor FP32/AMP inference tests. No regression of perf or functionality was found.
I.  *torchbench CPU userbenchmark test*
Suite | Speedup
-- | --
eager_throughtput_bf16_infer | 1.001848
eager_throughtput_fp32_infer | 1.000257
eager_throughtput_fx_int8 | 1.003069
jit_llga_throughtput_amp_bf16 | 1.000682
jit_llga_throughtput_fp32 | 1.000313
eager_throughtput_bf16_train | 0.998222
eager_throughtput_fp32_train | 1.003384

II. *Inductor FP32/AMP inference tests*
i.  FP32 static default
suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.09
timm_models | tinynet_a | multiple | 128 | 1.14

ii.  FP32 dynamic default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | alexnet | multiple | 128 | 1.08
torchbench | basic_gnn_edgecnn | multiple | 1 | 0.98
torchbench | timm_efficientnet | multiple | 64 | 1.08

iii. AMP static default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | hf_distil_whisper | multiple | 1 | 1.18
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | BartForConditionalGeneration | multiple | 2 | 1.19
timm_models | eca_halonext26ts | multiple | 128 | 1.13
timm_models | nfnet_l0 | multiple | 128 | 1.13
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | spnasnet_100 | multiple | 128 | 1.15
timm_models | tf_efficientnet_b0 | multiple | 128 | 1.22
timm_models | tinynet_a | multiple | 128 | 1.49
torchbench | hf_Bert_large | single | 1 | 1.16
huggingface | XLNetLMHeadModel | single | 1 | 1.07

iv.  AMP dynamic default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | PLBartForConditionalGeneration | multiple | 4 | 1.14
timm_models | nfnet_l0 | multiple | 128 | 1.15
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | tinynet_a | multiple | 128 | 1.34
huggingface | XLNetLMHeadModel | single | 1 | 1.09

-----

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120767
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman
2024-03-11 12:56:59 +00:00
605c0a28aa [dtensor][debug] force visualize_sharding not to print for empty tensors (#121217)
**Summary**
The current `visualize_sharding` code cannot print for empty DTensor objects, which leads to an exception. This PR skips the print logic if the DTensor passed in has 0 elements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121217
Approved by: https://github.com/wanchaol
ghstack dependencies: #121385, #121382
2024-03-11 09:22:49 +00:00
3a5ab17bdc [dtensor][debug] visualize_sharding skip if the current rank is not in mesh (#121382)
**Summary**
We should skip the `visualize_sharding()` function on those ranks that are not a part of the DTensor's mesh. Otherwise, an exception will be thrown by the current visualize logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121382
Approved by: https://github.com/wanchaol
ghstack dependencies: #121385
2024-03-11 09:22:49 +00:00
b383123e37 [dtensor][debug] visualize_sharding only compute offset on the first rank in mesh (#121385)
**Summary**
avoid computing on ranks where we do not plan to visualize the DTensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121385
Approved by: https://github.com/wanchaol
2024-03-11 09:22:31 +00:00
9c50ecc84b Fix get_rank under a non-default group. (#120481)
Fixes #120213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120481
Approved by: https://github.com/yifuwang
2024-03-11 05:40:54 +00:00
7cc476ea16 [dynamo] Fix support for nn.Parameter constructor (part 1) (#120163)
This captures calls to `torch.nn.Parameter` by lifting them to graph inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120163
Approved by: https://github.com/albanD, https://github.com/yanboliang
ghstack dependencies: #121086
2024-03-11 05:14:42 +00:00
32488b0664 [dynamo] Support _unsafe_set_version_counter (#121086)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121086
Approved by: https://github.com/yanboliang
2024-03-11 05:14:42 +00:00
7a4e451184 [Dynamo] Fix function overrides (#120885)
To check for the existence of `__torch_function__`, the code intended to iterate over each element but got a `TupleVariable` when the ordinary `has_torch_function()` was being used. A further unpack is needed in this case.

Fixes #120653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120885
Approved by: https://github.com/yanboliang
2024-03-11 02:18:43 +00:00
f11f2b0d55 split predispatch pass into multiple passes (#121592)
Summary:
It's very difficult to debug the passes' ineffectiveness with them mingled in one single pass container. Here we extract them into separate passes with diagnostics info.

This is also required for a later change, where we must run shape prop on each of these passes, in order for the subsequent passes to have the correct shape information.

Reviewed By: frank-wei

Differential Revision: D53579545

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121592
Approved by: https://github.com/frank-wei
2024-03-11 00:30:55 +00:00
13e8181b7b relax assertion on fake shape (#121599)
Summary: It seems that if you use `capture_pre_autograd_graph`, fake tensor shapes can be ints instead of symints.

Test Plan: fixes the AssertionError in N5057219

Differential Revision: D54729142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121599
Approved by: https://github.com/angelayi, https://github.com/BoyuanFeng
2024-03-10 22:51:10 +00:00
660ec3d38d [Export] Fix bug removing node from wrong graph (#121574)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121574
Approved by: https://github.com/ydwu4
2024-03-10 04:46:11 +00:00
41286f1505 [IntraNodeComm] fix a hybridCubeMeshAllReduceKernel breakage caused by a recent refactor (#121575)
`hybridCubeMeshAllReduceKernel` uses the latter half of the p2p buffers as relay buffers. The relay buffer address is calculated using a bf16 base pointer and the buffer size in bytes. The breakage was caused by not taking the element size into account, as illustrated below.
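A plain-Python illustration of the offset arithmetic described above (the buffer size is made up):

```python
buffer_bytes = 4096
elem_size = 2                                        # sizeof(bf16)
relay_offset_elems = buffer_bytes // 2 // elem_size  # latter half, expressed in elements
wrong_offset_elems = buffer_bytes // 2               # byte count misused as an element offset
print(relay_offset_elems, wrong_offset_elems)        # 1024 vs 2048 elements past the bf16* base
```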

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121575
Approved by: https://github.com/Chillee
2024-03-10 00:55:25 +00:00
60cd2a43ca [DeviceMesh] Add support for nD slicing (#119752)
Fixes one of the issues mentioned in #118639.
@mvpatel2000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119752
Approved by: https://github.com/wanchaol
2024-03-10 00:16:37 +00:00
e90cddb0d3 [inductor] Log triton kernel source and metadata on failure (#120494)
If Triton compilation fails it's much easier to debug when given the
kernel source directly, versus a PyTorch repro.

This would have helped root cause
https://github.com/pytorch/pytorch/issues/118589 almost immediately

Differential Revision: [D54119568](https://our.internmc.facebook.com/intern/diff/D54119568/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120494
Approved by: https://github.com/peterbell10, https://github.com/eellison, https://github.com/jansel
2024-03-09 20:12:27 +00:00
168a04e752 [inductor] Changes to support newer triton pin (#121267)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121267
Approved by: https://github.com/lezcano
ghstack dependencies: #121438
2024-03-09 18:17:36 +00:00
459c5bca58 [inductor] Refactor common triton imports into one function (#121438)
This means when codegen depends on a particular import we only need to
add it in one place and it's applied to all triton kernels.

This also changes codegen slightly so instead of generating
`@pointwise` we now generate `@triton_heuristics.pointwise` just so
the imports are the same for all kernel types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121438
Approved by: https://github.com/lezcano
2024-03-09 18:17:36 +00:00
8c96b4367a Remove opmath cast for im2col decomp (#121363)
It is unclear why an opmath cast is needed for the im2col decomp, given that the decomposition mainly performs padding, slicing, indexing and shape manipulation. There is no need to perform these operations in higher precision; doing so requires more memory and yields less performance.

Sample script to demonstrate inserted cast before this change

```python
import torch
from torch._decomp.decompositions import im2col

def func(x):
    return torch.nn.functional.unfold(
        x, kernel_size=[3, 1], padding=[2, 0], dilation=1, stride=1
    )

x = torch.rand(1, 1, 5, 5, dtype=torch.float16)

eo = torch._dynamo.export(
    func, aten_graph=True, decomposition_table={torch.ops.aten.im2col.default: im2col}
)(x)
eo.graph_module.print_readable()
```

```
class GraphModule(torch.nn.Module):
    def forward(self, x):
        arg0: "f16[1, 1, s0, s0]";

        arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
        arg0_1 = arg0

        _to_copy: "f32[1, 1, s0, s0]" = torch.ops.aten._to_copy.default(arg0_1, dtype = torch.float32)
        ...
        constant_pad_nd: "f32[1, 1, s0 + 4, s0]" = torch.ops.aten.constant_pad_nd.default(_to_copy, [0, 0, 2, 2], 0.0);  _to_copy = None
        ...
        slice_1: "f32[1, 1, s0 + 4, s0]" = torch.ops.aten.slice.Tensor(constant_pad_nd, 0, 0, 9223372036854775807);  constant_pad_nd = None
        slice_2: "f32[1, 1, s0 + 4, s0]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 9223372036854775807);  slice_1 = None
        index: "f32[1, 1, 3, s0 + 2, 1, s0]" = torch.ops.aten.index.Tensor(slice_2, [None, None, unsqueeze_5, add_3]);  slice_2 = unsqueeze_5 = add_3 = None
        permute: "f32[1, 1, 3, 1, s0 + 2, s0]" = torch.ops.aten.permute.default(index, [0, 1, 2, 4, 3, 5]);  index = None
        ...
        view: "f32[1, 3, s0**2 + 2*s0]" = torch.ops.aten.view.default(permute, [1, 3, mul]);  permute = mul = None
        _to_copy_1: "f16[1, 3, s0**2 + 2*s0]" = torch.ops.aten._to_copy.default(view, dtype = torch.float16);  view = None
        return pytree.tree_unflatten([_to_copy_1], self._out_spec)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121363
Approved by: https://github.com/lezcano
2024-03-09 15:37:27 +00:00
71d0202627 [dynamo] support rewriting dist.all_reduce with explicitly specified reduce op (#120181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120181
Approved by: https://github.com/wconstab, https://github.com/awgu
2024-03-09 08:28:22 +00:00
cf9742371c Revert "Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685)"
This reverts commit 752d164b2f0d401042de4a75f36f7e84bae91daa.

Reverted https://github.com/pytorch/pytorch/pull/119685 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is crashing on ROCm 752d164b2f ([comment](https://github.com/pytorch/pytorch/pull/119685#issuecomment-1986773384))
2024-03-09 07:20:53 +00:00
761783a4ff [profiler] Fix recorded profiler step number (#121127)
Fixes [121126](https://github.com/pytorch/pytorch/issues/121126)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121127
Approved by: https://github.com/briancoutinho
2024-03-09 06:54:51 +00:00
242e03ba86 [dtensor] add async_op option to redistribute and some refactor (#121477)
The async output option was only available in the `full_tensor()` call, but I think it's
generally good to make this option available in the `redistribute` call directly
so that the user can control it.

This PR adds an async_op option to the redistribute call, to allow the user to control
whether to perform tensor redistribution asynchronously or not.

By default we set this to False, to follow the semantics of the c10d
collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121477
Approved by: https://github.com/wz337
2024-03-09 06:17:23 +00:00
a6a67da333 [quant] Add error check for input_edge annotation (#121536)
Summary:
Raises an error when an input edge contains non-Node elements (e.g., constant values) in the annotation.

Test Plan:
python test/test_quantization.py -k test_input_edge_sanity_check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121536
Approved by: https://github.com/andrewor14
2024-03-09 06:13:04 +00:00
e8836759d0 [export] Add effect token to export (#121424)
Following the creation of effect tokens (https://github.com/pytorch/pytorch/pull/120296), we now want to add support for these tokens in export, because the calling/returning convention has changed. The inputs are now `(tokens, params, buffers, constants, user_inputs)` and the outputs are `(tokens, buffer_mutations, user_mutations, user_outputs)`. The graph looks something like:
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %attr : [num_users=2] = placeholder[target=attr]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %with_effects : [num_users=2] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%arg0_1, _TorchScriptTesting.takes_foo.default, %attr, %arg1_1), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 0), kwargs = {})
    %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 1), kwargs = {})
    %with_effects_1 : [num_users=2] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%getitem, _TorchScriptTesting.takes_foo.default, %attr, %getitem_1), kwargs = {})
    %getitem_2 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects_1, 0), kwargs = {})
    %getitem_3 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects_1, 1), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %getitem_3), kwargs = {})
    return (getitem_2, add)
```

During unlifting, we will first remove the tokens and with_effects calls using the `remove_effect_tokens` pass (cc @SherlockNoMad on the pass to remove tokens). This ensures the calling conventions do not change when retracing. The graph after unlifting looks something like:
```
graph():
    %attr_1 : [num_users=2] = get_attr[target=attr]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %takes_foo_default_1 : [num_users=1] = call_function[target=torch.ops._TorchScriptTesting.takes_foo.default](args = (%attr_1, %arg1_1), kwargs = {})
    %takes_foo_default : [num_users=1] = call_function[target=torch.ops._TorchScriptTesting.takes_foo.default](args = (%attr_1, %takes_foo_default_1), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %takes_foo_default), kwargs = {})
    return (add,)
```

Serialization support will be added in a followup.
Note: tokens only affect custom ops that take in ScriptObjects, not ScriptObject methods yet.

Differential Revision: [D54639390](https://our.internmc.facebook.com/intern/diff/D54639390)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121424
Approved by: https://github.com/tugsbayasgalan
2024-03-09 02:43:26 +00:00
eb3919944d [C10d][NCCL] Refactor complex all_reduce and broadcast (#121045)
This PR is necessary because the autograd engine + DDP call `all_reduce` from C++, so the changes must be made in C++.

```
[rank0]: Traceback (most recent call last):
[rank0]:   File "~/complex_ddp.py", line 72, in <module>
[rank0]:     main()
[rank0]:   File "~/complex_ddp.py", line 64, in main
[rank0]:     loss.backward()
[rank0]:   File "/home/usr/pytorch/torch/_tensor.py", line 525, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/home/usr/pytorch/torch/autograd/__init__.py", line 267, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/home/usr/pytorch/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: TypeError: Input tensor data type is not supported for NCCL process group: ComplexFloat
```

I believe the same could be done for the rest of the ops to minimize the Python overhead; what do you think @kwen2501?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121045
Approved by: https://github.com/eqy, https://github.com/kwen2501
2024-03-09 02:00:54 +00:00
752d164b2f Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119685
Approved by: https://github.com/cpuhrsch
2024-03-09 02:00:50 +00:00
13a25c647f [export] improve binary op fast path broadcast check (#121546)
# Context
I believe we have an incorrect guard being created during FakeTensor's binary op fast path.

Consider this case
```
# op.shape: (10, 192); final_shape: (s0, 10, 192)
# Guard Ne(s0, 10) is created when we create SymBool(10 == s0)
if isinstance(op, torch.Tensor) and op.shape == final_shape:
    break
```

As of right now, `op.shape == final_shape` checks whether one of the binary op's operands has the same shape as the binary op's output.
* If one of them is a dynamic shape, then we'll create a guard via `SymBool` creation (i.e. `s0 == 10`).
* If the `SymBool` expr resolves to `false`, then we'll create the guard `Ne(s0, 10)`.

This is a problem when the # of dimensions aren't the same between `op.shape` & `final_shape`. Take the case above for example, `op.shape: (10, 192); final_shape: (s0, 10, 192)`. Although the shapes aren't the same, that doesn't necessarily mean that `s0 != 10`.
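
A minimal sketch of the direction such a fix could take (illustrative only, not the actual PR diff): gate the fast-path shape comparison on the ranks matching first, so a rank mismatch can never create a spurious guard.

```python
import torch

def fast_path_shape_match(op, final_shape) -> bool:
    # Illustrative sketch only: treat the fast path as applicable only when the
    # ranks match, so comparing (10, 192) against (s0, 10, 192) never creates a
    # spurious guard such as Ne(s0, 10).
    return (
        isinstance(op, torch.Tensor)
        and op.dim() == len(final_shape)  # check rank first
        and op.shape == final_shape       # only then compare shapes elementwise
    )
```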

Some thoughts (feel free to ignore). What if the numbers of dimensions are equal but one of the shapes has symbols? Here are three cases:
  1. `op.shape: (9000, 10, 192); final_shape: (s0, 10, 192)` -- not broadcastable.
  2. `op.shape: (1, 10, 192); final_shape: (s0, 10, 192)` -- 0/1 specialization wins?
  3. `op.shape: (100, 10, 192); final_shape: (s0, 10, 192) where s0 = 100` -- Ask user to mark `s0` as a constant.

# Test
```
$ TORCHDYNAMO_VERBOSE=1 PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_dynamic_shapes.py -k test_export_fast_binary_broadcast_check_dynamic_shapes

torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (dim0)! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of dim0 = L['a'].size()[0] in the specified range 3 <= dim0 <= 1024 satisfy the generated guard Ne(L['a'].size()[0], 3).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121546
Approved by: https://github.com/aakhundov
2024-03-09 01:49:42 +00:00
d482614fec [DCP] Makes fsspec public (#121508)
Fixes #118033

Also removes the `_checkpointer.py` class.
Original PRs:
- https://github.com/pytorch/pytorch/pull/121330
- https://github.com/pytorch/pytorch/pull/121329

We're also disabling `test_fsdp` since it is failing on random PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121508
Approved by: https://github.com/fegin
2024-03-09 01:14:18 +00:00
6791b0c09e Change default torch_function behavior to be disabled when torch_dispatch is defined (take 2) (#120632)
This does not introduce a new test but is tested by checking that all the classes we already have still behave as before now that they don't explicitly disable torch_function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120632
Approved by: https://github.com/ezyang
2024-03-09 01:08:37 +00:00
ca9678405a [CUDA graphs] Pool argument for make_graphed_callables (#121475)
It is just a nice feature to have for situations where users want multiple graph captures and/or graphed callables to share the same memory pool (a usage sketch follows below).
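
A hedged usage sketch of what this enables; the `pool=` keyword is the feature described here, and the exact call pattern below is an assumption rather than documented behavior.

```python
import torch

# Two modules whose captured graphs should share one memory pool.
mod_a = torch.nn.Linear(16, 16).cuda()
mod_b = torch.nn.Linear(16, 16).cuda()
sample = torch.randn(8, 16, device="cuda")

pool = torch.cuda.graph_pool_handle()  # shared capture pool
graphed_a = torch.cuda.make_graphed_callables(mod_a, (sample,), pool=pool)
graphed_b = torch.cuda.make_graphed_callables(mod_b, (sample,), pool=pool)

out = graphed_b(graphed_a(sample))
```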

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121475
Approved by: https://github.com/eellison, https://github.com/eqy
2024-03-09 00:15:38 +00:00
b2f19dd284 [C10d][UCC] Retain CUDA context in progress_loop (#121446)
UCC requires a CUDA context to be present, while `progress_loop` f61192b014/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp (L333) runs on a side thread that does not have the context present (even though it sets the device).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121446
Approved by: https://github.com/kwen2501
2024-03-09 00:09:47 +00:00
ed8eebd1c2 Changed cublas reproducibility URL (#121534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121534
Approved by: https://github.com/Skylion007
2024-03-08 23:46:21 +00:00
b0a0850a5c [DCP] Replaced storage() with untyped_storage() (#121538)
Let us try to remove this warning 😄 :
```
[rank0]:/data/users/andgu/pytorch/torch/distributed/checkpoint/filesystem.py:150: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
[rank0]:  if tensor.storage().size() != tensor.numel():
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121538
Approved by: https://github.com/wz337, https://github.com/fegin
2024-03-08 23:46:17 +00:00
8887c95004 [inductor] Skip welford combine on first reduction loop iteration (#121488)
On the first iteration we short circuit `welford_reduce` since we know
the accumulators are filled with the default values.
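
As a generic illustration of the idea (plain Python, not the Inductor codegen): with Welford's parallel combine, the first block can simply be adopted as-is instead of being combined with the default (0, 0, 0) accumulators.

```python
def welford_combine(mean_a, m2_a, w_a, mean_b, m2_b, w_b):
    # Standard parallel Welford combine of two (mean, m2, weight) accumulators.
    w = w_a + w_b
    delta = mean_b - mean_a
    mean = mean_a + delta * (w_b / w)
    m2 = m2_a + m2_b + delta * delta * (w_a * w_b / w)
    return mean, m2, w

def welford_reduce(blocks):
    mean, m2, w = 0.0, 0.0, 0.0
    for i, (mean_b, m2_b, w_b) in enumerate(blocks):
        if i == 0:
            # First iteration: the accumulators still hold the default values,
            # so adopt the incoming block directly instead of combining.
            mean, m2, w = mean_b, m2_b, w_b
        else:
            mean, m2, w = welford_combine(mean, m2, w, mean_b, m2_b, w_b)
    return mean, m2, w
```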

This is split out from #120330 to hopefully avoid the meta-internal failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121488
Approved by: https://github.com/lezcano
2024-03-08 23:40:48 +00:00
fe78cf040b [profiler] add a function to allow adding preset user-defined metadata to traces (#121487)
Summary:
The `add_metadata_json` function in the profiler only works when called during trace collection. However, sometimes we want to pass in some user-defined metadata and amend it to the trace before trace collection starts, e.g. when the profiler is defined.
This PR adds a function `preset_metadata_json` for this purpose (a hedged usage sketch follows below). The preset metadata will be stored and amended to the trace later.
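
A hedged usage sketch; the attachment point and signature of `preset_metadata_json` below are assumed from this description, not verified against the final API.

```python
import torch
from torch.profiler import profile, ProfilerActivity

prof = profile(activities=[ProfilerActivity.CPU])
# Assumed API from this PR: stash user metadata before collection starts;
# it is amended to the trace later.
prof.preset_metadata_json("run_config", '{"model": "toy", "batch_size": 8}')
with prof:
    torch.randn(128, 128) @ torch.randn(128, 128)
# add_metadata_json, by contrast, only works while collection is active.
```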

Differential Revision: D54678790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121487
Approved by: https://github.com/aaronenyeshi
2024-03-08 23:18:48 +00:00
9eb8fae02d Revert "Fix round robin sharding (#121022)"
This reverts commit effdea5fc62c6bf13cb8035f7bfcc205f05a8b6a.

Reverted https://github.com/pytorch/pytorch/pull/121022 on behalf of https://github.com/clee2000 due to made sharding really uneven ([comment](https://github.com/pytorch/pytorch/pull/121022#issuecomment-1986552662))
2024-03-08 23:16:24 +00:00
bc02fca358 [dtensor] to_local backward grad placement passthrough (#121474)
to_local accepts a `grad_placements` argument if the user chooses to pass one; previously
we enforced that the grad_out had the "same" placement as the current
DTensor for safety.

But I realized that we DO NOT need to enforce this constraint. Why?
The backward placement does not need to be the same as the forward tensor placement. This
is already the case for param vs param.grad (i.e. the param can be replicate
while the grad is partial), so we should not restrict activations vs activation
grads this way either.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121474
Approved by: https://github.com/awgu, https://github.com/yoyoyocmu, https://github.com/yifuwang
2024-03-08 23:11:49 +00:00
9373ad0bb8 Switch cudagraph backend to cudagraph trees (#121019)
Switch torch.compile(..., backend="cudagraphs") to use cudagraph trees. Enabled a few tests in cudagraph_trees and note that there is another existing test suite for the cudagraphs backend: https://github.com/pytorch/pytorch/blob/main/test/dynamo/test_cudagraphs.py.

This is basically the inductor cudagraphs without inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121019
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #121017, #121018
2024-03-08 22:56:26 +00:00
7b3febdca7 Change assertion throw to error message for const_run_impl call. (#121396)
Summary:
Currently we do not have an easy mechanism to distinguish between models created with some specific config.
We use a warning instead of failing directly.

Test Plan: Messaging change only.

Reviewed By: aakhundov

Differential Revision: D54622522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121396
Approved by: https://github.com/chenyang78
2024-03-08 22:48:43 +00:00
038b2e8780 [c10d] Add complex support for P2P (#121240)
Fixes the following error when `tensor` is a complex tensor:
```
[rank0]:     return pg.send([tensor], dst, tag)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Unconvertible NCCL type ComplexFloat
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121240
Approved by: https://github.com/shuqiangzhang
2024-03-08 22:47:49 +00:00
4af0e634bf Add Cudagraphs disable checking (#121018)
Adds the same cudagraphs disable checking from inductor cudagraph trees to the cudagraphs backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121018
Approved by: https://github.com/ezyang
ghstack dependencies: #121017
2024-03-08 22:47:24 +00:00
7d0ad5c6f0 [FSDP2] Zeroed padded tensor in _apply (#121509)
This PR replaces the `Tensor.resize_` call with explicit zeroing of the padded tensor. Uninitialized padding is not good since it can give false-positive NaNs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121509
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
2024-03-08 22:31:19 +00:00
f2d5e96db4 [export] Add docs for 2.3 release (#121466)
- Added docs about non-strict export
- Added example using derived dims
- Added api docs for ep.run_decompositions() (https://github.com/pytorch/pytorch/issues/119480)
- Tried to include/cover everything in https://docs.google.com/document/d/1kZ_BbB3JnoLbUZleDT6635dHs88ZVYId8jT-yTFgf3A/edit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121466
Approved by: https://github.com/zhxchen17
2024-03-08 22:29:48 +00:00
2c2d6ce515 Revert "CI sanity check test for env vars (#120519)"
This reverts commit f43b9c56c598b3a0f4d8e1d85f1e67b8f273d235.

Reverted https://github.com/pytorch/pytorch/pull/120519 on behalf of https://github.com/clee2000 due to broken on slow d27509c384 https://github.com/pytorch/pytorch/actions/runs/8208843198/job/22453617568 ([comment](https://github.com/pytorch/pytorch/pull/120519#issuecomment-1986480624))
2024-03-08 22:01:35 +00:00
35d3adb4b0 Add ATen Op _chunk_cat and _chunk_cat.out (#121081)
# Motivation

In the backward pass of per-parameter sharding FSDP, each rank performs a reduce-scatter to sync gradients across ranks. A rank chunks each gradient tensor into `world_size` slices along the 0-th dimension and concatenates all slices along the 1st dimension. Gradient tensors are padded before concatenation when tensor.size(0) % world_size != 0.

### Example 1
Consider `world_size=3` and tensors A (2x4), B (3x3), C (1x2):

Input tensors:
```
AAAA   BBB   CC
AAAA   BBB
       BBB
```

Reduce-scatter-copy-in Output:
```
AAAABBBCC
AAAABBB00
0000BBB00
```

### Example 2
Consider `world_size=2` and tensors A (2x4), B (3x3), C(1x2), D(4x2):

Input tensors:
```
AAAA   BBB   CC   DD
AAAA   BBB        DD
       BBB        DD
                  DD
```

Reduce-scatter-copy-in first pad:
```
AAAA   BBB   CC   DD
AAAA   BBB   00   DD
       BBB        DD
       000        DD
```

Then chunk and cat along dim as the output:
```
AAAABBBBBBCCDDDD
AAAABBB00000DDDD
```

The performance of reduce-scatter-copy-in is critical to per-parameter sharding FSDP. However, implementing reduce-scatter-copy-in by composing existing ATen ops involves `cat` and irregular `pad`, leading to redundant data copies and unsatisfactory performance.

# PR
We provide aten native support for reduce-scatter-copy-in, namely `_chunk_cat()`:

```
_chunk_cat(Tensor[] tensors, int dim, int num_chunks) -> Tensor
```

This PR includes the registration of `_chunk_cat` and `_chunk_cat.out`, OpInfo tests, and a basic implementation composing existing ATen ops.
In the next PR, we will add the CUDA implementation. Compared with baselines composing existing ATen ops, the `_chunk_cat()` CUDA implementation improves copy bandwidth from 498 GB/s to 966 GB/s on a production benchmark.

## Requirements on input

1. If the input tensors have different ndims, dim should be non-negative and less than the ndims of every input tensor. If all input tensors have the same ndims, we support both negative and non-negative dim.
2. For wrapped_dim, all tensors should have the same size for dimensions 0,...,wrapped_dim-1. There are no requirements on the (wrapped_dim, ...)-th dimensions.
3. num_chunks is expected to be positive.
4. The input tensor list is expected to be non-empty, and each input tensor should have at least 1 element. (A usage sketch follows after this list.)
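
A hedged usage sketch matching Example 1 above; the op is assumed to be reachable via `torch.ops.aten._chunk_cat` in builds that include this PR.

```python
import torch

A = torch.ones(2, 4)   # "A" block, 2x4
B = torch.ones(3, 3)   # "B" block, 3x3
C = torch.ones(1, 2)   # "C" block, 1x2

# Pad each tensor along dim 0 to a multiple of num_chunks, split it into
# num_chunks pieces, flatten each piece, and concatenate piece-wise.
out = torch.ops.aten._chunk_cat([A, B, C], 0, 3)
print(out.shape)  # expected: torch.Size([3, 9]), matching Example 1 above
```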

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121081
Approved by: https://github.com/albanD
2024-03-08 21:48:12 +00:00
a656e12bf5 Disable test_torch_name_rule_map_updated in code (#120627)
I am getting tired of this test  ;-;

It gets disabled because it's broken, and then gets fixed, but something breaks it while it is disabled, so it's still broken and the infra is not handling it well.

Disable issue is https://github.com/pytorch/pytorch/issues/114831
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120627
Approved by: https://github.com/yanboliang
2024-03-08 21:00:51 +00:00
82bb06334d Update python binding for in-place foreach to return List[Tensor] (#121405)
fixes #104817
taking over #118622

```c++
// _foreach_atan_
static PyObject * THPVariable__foreach_atan_(PyObject* self_, PyObject* args, PyObject* kwargs)
{
  HANDLE_TH_ERRORS
  static PythonArgParser parser({
    "_foreach_atan_(TensorList self)",
  }, /*traceable=*/false);

  ParsedArgs<1> parsed_args;
  auto _r = parser.parse(nullptr, args, kwargs, parsed_args);
  if(_r.has_torch_function()) {
    return handle_torch_function(_r, nullptr, args, kwargs, THPVariableFunctionsModule, "torch");
  }
  // aten::_foreach_atan_(Tensor(a!)[] self) -> ()

  // auto dispatch__foreach_atan_ = [](at::TensorList self) -> at::TensorList {
  auto dispatch__foreach_atan_ = [](at::TensorList self) -> void {
    pybind11::gil_scoped_release no_gil;
    at::_foreach_atan_(self);
  };
  dispatch__foreach_atan_(_r.tensorlist(0));
  PyObject* self_tensorlist = _r.args[0];
  Py_INCREF(self_tensorlist);
  return self_tensorlist;
  Py_RETURN_NONE;
  END_HANDLE_TH_ERRORS
}
...
// _foreach_div_
static PyObject * THPVariable__foreach_div_(PyObject* self_, PyObject* args, PyObject* kwargs)
{
  HANDLE_TH_ERRORS
  static PythonArgParser parser({
    "_foreach_div_(TensorList self, ScalarList scalars)",
    "_foreach_div_(TensorList self, Tensor other)",
    "_foreach_div_(TensorList self, TensorList other)",
    "_foreach_div_(TensorList self, Scalar scalar)",
  }, /*traceable=*/false);

  ParsedArgs<2> parsed_args;
  auto _r = parser.parse(nullptr, args, kwargs, parsed_args);
  if(_r.has_torch_function()) {
    return handle_torch_function(_r, nullptr, args, kwargs, THPVariableFunctionsModule, "torch");
  }
  switch (_r.idx) {
    case 0: {
      // aten::_foreach_div_.ScalarList(Tensor(a!)[] self, Scalar[] scalars) -> ()

      // auto dispatch__foreach_div_ = [](at::TensorList self, at::ArrayRef<at::Scalar> scalars) -> at::TensorList {
      auto dispatch__foreach_div_ = [](at::TensorList self, at::ArrayRef<at::Scalar> scalars) -> void {
        pybind11::gil_scoped_release no_gil;
        at::_foreach_div_(self, scalars);
      };
      dispatch__foreach_div_(_r.tensorlist(0), _r.scalarlist(1));
      PyObject* self_tensorlist = _r.args[0];
      Py_INCREF(self_tensorlist);
      return self_tensorlist;
    }
    case 1: {
      // aten::_foreach_div_.Tensor(Tensor(a!)[] self, Tensor other) -> ()

      // auto dispatch__foreach_div_ = [](at::TensorList self, const at::Tensor & other) -> at::TensorList {
      auto dispatch__foreach_div_ = [](at::TensorList self, const at::Tensor & other) -> void {
        pybind11::gil_scoped_release no_gil;
        at::_foreach_div_(self, other);
      };
      dispatch__foreach_div_(_r.tensorlist(0), _r.tensor(1));
      PyObject* self_tensorlist = _r.args[0];
      Py_INCREF(self_tensorlist);
      return self_tensorlist;
    }
    case 2: {
      // aten::_foreach_div_.List(Tensor(a!)[] self, Tensor[] other) -> ()

      // auto dispatch__foreach_div_ = [](at::TensorList self, at::TensorList other) -> at::TensorList {
      auto dispatch__foreach_div_ = [](at::TensorList self, at::TensorList other) -> void {
        pybind11::gil_scoped_release no_gil;
        at::_foreach_div_(self, other);
      };
      dispatch__foreach_div_(_r.tensorlist(0), _r.tensorlist(1));
      PyObject* self_tensorlist = _r.args[0];
      Py_INCREF(self_tensorlist);
      return self_tensorlist;
    }
    case 3: {
      // aten::_foreach_div_.Scalar(Tensor(a!)[] self, Scalar scalar) -> ()

      // auto dispatch__foreach_div_ = [](at::TensorList self, const at::Scalar & scalar) -> at::TensorList {
      auto dispatch__foreach_div_ = [](at::TensorList self, const at::Scalar & scalar) -> void {
        pybind11::gil_scoped_release no_gil;
        at::_foreach_div_(self, scalar);
      };
      dispatch__foreach_div_(_r.tensorlist(0), _r.scalar(1));
      PyObject* self_tensorlist = _r.args[0];
      Py_INCREF(self_tensorlist);
      return self_tensorlist;
    }
  }
  Py_RETURN_NONE;
  END_HANDLE_TH_ERRORS
}
```
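
As a small behavioral sketch of what the change means at the Python level (the return value follows from the generated binding above returning `_r.args[0]`):

```python
import torch

ts = [torch.randn(3) for _ in range(2)]
out = torch._foreach_atan_(ts)  # the in-place op now returns the input TensorList
print(out is ts)                # True: the same Python list object is handed back
```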
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121405
Approved by: https://github.com/soulitzer
2024-03-08 21:00:01 +00:00
d27509c384 [compiled autograd] support custom ops backed by c++ autograd::Function (#120681)
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports

limitations:
- Requires the user to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, and maybe add compiled_args + apply_with_saved implementations. This was the only way I could think of for soundness.
- Will throw if we can't hash the saved_data, i.e. for any non-implemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- Can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function that nevertheless has an identical implementation is called. This case seems extremely unlikely, and the only alternative to hash collision I can think of is compiling with reflection.
- Tensors not saved via save_variables are not lifted, and are specialized on the TensorImpl*'s hash (treated as a memory address). If needed, we can lift them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
2024-03-08 20:43:29 +00:00
f43b9c56c5 CI sanity check test for env vars (#120519)
Make a test that fails on purpose to trigger retries.  Check the opposite of success (that env vars exist)

It's a bit hacky because I want it to fail in the normal flow in order to trigger reruns, but I don't want to expose the failures to users since it's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
2024-03-08 20:28:50 +00:00
75bb049d38 Skip AOT Inductor test_cond_* tests on ROCm (#121522)
Summary: The newly added tests in https://github.com/pytorch/pytorch/pull/121120 are failing in the `ciflow/periodic` jobs. Here we skip those on ROCm to avoid the need to disable those tests manually on ROCm.

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_cond_nested
...
----------------------------------------------------------------------
Ran 6 tests in 72.122s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121522
Approved by: https://github.com/huydhn, https://github.com/malfet
ghstack dependencies: #121120
2024-03-08 20:13:55 +00:00
53d5276d69 Improve Dynamo support for torch function and class methods in general (#121365)
I was originally trying to solve https://github.com/pytorch/pytorch/issues/120799 but got sidetracked along the way.
This PR contains a couple fixes. Let me know if you want me to split them up!

- Properly handle invalid user code when "super()" is called from non-method/classmethod. It will now properly raise the same error as CPython
- Fix base VariableTracker `__str__` method shadowing all `__repr__` methods defined in subclasses
- Fix accessing a classmethod on a user object to bind "cls" and not "self"
- Fix custom class handling of super() call to properly handle mixed regular/class/static methods

Locally, test_repros.py -k test_batch_norm_act still fails where the generated graph module is:
```
Call using an FX-traced Module, line 8 of the traced Module's generated forward function:
    x = self.forward(l_x_);  self = l_x_ = None
    x_1 = self.L__self___act(x);  x = None
```
note that "self" is being unset on the first line even though it is used on the second one.
For reference, this is the test c268ce4a6d/test/dynamo/test_repros.py (L1368-L1369)
I cannot figure out where the generated forward function comes from though, any hint would be welcome!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121365
Approved by: https://github.com/jansel
2024-03-08 20:03:49 +00:00
c0996866f4 Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit 4305c64fea154ee1ab566e19bd7568753fc30916.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/izaitsevfb due to breaking internal builds(take 3) ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-1986338164))
2024-03-08 20:01:03 +00:00
c78f72d7e7 [c10d] Deprecate torch.distributed.pipeline (#121464)
In favor of PiPPy (Pipeline Parallelism for PyTorch) https://github.com/pytorch/PiPPy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121464
Approved by: https://github.com/wz337, https://github.com/awgu
2024-03-08 19:55:02 +00:00
27a0900946 Revert "[fx] Preserve Fx graph node order in partitioner across runs (#115621)"
This reverts commit 25c74a93cdf67545a4e3e1bedf2dbabbddfc5845.

Reverted https://github.com/pytorch/pytorch/pull/115621 on behalf of https://github.com/izaitsevfb due to depends on #120076, which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/115621#issuecomment-1986324796))
2024-03-08 19:50:57 +00:00
937e89f252 cudagraphs backend refactoring (#121017)
This is just some refactoring; no functional changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121017
Approved by: https://github.com/ezyang
2024-03-08 19:47:41 +00:00
bc117898f1 Revert "Update XLA pin (#121501)"
This reverts commit 9d83f9dc0e4535f6535389201bc3c4a37f3305e3.

Reverted https://github.com/pytorch/pytorch/pull/121501 on behalf of https://github.com/malfet due to We are trying to revert underlying change first ([comment](https://github.com/pytorch/pytorch/pull/121501#issuecomment-1986289409))
2024-03-08 19:34:44 +00:00
22cd2658b4 Disable GroupRegistry's thread isolation by default (#121457)
Today `GroupRegistry` employs thread isolation by default, i.e. every thread sees its own process group registry. This is intended to work for the one-device-per-process case (for python use cases) and the one-device-per-thread case (for custom native runtimes).

However, there's a problem: there are python use cases that initialize/register process groups in one thread and run collectives in another thread. This use case should be supported. But since `GroupRegistry` employs thread isolation by default, collectives in different threads can't find the registered process groups.

This PR fixes the issue by:
- Make `GroupRegistry` work in non-thread isolation mode by default. This would match the behavior w/o the native process group registry.
- Introduces `set_thread_isolation_mode` so one-device-per-thread runtimes can enable thread isolation mode explicitly.

Differential Revision: [D54658515](https://our.internmc.facebook.com/intern/diff/D54658515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121457
Approved by: https://github.com/wanchaol
2024-03-08 19:31:24 +00:00
2c9c57c061 Only profiling when it's enabled. (#121404)
Summary:
The profiling, even when disabled, takes up about 1.5% CPU for a model I'm looking into.

This patch just splits the code path into with-profiling and without-profiling runs.

The potential downside is that now the script can't enable profiling from within itself. That doesn't seem to be used anywhere. If that's a crucial use case, we can do something about it, but ideally we wouldn't.

Test Plan:
Link with profiles:
https://fburl.com/scuba/strobelight_services/ihxsl7pj

```
buck2 run fbcode//caffe2/test/cpp/jit:jit
```

Reviewed By: zhxchen17

Differential Revision: D54066589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121404
Approved by: https://github.com/zhxchen17
2024-03-08 19:23:14 +00:00
df06b94778 Add complex support to parametrizations.spectral_norm (#121452)
Fixes https://github.com/pytorch/pytorch/issues/121091

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121452
Approved by: https://github.com/ezyang, https://github.com/peterbell10
2024-03-08 19:17:20 +00:00
0f3f4f5534 Revert "[nit][DCP][DSD] Remove Unused Variables in test_state_dict.py (#121204)"
This reverts commit 4186c365313e909dfc8574c4469e5015439c2924.

Reverted https://github.com/pytorch/pytorch/pull/121204 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but the failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/121204#issuecomment-1986252526))
2024-03-08 19:08:50 +00:00
d55d803812 Add operator length hint support (#121495)
Seemed like an easy operator to squeeze in for the 2.3 release. Added a simple test. Partially addresses #116396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121495
Approved by: https://github.com/albanD
2024-03-08 19:08:33 +00:00
9b03a06288 [BE] [MPS] Fix out resize logic in torch.where (#121476)
By deleting `where_mps`  and registering MPS dispatch for `where_kernel`.
As result of this change resizing and type-checking logic is shared between MPS, CPU and  CUDA backends.

Add a test case to `TestMPS.test_where` (that should eventually be removed when `out` OpInfo testing is enabled for MPS).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121476
Approved by: https://github.com/albanD, https://github.com/Skylion007
ghstack dependencies: #121473, #121494
2024-03-08 18:59:37 +00:00
9cc89970a9 [BE] Cleanup where_self_out (#121494)
- Avoid extra assignments by using a ternary instead of if-else
- Do not call type-cast unless it is needed (in most cases only one of the two arguments will need to be cast)
- Avoid an extra assignment for condition_ by calling `cast` under the `if` condition
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121494
Approved by: https://github.com/albanD, https://github.com/Skylion007
ghstack dependencies: #121473
2024-03-08 18:59:37 +00:00
1866ee6735 Enable out OpInfo testing for torch.where (#121473)
And fix behavior discrepancy between CPU and CUDA by raising an error when `out.dtype` is unexpected

Fixes https://github.com/pytorch/pytorch/issues/121397
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121473
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-03-08 18:59:37 +00:00
0dd21c0c34 Update Quantizable LSTM to support QAT (#121448)
Summary: Title.

Test Plan:
* CI
* N3684627

Differential Revision: D54653542

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121448
Approved by: https://github.com/andrewor14
2024-03-08 18:55:50 +00:00
b52e0bf131 Deprecate torch.autograd.function.traceable, is_traceable (#121413)
- There are no usages of this internally.
- There are very few usages of this in OSS (most of these are forks of old
repositories).
- This flag doesn't do anything.

We're deprecating it to prevent confusion. I will delete it immediately
after the branch cut.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121413
Approved by: https://github.com/albanD, https://github.com/soulitzer
2024-03-08 18:41:07 +00:00
08460f4bae [tp] remove deprecated tp_mesh_dim arg (#121432)
This PR removes the deprecated tp_mesh_dim arg to prepare for the release.
As this arg has been deprecated for a while (by throwing deprecation
messages), we should remove it before the release.

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121432
Approved by: https://github.com/wz337
ghstack dependencies: #121431
2024-03-08 17:46:44 +00:00
30982ce072 [tp] doc fixes (#121431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121431
Approved by: https://github.com/wz337
2024-03-08 17:46:44 +00:00
effdea5fc6 Fix round robin sharding (#121022)
Fix round robin sharding when there are no test times and sort_by_time=False

Adds more tests to test_test_selections for sort_by_time=False
Adds more checks to test_split_shards_random for serial/parallel ordering + ordering of tests
Refactoring of dup code

Tested locally by running `python test/run_test.py --shard 3 5` with no test times downloaded and checked that it wasn't an empty list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121022
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-03-08 17:01:34 +00:00
9d83f9dc0e Update XLA pin (#121501)
To 8078b8f38c

Fixes regression caused by https://github.com/pytorch/pytorch/pull/120076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121501
Approved by: https://github.com/Skylion007, https://github.com/aakhundov, https://github.com/albanD
2024-03-08 16:53:10 +00:00
a2a8c1fda0 [AOTDispatch] Return mutated inputs directly when keeping mutations (#120514)
Fixes #120242

The example from the issue now results in the graph
```python
def forward(self, arg0_1, arg1_1):
    sin = torch.ops.aten.sin.default(arg0_1);  arg0_1 = None
    copy_ = torch.ops.aten.copy_.default(arg1_1, sin);  arg1_1 = sin = None
    return (copy_,)
```

and the corresponding inductor kernel eliminates the intermediate buffer
completely

```python
def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    assert_size_stride(arg0_1, (5, ), (1, ))
    assert_size_stride(arg1_1, (5, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        # Source Nodes: [sin], Original ATen: [aten.sin]
        stream0 = get_raw_stream(0)
        triton_poi_fused_sin_0.run(arg0_1, arg1_1, 5, grid=grid(5), stream=stream0)
        del arg0_1
    return (arg1_1, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120514
Approved by: https://github.com/ezyang, https://github.com/oulgen, https://github.com/lezcano
2024-03-08 16:33:26 +00:00
f7ec984b1b [DTensor][XLA] support XLA backend in distribute_module API (#121355)
Addresses #92909  cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121355
Approved by: https://github.com/wanchaol
2024-03-08 15:47:33 +00:00
7b4f70eda5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-08 15:07:15 +00:00
c253d1c1db Add links to _ex variants in all linalg functions that support them (#121451)
Fixes https://github.com/pytorch/pytorch/issues/96632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121451
Approved by: https://github.com/ezyang
2024-03-08 12:19:16 +00:00
975d428425 [Quant] Add the operator of decomposed fake quant per channel (#121297)
**Summary**
Add the operator of `quantized_decomposed.fake_quant_per_channel` and test the forward and backward of this op with comparing to ATen.

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_decomposed_fake_quant_per_channel
```

**Next Step**
Optimize the performance: based on the generated code for the forward and backward graphs, the code is not vectorized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121297
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
2024-03-08 10:51:37 +00:00
8ed0932172 Update link to OpenVINO backend in torch.compiler.rst (#121303)
This is a permalink, so it will remain active regardless of documentation version changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121303
Approved by: https://github.com/soulitzer
2024-03-08 08:17:13 +00:00
b3f24b57fb fix accidental specialization with faketensor input checks (#121460)
Summary: When fake tensors are passed to a graph module and we do runtime assertions on them, we can accidentally trigger specialization guards. It's better to just relax the checking for these.

Test Plan: confirmed that problem in T181400371 is now fixed

Differential Revision: D54658960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121460
Approved by: https://github.com/angelayi
2024-03-08 08:02:37 +00:00
2e789ad522 [DCP][state_dict][doc] Update the distributed state_dict document (#121290)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121290
Approved by: https://github.com/LucasLLC
ghstack dependencies: #121273, #121276
2024-03-08 07:58:18 +00:00
e628f2cc66 suggested fixes for congruences (#121418)
Differential Revision: D54636152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121418
Approved by: https://github.com/zhxchen17
2024-03-08 07:19:51 +00:00
96ed37ac13 [DCP] Makes async_save public (#121325)
Makes async_save public

Differential Revision: [D54593610](https://our.internmc.facebook.com/intern/diff/D54593610/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121325
Approved by: https://github.com/wz337
ghstack dependencies: #121317
2024-03-08 05:13:13 +00:00
13366a101a [DCP][state_dict][doc] Fix the documents for distributed_state_dict (#121276)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121276
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #121273
2024-03-08 03:29:47 +00:00
72dd9b2430 [inductor] Make some improvements to FX graph caching (#117888)
Summary: This is in preparation to enable FX graph caching by default. First fix some bugs uncovered by running all unit tests under `test/inductor/`. I'll enable in a separate diff in case we need to revert. Summary of changes:
* Turn off caching for tests that require a compilation, e.g., when checking that a relevant counter was incremented
* Bypass caching when we see mkldnn tensors as constants (they currently don't serialize, so we can't save to disk)
* Include various global settings that could affect compilation in the cache key calculation.
* Handle a few config settings that break key calculation.
* Handle code paths where no ShapeEnv is available (the cache impl requires a shape env as part of handling guards)
* Skip caching when freezing is enabled (Freezing can embed constants that wouldn't be static across runs).
* Fix the clear() method to not throw when the cache /tmp dir doesn't exist

Test Plan: Ran all tests under `test/inductor/` twice with TORCHINDUCTOR_FX_GRAPH_CACHE=1 to exercise any test that might be affected by caching.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117888
Approved by: https://github.com/eellison
2024-03-08 02:30:49 +00:00
909d73d8cb [DCP] Removes no_dist and coordinator_rank from public DCP API's (#121317)
[DCP] Removes `no_dist` and `coordinator_rank` from public DCP API's

Differential Revision: [D54591181](https://our.internmc.facebook.com/intern/diff/D54591181/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121317
Approved by: https://github.com/fegin
2024-03-08 02:14:12 +00:00
23ac0cd561 more passing dynamo tests (#121378)
These are just tests that I noticed passed on current main

Running:
```
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/dynamo/test_dynamic_shapes.py test/dynamo/test_compile.py -k 'test_export_decomp_dynamic_shapes or test_export_dynamic_dim_cleanup_dynamic_shapes or test_export_multi_dynamic_dim_constraint_dynamic_shapes or test_export_multi_dynamic_dim_unsafe_relationship_dynamic_shapes or test_export_no_raise_dynamic_shapes or test_export_preserve_constraints_as_metadata_scalar_dynamic_shapes or test_export_raise_on_relationship_dynamic_shapes or test_exported_graph_serialization_dynamic_shapes  or test_retracibility_dict_container_inp_out_dynamic_shapes or test_retracibility_nested_list_out_dynamic_shapes or test_exception_table_e2e_2_dynamic_shapes or test_exception_table_e2e_dynamic_shapes or test_exception_table_parsing_dynamic_shapes or test_inference_mode_dynamic_shapes or test_inplace_view_on_graph_input_dynamic_shapes or test_numpy_torch_operators_dynamic_shapes or test_py311_jump_offset_dynamic_shapes or test_lazy_module_no_cls_to_become_dynamic_shapes or test_batchnorm_e2e_dynamic_shapes or test_functools_wraps_dynamic_shapes or test_jit_trace_errors_dynamic_shapes or test_multi_import_dynamic_shapes or test_requires_grad_guards_with_grad_mode2_dynamic_shapese or test_dynamo_signatures'
```
BEFORE: `1 failed, 1 passed, 22 skipped, 1372 deselected`
AFTER: `24 passed, 1372 deselected`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121378
Approved by: https://github.com/oulgen
2024-03-08 01:59:01 +00:00
4186c36531 [nit][DCP][DSD] Remove Unused Variables in test_state_dict.py (#121204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121204
Approved by: https://github.com/Skylion007
2024-03-08 01:54:25 +00:00
0f8c9acc29 Revert "[fake_impls] Fix seed/offset device for attention kernels (#120839)" (#121447)
This reverts commit df3c8b8390bc601072b0ee9b2c39e07adf370fe2.

It regressed cudagraphs+PT2 performance on SDPA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121447
Approved by: https://github.com/Chillee
2024-03-08 01:48:23 +00:00
dc514b967e [dtensor][TP] check funcol calls and improve doc for loss parallel (#121366)
Since CommDebugMode is fixed, we can check that loss parallel is working as expected.

Under loss parallel, the forward computation should invoke 3 all-reduces, and the backward computation should invoke no functional collectives.

Co-authored-by: Wanchao <wanchaol@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121366
Approved by: https://github.com/wanchaol
2024-03-08 01:41:31 +00:00
25c74a93cd [fx] Preserve Fx graph node order in partitioner across runs (#115621)
The partitioner generates a different graph on each recompilation run.
Co-authored-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621
Approved by: https://github.com/ezyang
2024-03-08 01:37:53 +00:00
7dc1ab8989 make dynamo work with _LazyGraphModule.lazy_forward (#121259)
Fix https://github.com/pytorch/pytorch/issues/121198 .

We previously already triggered the real recompilation for LazyGraphModule when it runs through the dynamo context. But people may pass in LazyGraphModule._lazy_forward rather than the LazyGraphModule instance itself. This PR handles that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121259
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-03-08 01:37:39 +00:00
9bff1599b6 [Torch Elastic][Draft] Refactor SubprocessHandler to separate module for easier subclass (#120373)
Summary:
## No Functional Change
- Refactor Subprocess Handler into a separate folder for easier subclassing
- SubprocessHandler
    - added `local_rank_id` in `SubprocessHandler` to make it available as a field in the class
    - pass in `local_rank_id` from subprocess start

Test Plan: No functional changes.

Differential Revision: D54038627

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120373
Approved by: https://github.com/kurman
2024-03-08 01:37:34 +00:00
c86a1ce125 [dynamo][guards-cpp-refactor] Func defaults and kwdefaults accessor (#121338)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121338
Approved by: https://github.com/jansel
ghstack dependencies: #121327
2024-03-08 01:24:00 +00:00
79a04f2df9 [dynamo][guards-cpp-refactor] Permit dict version guard in DictGuardManager (#121327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121327
Approved by: https://github.com/jansel
2024-03-08 01:24:00 +00:00
962c1b4c69 Update XNNPACK revision to fcbf55a (#120583)
Update XNNPACK dependency to revision fcbf55a. This is part of a larger, synchronized update of the dependency version for PyTorch, ExecuTorch, and FB internal targets.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120583
Approved by: https://github.com/mcr229
2024-03-08 01:19:22 +00:00
090616d9a1 [Inductor] Support auto-tuned custom PT ops in abi compatible mode (#120877)
Differential Revision: D54344556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120877
Approved by: https://github.com/aakhundov
2024-03-08 01:16:57 +00:00
04a5d6e8d3 [dynamo][guards] Use lazy variable tracker for func defaults (#121388)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121388
Approved by: https://github.com/jansel
2024-03-08 01:10:46 +00:00
5d8e4126b6 Fixup test_trace_rules (#121351)
Summary:
Fixes
https://www.internalfb.com/intern/testinfra/diagnostics/7599824578133672.281475099376195.1709732674/

(for some reason this test didn't run in OSS)?

Reached out to Yanbo Liang for additional context:
 {F1465435684}

Test Plan:
Local:
https://www.internalfb.com/intern/testinfra/testconsole/testrun/16325548673376150/

Differential Revision: D54605075

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121351
Approved by: https://github.com/malfet, https://github.com/yanboliang
2024-03-08 00:50:45 +00:00
af62a70fab [export] Fix nn_module_stack in retracing (#121423)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1391916691446538/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121423
Approved by: https://github.com/zhxchen17
2024-03-08 00:34:11 +00:00
4f120dc2a6 Clean up mode handling in python dispatcher (#121083)
Things that were bad before this PR:
1. Temporarily unsetting the functional tensor mode and the proxy mode both had duplicate implementations.
2. There are variants of mode-handling private utils that have duplicate implementations (different APIs calling into repeated implementations), so I refactored them.
3. The _push_mode API used to take a dispatch key argument, which is not necessary.
4. There are unused APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121083
Approved by: https://github.com/zou3519
2024-03-08 00:30:34 +00:00
0811f15270 [DCP][state_dict] Let _offload_state_dict_to_cpu to return the companion_obj if it exist. (#121273)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121273
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-03-08 00:24:29 +00:00
f76e541ec7 [BE] NO MORE discrepancy between forloop foreach capturable YAY (#121269)
and I will not let it happen again

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121269
Approved by: https://github.com/albanD
ghstack dependencies: #121260, #121264
2024-03-08 00:00:30 +00:00
9d6c5be781 Add ASGD capturable API for forloop (#121264)
@tfsingh I got to it first--wanted to land this stack and close the gap ASAP.

This PR also fixes a discrepancy between `_init_group` and `__set_state__` because we have the constants live on params' device always.

There are some next steps though:
- ASGD can be made faster by making etas, mus, steps be on CPU when NOT capturable. (I had mistakenly thought foreachifying was faster and so we landed https://github.com/pytorch/pytorch/pull/107857, but it is slower). No one has complained yet though.  ¯\_(ツ)_/¯

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121264
Approved by: https://github.com/albanD
ghstack dependencies: #121260
2024-03-08 00:00:30 +00:00
24821fec26 Add RAdam capturable API for forloop (#121260)
Implementation thanks to @MarouaneMaatouk in https://github.com/pytorch/pytorch/pull/118697, though I've since cleaned it up a lot to save perf on the rect < 5 eager case. It also just looks better now :) Added tests and the cudagraph health check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121260
Approved by: https://github.com/mlazos
2024-03-08 00:00:30 +00:00
b1657beac1 feat: Add min, max ranges to mark_dynamic API (#119737)
Fixes https://github.com/pytorch/pytorch/issues/115137

This PR adds:

- the mark_dynamic API will accept `min`, `max` values to create a bounded constraint on the dim (a short usage sketch follows after this list).
- a test case in test_misc.py which checks that `ConstraintViolationError` is triggered if `torch.compile` gets an input dimension out of bounds.
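
A short usage sketch of the bounded form described above (the keyword names `min`/`max` are taken from this description):

```python
import torch
import torch._dynamo

def f(x):
    return x * 2

x = torch.randn(8, 3)
# Mark dim 0 as dynamic and constrain it to the range [2, 1024]; inputs whose
# dim 0 falls outside this range should trip ConstraintViolationError under torch.compile.
torch._dynamo.mark_dynamic(x, 0, min=2, max=1024)
print(torch.compile(f)(x).shape)
```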

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119737
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-03-07 23:26:03 +00:00
e0c534fe02 Revert "[Inductor] Add support for NEON ISA in the Inductor C++ backend (#105590)"
This reverts commit 156954d6a2a05f3ce8288dd054691102e596e461.

Reverted https://github.com/pytorch/pytorch/pull/105590 on behalf of https://github.com/ezyang due to https://github.com/pytorch/pytorch/issues/121288#issuecomment-1981980699 ([comment](https://github.com/pytorch/pytorch/pull/105590#issuecomment-1984745827))
2024-03-07 23:06:29 +00:00
3d089de851 Add torch.cond support to AOT Inductor (#121120)
Summary: In this PR, `torch.cond` support and the necessary codegening infrastructure is added to C++ wrapper (AOTInductor and friends).

Notable additions:

- A new mechanism in the Python wrapper codegen to precompile and save the Triton kernels (generated and user-defined) which haven't been covered by the active path through the control flow given the sample inputs. As we can't do the runtime autotuning of the kernels outside the active path, we precompile and save them with the `launchers[0]` (corresponding to the first config).

- Codegen infra for `torch.cond` in the C++ wrapper (ABI- and non-ABI-compatible). The `torch.cond` codegen has been slightly refactored to avoid duplication across the Python and C++ wrappers.

- More extensions of the caching sites in the wrapper code to cache per codegened graph (e.g., `codegen_int_array_var`) + some infra for tracking the current codegened graph in the wrapper (both during codegen-ing in the `Scheduler.codegen` and in the `WrapperCodeGen.generate` functions).

- New unit tests to cover the added AOT Inductor + `torch.cond` functionality (a minimal `torch.cond` example follows below).
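
For reference, a minimal `torch.cond` program of the kind these tests exercise (a generic sketch, not taken from the test file):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        def true_fn(x):
            return torch.sin(x)

        def false_fn(x):
            return torch.cos(x)

        # Data-dependent branch: which function runs depends on the input values.
        return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))

ep = torch.export.export(M(), (torch.randn(4),))
print(ep.graph)
```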

Codegen examples from the new unit tests:

- [`test_cond_simple_abi_compatible_cpu`](https://gist.github.com/aakhundov/862d5de9aa460f5df399e1387f7b342e)
- [`test_cond_simple_abi_compatible_cuda`](https://gist.github.com/aakhundov/d70b81f95fa8cc768cedef9acacb25bb)
- [`test_cond_simple_non_abi_compatible_cpu`](https://gist.github.com/aakhundov/c0ae7a8cbb6fa311c838e1b580f9a3f6)
- [`test_cond_simple_non_abi_compatible_cuda`](https://gist.github.com/aakhundov/08b945d4e8a32c97b7f9ff6272f4a223)
- [`test_cond_nested_abi_compatible_cuda`](https://gist.github.com/aakhundov/ce664f433c53e010ce4c0d96a6c13711)
- [`test_cond_with_parameters_abi_compatible_cuda`](https://gist.github.com/aakhundov/77afbeb8eaab5c5b930a3f922a7baf12)
- [`test_cond_with_multiple_outputs_abi_compatible_cuda`](https://gist.github.com/aakhundov/8cc06105ec8a3fe88be09b3f6e32c690)

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 42 tests in 170.619s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121120
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-07 22:39:57 +00:00
26740f853e Remove unnecessary use of ctx.resolve_tools. (#120493)
In this case, it's simpler to use ctx.actions.run(executable = ...), which already ensures that the runfiles associated with the executable are present.

(It's also possible to use ctx.actions.run_shell(tools = ...) with a custom command line, but it's unclear to me that indirecting through the shell is needed here.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120493
Approved by: https://github.com/ezyang
2024-03-07 22:33:17 +00:00
d14d62b7aa [dynamo] add more refleak tests (#120657)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120657
Approved by: https://github.com/jansel
2024-03-07 22:25:43 +00:00
6490441d8f Remove dead get_shape_groups (#120813)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120813
Approved by: https://github.com/albanD
2024-03-07 22:20:30 +00:00
18d574a07a [Inductor] Use indices for constants in triton_meta (#121427)
@bertmaher pointed out that constants are passed with their indices, not their names. Looking at triton source, this appears to be true 392370b303/python/triton/runtime/jit.py (L381-L385)
I'm guessing both indices and names work here, but let's be consistent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121427
Approved by: https://github.com/aakhundov
2024-03-07 21:59:43 +00:00
f61192b014 Fix for Wait kernel lowering in inductor not accepting MultiOutputs from non-collective calls (#121428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121428
Approved by: https://github.com/yifuwang
2024-03-07 21:29:25 +00:00
76f1461892 [export] Serialize union fields with single entry dict. (#121263) (#121337)
Summary:

remove "$type" and "$value" fields, instead only serialize as {type: value} for union fields directly.

bypass-github-export-checks

Test Plan: CI

Differential Revision: D54600943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121337
Approved by: https://github.com/tugsbayasgalan
2024-03-07 21:24:28 +00:00
4c58f2b675 [PyTorch] Use uint32_t for ProcessedNode::num_outputs (#121335)
We already use uint32_t for indexing, and the notion of a single graph node with more than four billion outputs stretches credulity.

Differential Revision: [D54598821](https://our.internmc.facebook.com/intern/diff/D54598821/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121335
Approved by: https://github.com/Skylion007
2024-03-07 21:15:05 +00:00
ea8f6e2e54 Subclass view fake-ification via reified ViewFuncs (#118405)
This PR:
* Uses reified ViewFuncs to swap in fake tensors / symbolic SymInts for view replay during subclass view fake-ification
* Enables automatic dynamic on view bases -> fakeifies according to the resultant symbolic context instead of the old "all-static" approach
* Covers the following view types:
    * subclass -> dense
    * dense -> subclass
    * subclass -> subclass
* Dense -> dense views are handled the old way via an `as_strided()` call, as it's likely there is no view func available

Differential Revision: [D54269082](https://our.internmc.facebook.com/intern/diff/D54269082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118405
Approved by: https://github.com/ezyang
2024-03-07 19:56:16 +00:00
63ec5cd158 TD Heuristic for tests mentioned in PR body, less verbose TD printing (#120621)
Move tests that are mentioned in the PR body or commit message to the front. Also attempt to find any issues/PRs mentioned in the PR body and search those too (e.g., if you link a disable issue and that issue contains the test file that was failing).

looking for: dynamo/test_export_mutations

Also removes some printed information in TD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120621
Approved by: https://github.com/osalpekar
2024-03-07 19:36:11 +00:00
c7a65f58b0 [CI] Script to fetch creds from current AWS session (#121426)
Some implementations, like OpenDAL, do not work with AWS IMDSv2. This script bridges the gap and enables more recent `sccache` releases (which switched from simple-s3 to OpenDAL) to work in the current CI system.

When launched it prints something like:
```
export AWS_ACCESS_KEY_ID=XXXXX
export AWS_SECRET_ACCESS_KEY=YYYY
export AWS_SESSION_TOKEN=ZZZZ
```
which can be `eval`ed, after which sccache can use those credentials.

Validated in https://github.com/pytorch/pytorch/pull/121323
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121426
Approved by: https://github.com/Skylion007
2024-03-07 19:25:54 +00:00
2b1661c7a0 Revert "[compiled autograd] support custom ops backed by c++ autograd::Function (#120681)"
This reverts commit 05c256849b464deee16ccd70152fd54071c6c79c.

Reverted https://github.com/pytorch/pytorch/pull/120681 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D54617701 ([comment](https://github.com/pytorch/pytorch/pull/120681#issuecomment-1984214079))
2024-03-07 18:53:51 +00:00
60aaba4128 create function to get ProcessGroupNCCL uid (#121132)
Summary: expose ProcessGroupNCCL uid

Differential Revision: D54446056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121132
Approved by: https://github.com/aaronenyeshi
2024-03-07 18:34:38 +00:00
83d095c213 [BE] Remove unnecessary requires_cuda in common_optimizers.py (#121249)
@mlazos had already added the needed decorator on the test itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121249
Approved by: https://github.com/Skylion007, https://github.com/mlazos, https://github.com/albanD
ghstack dependencies: #121183
2024-03-07 17:57:02 +00:00
53bdae736d Add capturable single tensor Adamax (#121183)
Finishes the work started in https://github.com/pytorch/pytorch/pull/118697. Thanks @MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler and I've modified the foreach capturable impl--it now calls fewer kernels and is more easily comparable to forloop.

Next steps:
* This PR discovered two bugs: #121178 and #121238.
* Move the now hefty graph optim tests in test_cuda to use OptimInfo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121183
Approved by: https://github.com/albanD
2024-03-07 17:57:02 +00:00
af88425cdc Forward fix lint after 121202 (#121425)
Forward fix after #121202, where the lintrunner job failed due to being unable to checkout the pytorch repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121425
Approved by: https://github.com/ezyang, https://github.com/aakhundov, https://github.com/malfet
2024-03-07 17:20:13 +00:00
suo
c3c15eb9a6 [export] update docs to not export raw functions (#121272)
as title

Differential Revision: [D54555101](https://our.internmc.facebook.com/intern/diff/D54555101/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121272
Approved by: https://github.com/zhxchen17
2024-03-07 17:18:07 +00:00
862b99b571 Revert "[ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)"
This reverts commit 3239f86a3df133b5977d988324639e0de7af8749.

Reverted https://github.com/pytorch/pytorch/pull/120925 on behalf of https://github.com/malfet due to Breaks internal tests, likely due to the increased memory requirements ([comment](https://github.com/pytorch/pytorch/pull/120925#issuecomment-1983875400))
2024-03-07 16:16:07 +00:00
eea37c6db4 [profiler] record nccl version in distributed info (#121044)
Summary: Add an NCCL version field to the distributed info if the backend is NCCL

Differential Revision: D54432888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121044
Approved by: https://github.com/aaronenyeshi
2024-03-07 15:56:02 +00:00
cyy
3aa512cd72 [Clang-tidy header][23/N] Enable clang-tidy coverage on aten/src/ATen/*.{cpp,h} (#121380)
This PR finishes the work begun in https://github.com/pytorch/pytorch/pull/120763 by enabling clang-tidy on aten/src/ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121380
Approved by: https://github.com/Skylion007
2024-03-07 15:11:07 +00:00
9a45001905 [dynamo] relax missing symbols runtime assert (#121339)
Differential Revision: [D54603361](https://our.internmc.facebook.com/intern/diff/D54603361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121339
Approved by: https://github.com/ezyang
2024-03-07 14:53:38 +00:00
0339f1ca82 [Inductor] Allocate another shard for testing cpp-wrapper JIT (#121310)
Summary: The ABI-compatible mode for the cpp wrapper has not been turned on by default, so test it separately. Expect to add more tests to the shard.

Differential Revision: [D54617287](https://our.internmc.facebook.com/intern/diff/D54617287)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121310
Approved by: https://github.com/chenyang78
ghstack dependencies: #121309
2024-03-07 14:24:21 +00:00
7e598c0053 [Inductor] Enable ABI-compatible mode for cpp-wrapper JIT (#121309)
Differential Revision: [D54617284](https://our.internmc.facebook.com/intern/diff/D54617284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121309
Approved by: https://github.com/chenyang78
2024-03-07 14:22:06 +00:00
57fc35a3af [ Inductor ] Shape padding honors output stride preservation (#120797)
This fix makes sure that shape padding honors Inductor's 'keep_output_strides' setting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120797
Approved by: https://github.com/eellison
2024-03-07 13:52:29 +00:00
cyy
4305c64fea Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use const std::optional<Generator>& for the underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-07 09:52:21 +00:00
1ce5049692 [inductor] fix the layout problem for nll_loss2d_backward (#121173)
Fixes https://github.com/pytorch/pytorch/issues/120759 .

The CUDA implementation of nll_loss2d_backward.default requires the 'self' tensor to be contiguous. This implicit assumption may be broken by layout optimizations. The fix here is to add the constraint when we explicitly define the fallback for the op.

Not sure if we can improve the CUDA kernel to relax the constraint, though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121173
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-07 09:05:07 +00:00
b3065f6899 add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
2024-03-07 08:41:43 +00:00
e8e3049f57 [FSDP2] Relaxed check for parent mesh (#121360)
Mixing 1D and 2D `DTensor`s in the same sharded state dict should be okay, so we can remove the check that a parameter for FSDP to shard must be a `DTensor` if passing a child mesh to FSDP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121360
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
ghstack dependencies: #120351, #121328
2024-03-07 08:09:25 +00:00
db36d21f5c Add SDPA pattern for HuggingFace models BF16 (#121202)
### Description

- Add pattern for bf16 input type with fp32 attention mask. (Example model: ElectraForCausalLM)
- Add pattern with batch_size=1 to avoid some clones in graph. (Example model: text-classification+prajjwal1-bert-tiny)

### Newly matched models
Dtype: bf16, machine: SPR

#### Dynamo HuggingFace models

- ElectraForCausalLM (speedup=2.09x)
- ElectraForQuestionAnswering (speedup=4.22x)
- AlbertForQuestionAnswering (speedup=1.36x)
- AlbertForMaskedLM (speedup=1.39x)

#### OOB HuggingFace models

- multiple-choice+google-electra-base-discriminator
- text-classification+prajjwal1-bert-tiny
- text-classification+prajjwal1-bert-mini
- text-classification+google-electra-base-generator
- text-classification+bert-large-cased
- casual-language-modeling+xlm-roberta-base
- text-classification+roberta-base
- text-classification+xlm-roberta-base
- text-classification+albert-base-v2
- token-classification+google-electra-base-generator
- masked-language-modeling+bert-base-cased

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121202
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-07 07:40:00 +00:00
953c6c37cb Wrap remote cache creation with a try-catch (#121340)
Summary: In production I am seeing errors like "AttributeError: module 'triton.runtime' has no attribute 'fb_memcache'", likely due to some package skew. Until this is resolved, let's wrap this code in a try-catch.

Test Plan: CI

Differential Revision: D54604339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121340
Approved by: https://github.com/aakhundov
2024-03-07 07:05:49 +00:00
291ce86a6c Modify StorageImplCreateHelper (#118459)
I want to use tensor.untyped_storage()[a:b] for the ``PrivateUse1`` backend, but it fails. The code goes into ``THPStorage_get``:
bb6eba189f/torch/csrc/Storage.cpp (L525-L540)

Here ``torch`` will create a new ``c10::StorageImpl`` but does not take the ``PrivateUse1`` backend into account.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118459
Approved by: https://github.com/albanD
2024-03-07 06:26:55 +00:00
f848e9c646 [Quant][Inductor] Fix q/dq per channel lowering with 64-bit qparams (#120984)
Fixes #120869

Fix lowering of `quantize_per_channel` and `dequantize_per_channel` with float64 scale and int64 zero point.
The generated code is incorrect without explicit type conversion. Add type conversions to the lowering pass, i.e., float64 (double) -> float32 and int64 -> int32.

**Test plan**
python test/inductor/test_cpu_repro.py -k test_per_channel_fake_quant_module_uint8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120984
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-07 06:23:52 +00:00
4f9d4e1ab0 [DTensor][XLA] refactor DTensor _xla API (#113214)
In response to the change pytorch/xla#5776 and #92909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113214
Approved by: https://github.com/wanchaol
2024-03-07 06:18:05 +00:00
cyy
c723514ef4 [CUDACachingAllocator] Simplify update_stat and avoid casts (#120964)
update_stat in CUDACachingAllocator.cpp was split into increase and decrease functions in this PR to simplify the implementation and avoid type casts throughout the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120964
Approved by: https://github.com/albanD
2024-03-07 05:55:38 +00:00
55232c4e1c Make CausalBias a torch.Tensor subclass again (#121358)
# Summary
This was removed in #116071 in order to enable compile support and re-adding this seems to still work with compile
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121358
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-03-07 05:20:47 +00:00
df2ad1fecc [dtensor][debug] have visualize_sharding correctly print for sub-mesh DTensor (#121216)
**Summary**
In `visualize_sharding` we chose to only print on rank 0 (global rank), which means calling `visualize_sharding` will never print anything when the dtensor object's mesh doesn't include rank 0 (i.e. a sub-mesh). This PR has `visualize_sharding` always print on the rank whose mesh coordinate is (0, 0, ..., 0) instead of the rank whose global rank is 0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121216
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179, #120260
2024-03-07 04:50:15 +00:00
77873f6fe5 [dtensor][1/N] add torchrec even row-wise sharding example (#120260)
**Summary**
Our goal is to demonstrate DTensor's capability to represent TorchRec's parameter sharding. Currently this is done with `ShardedTensor`, and theoretically `DTensor` can replace it with minor changes.

This PR serves as a start of this effort by adding an example test that represents TorchRec's `ShardingType.ROW_WISE` using DTensor. Note that this PR only covers the even sharding case.
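
As a companion to the example test, here is a minimal hedged sketch of what an even row-wise shard looks like with DTensor; the table shape, mesh size, and device type are illustrative and assume a 4-rank launch (e.g. via torchrun):

```python
# Minimal row-wise sharding sketch (assumes 4 ranks, e.g. launched with
# `torchrun --standalone --nproc-per-node=4 this_script.py`).
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import Shard, distribute_tensor


def main():
    mesh = init_device_mesh("cpu", (4,))         # 1-D mesh over 4 ranks
    table = torch.arange(16.0).reshape(8, 2)     # embedding-table-like weight
    # Row-wise sharding: split dim 0 evenly, two rows per rank.
    dtable = distribute_tensor(table, mesh, [Shard(0)])
    print(f"rank {dist.get_rank()}: local shard shape {tuple(dtable.to_local().shape)}")


if __name__ == "__main__":
    main()
```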

**Test Run**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120260
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179
2024-03-07 04:50:15 +00:00
9cc0f23e5c [dtensor][debug] allow visualize_sharding to print header (#121179)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121179
Approved by: https://github.com/wanchaol
2024-03-07 04:50:06 +00:00
a2854ae904 Bugfix consume_prefix_in_state_dict_if_present function to keep the order of the state_dict (#117464)
This PR keeps the same key order as the original state_dict, as the issue creator proposed. It also fixes a bug concerning how ``_metadata`` is handled (see below), along with other small changes to properly remove the prefix when it is present.

In the original code, ``_metadata`` was handled as a ``key``.

```
    # also strip the prefix in metadata if any.
    if "_metadata" in state_dict:
```

This is not the case: ``_metadata`` is actually an ``attribute``. Hence, the previous condition is changed to:

```
    # also strip the prefix in metadata if any.
    if hasattr(state_dict, "_metadata"):
```

This PR also includes the necessary test.
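
For context, a minimal usage sketch of the function (the DDP-style "module." prefix is illustrative):

```python
# Strip a "module." prefix in place while keeping the original key order.
import torch
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

model = torch.nn.Linear(2, 2)
state_dict = type(model.state_dict())(
    (f"module.{k}", v) for k, v in model.state_dict().items()
)

consume_prefix_in_state_dict_if_present(state_dict, "module.")
print(list(state_dict.keys()))  # ['weight', 'bias'] -- order preserved
```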

Fixes #106942

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117464
Approved by: https://github.com/mikaylagawarecki
2024-03-07 04:00:49 +00:00
edd80f87b8 Prevent infinite recursion within Tensor.__repr__ (#120206)
`Tensor.__repr__` calls functions which can perform logging, which ends up logging `self` (via `__repr__`), causing an infinite loop. Instead of logging all the args in FakeTensor.dispatch, log the actual parameters (and use `id` to log the tensor itself).

The change to torch/testing/_internal/common_utils.py came up during testing: in some ways of running the test, `parts` was `('test', 'test_testing.py')`, so `i` was 0 and we were doing a join on `()`, which was causing an error.

Repro:
```
import torch
from torch.testing import make_tensor
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode
t = torch.sparse_coo_tensor(((0, 1), (1, 0)), (1, 2), size=(2, 2))
t2 = FakeTensor.from_tensor(t, FakeTensorMode())
print(repr(t2))
```
and run with `TORCH_LOGS=+all`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120206
Approved by: https://github.com/yanboliang, https://github.com/pearu
2024-03-07 02:24:45 +00:00
eb4d87f237 graph break on sparse tensors constructions (#120458)
Fix some tests in https://github.com/pytorch/pytorch/issues/119780
sparse_bsc_tensor is not supported
https://github.com/pytorch/pytorch/pull/117907

Also more about the issue here.
https://docs.google.com/document/d/1EIb4qG88-SjVFn5TloLERliYdxIu2hwYoAA8skjOVfo/edit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120458
Approved by: https://github.com/ezyang
2024-03-07 02:17:41 +00:00
1a28ebffb3 [TP] Introduce Sequence Parallel Style for LayerNorm/RMSNorm/Dropout (#121295)
As titled, this PR introduces a dedicated `ParallelStyle` to shard the nn.LayerNorm/nn.Dropout/RMSNorm layers. We were mainly using manual distribute_module calls before when sharding the RMSNorm layer, but I think we should have a dedicated TP API to easily shard those layers, instead of users manually using DTensors.

I call this SequenceParallel, which might cause some confusion since we technically "deprecated" a SequenceParallel style months ago. But this time the SequenceParallel style is significantly different from the previous one (which used to shard two consecutive Linear layers). I believe getting the name right is the first priority, rather than worrying about reusing the old name.
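
A hedged usage sketch of the new style (the module, mesh size, and device type are illustrative; it assumes a 2-rank launch, e.g. via torchrun):

```python
# Shard an nn.LayerNorm with the new SequenceParallel style instead of a manual
# distribute_module call.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import SequenceParallel, parallelize_module


class Block(nn.Module):
    def __init__(self, dim: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x)


tp_mesh = init_device_mesh("cpu", (2,))
block = parallelize_module(Block(), tp_mesh, {"norm": SequenceParallel()})
```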

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #121294
2024-03-07 02:04:59 +00:00
967dd31621 [cuDNN] Cleanup cuDNN < 8.1 ifdefs (#120862)
Follow-up of #95722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120862
Approved by: https://github.com/Skylion007
2024-03-07 01:46:25 +00:00
b9087f8571 [profiler] Add execution_trace_observer as an optional argument to profiler (#119912)
# Update Profiler API to collect Execution Traces

## TLDR
We would like to simplify collecting Execution Trace and Kineto together. Execution Trace and Kineto both provide meaningful information that can be combined to enable benchmarking, performance analysis and simulating new hardware.
```
import torch

def main():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        …
        execution_trace_observer=ExecutionTraceObserver() # <<<<<<< NEW
    ) as prof:
        ...
        prof.step()
```

See test/profiler/test_profiler.py 'test_execution_trace_with_kineto' for an example of using this API.

## What are Execution Traces?
[Chakra Execution Traces](https://github.com/mlcommons/chakra/wiki) offer a graph based representation of AI/ML workloads.  It stands apart from conventional AI/ML frameworks by focusing on replay benchmarks, simulators, and emulators, prioritizing agile performance modeling and adaptable methodologies.
- Chakra is part of ML Commons industry standard and is being adopted by other companies besides NVIDIA too.
- At Meta we have instrumented PyPer framework to collect Execution Traces. More details on our [PyTorch implementation of Chakra can be found here](https://github.com/mlcommons/chakra/wiki)

Chakra essentially enables benchmarking and co-design for ML models without having to reproduce entire software stacks, and helps companies collaborate [[chakra paper](https://arxiv.org/pdf/2305.14516.pdf)]

## Why correlate Execution Trace with PyTorch/Kineto Trace

Execution Traces and Kineto traces provide different, complementary types of information. While PyTorch ETs focus on CPU operators with explicit dependencies between them, Kineto traces encode GPU operators with their start and end times. In addition, collecting them at different timestamps would be inaccurate, as several operations (NCCL, embedding lookup) are data dependent and may not match correctly.
Thus, it makes sense to collect both ET and Kineto together. The problem is that there are two code paths.

## Proposal
The proposal is to modify the PyTorch profiler (Kineto) API to enable execution trace to be collected simultaneously, see TLDR section

# Testing
Updated the unit test for collecting kineto and Execution Trace together.
- Check the collected ET has right range of events.
- Compare two sets of IDs - record func Ids in ET and external IDs in Kineto. We check if these have a constant difference.

```
pytest test/profiler/test_profiler.py  -k test_execution_trace_with_kineto -rP

Running 1 items in this shard

test/profiler/test_profiler.py [W execution_trace_observer.cpp:682] Enabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[W execution_trace_observer.cpp:694] Disabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119912
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2024-03-07 01:30:26 +00:00
eb1145436a [DCP] Adds main in format utils (#120128)
Adds main in format utils. Usage:

`python -m torch.distributed.checkpoint.format_utils dcp_to_torch dcp_dir torch_file.pt`

or

`python -m torch.distributed.checkpoint.format_utils torch_to_dcp torch_file.pt dcp_dir`

Differential Revision: [D53791355](https://our.internmc.facebook.com/intern/diff/D53791355/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120128
Approved by: https://github.com/fegin, https://github.com/wz337
2024-03-07 01:18:17 +00:00
cyy
5cc511f72f Use c10::irange and fix other index types in ForeachReduceOp.cu (#121123)
This PR follows the suggestions in #121066 and changes most loops to c10::irange.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121123
Approved by: https://github.com/soulitzer
2024-03-07 00:11:27 +00:00
c268ce4a6d Make ATen-cpu cuda/rocm agnostic (#121082)
Summary: This specific rocm logic will make aten-cpu code diverge between rocm and cuda. This is not good because we won't be able to share aten-cpu.so between rocm and cuda. More specifically, this will prevent us from building aten-hip by default, which would require us to set up rocm-specific rules, an extra burden for our build system.

Test Plan: sandcastle + oss ci

Differential Revision: D54453492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121082
Approved by: https://github.com/jeffdaily, https://github.com/aaronenyeshi, https://github.com/albanD
2024-03-06 23:51:40 +00:00
e50ded03a6 Use type check for also is_not (#113859)
Handle `is_not` for:

9647a251cb/torch/_dynamo/variables/builtin.py (L1314-L1317)

I noticed https://github.com/pytorch/pytorch/issues/111713 exists; I think there's no harm in landing this first.
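
A small hedged illustration of the kind of identity comparison this covers inside a compiled function (the function itself is illustrative):

```python
# An `is not` check inside a torch.compile'd function; the change lets dynamo
# handle `is_not` with the same logic it already uses for `is`.
import torch


@torch.compile
def fn(x, bias=None):
    if bias is not None:
        x = x + bias
    return x * 2


print(fn(torch.ones(3)))
print(fn(torch.ones(3), torch.ones(3)))
```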

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113859
Approved by: https://github.com/Skylion007
2024-03-06 23:12:42 +00:00
a88356f45c [dtensor] make add_.Tensor/div_.Scalar to be linear pointwise instead (#121294)
add_.Tensor and div_.Scalar should support linearity so that we can delay reducing the partial results.

This fixes the additional collective in the layernorm layer that we have seen.
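
A tiny hedged sketch of why treating div_.Scalar as linear over partial (pending-reduction) values is sound:

```python
# Dividing each partial shard by a scalar and then reducing equals reducing first
# and dividing afterwards, so the reduction (collective) can be delayed.
import torch

partials = [torch.randn(4) for _ in range(3)]  # stand-ins for per-rank partial sums
c = 7.0

reduce_then_div = sum(partials) / c
div_then_reduce = sum(p / c for p in partials)

print(torch.allclose(reduce_then_div, div_then_reduce))  # True (up to fp rounding)
```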

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121294
Approved by: https://github.com/tianyu-l
2024-03-06 22:52:18 +00:00
2f064d895c Switch TORCH_TRACE to accept a directory by default (#121331)
Directory is better because it works smoothly with distributed
runs; otherwise you'd need to modify torchrun to set up distinct
log names for each file.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D54597814](https://our.internmc.facebook.com/intern/diff/D54597814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121331
Approved by: https://github.com/albanD
2024-03-06 22:46:18 +00:00
372f192050 [DTensor] Initialized RNG tracker if needed (#121328)
Since we are already checking if the RNG tracker is initialized, there is no real performance difference between erroring vs. just initializing a default RNG tracker (which we choose to be the `OffsetBasedRNGTracker`).

```
pytest test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121328
Approved by: https://github.com/wanchaol
ghstack dependencies: #120351
2024-03-06 22:21:44 +00:00
b0e2ed4d67 removing some macros (#120314)
Summary: Will be making some changes in the surrounding code; they are going to be easier without macros.

Differential Revision: D54001770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120314
Approved by: https://github.com/zhxchen17
2024-03-06 22:06:05 +00:00
69cedc16c5 Add padding dimension checks and tests (#121298)
Fixes #121093

Previously, calling the following functions with invalid padding dimensions would cause a segmentation fault:
```
torch._C._nn.replication_pad1d, torch._C._nn.replication_pad3d, torch._C._nn.replication_pad3d
```

To fix this, condition checks were added to raise a runtime error with a debug message instead, specifying the required dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121298
Approved by: https://github.com/mikaylagawarecki
2024-03-06 21:55:34 +00:00
d7a5e59647 [dynamo] support group=None when rewriting collectives (#121043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121043
Approved by: https://github.com/awgu
2024-03-06 21:37:19 +00:00
3fee05f242 Triage the remaining fallbacks (#121312)
Building off work from @amjames. There may be some misclassifications; feel free to flag them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121312
Approved by: https://github.com/jansel
2024-03-06 21:23:47 +00:00
e865700f6a [FSDP2] Added initial meta-device init support (#120351)
This PR adds initial support for meta-device initialization for pre-training without loading from a state dict. The idea is to allow `fully_shard(module)` to return and still have sharded parameters on meta device. Then, the user is free to initialize them as they please, e.g. using `to_empty()`.

We override `_apply` to achieve the following:
- Reshard the parameters to ensure that sharded parameters are registered (for correctness) -- we will always need this
- Pad new local tensors and use the padded local tensors (to handle uneven sharding) -- we will remove this once `DTensor` pads its local tensor

We use the `swap_tensors` path in `_apply`. For now, this requires setting `torch.__future__.set_swap_module_params_on_conversion(True)`; however, in the future, this may be enabled by default for wrapper subclasses and will not need any explicit API call. If requiring this call is too intrusive in the short term, we can also call it in `_apply` or when importing `fully_shard`.

```
# Pre-training flow (no checkpoint)
global_mesh = init_device_mesh(..., mesh_dim_names=("dp", "tp"))
dp_mesh, tp_mesh = global_mesh["dp"], global_mesh["tp"]
with torch.device("meta"):
  model = ...
  parallelize_module(model, tp_mesh, ...)
  fully_shard(model, mesh=dp_mesh, ...)
for param in model.parameters():
  assert param.device.type == "meta"

model.to_empty(device="cuda")
random.manual_seed(42, global_mesh)
for module in model.modules():
  if hasattr(module, "reset_parameters"):
    module.reset_parameters()
```

This PR includes some minor changes to allow the user to similarly cast the module to a different dtype after construction time but before forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120351
Approved by: https://github.com/wanchaol
2024-03-06 21:18:25 +00:00
3cf02c5e06 [Dev Container] Fix container build by preventing conda prompt (#121128)
Without this the build will freeze with prompt:
  Proceed ([y]/n)?

I'm using rootless podman in vscode instead of docker, but I think it should not affect this. Or does conda somehow detect Docker but not Podman? Anyway, this should not break anything.

Btw, I also had to uncomment the line: "remoteUser": "root" in devcontainer.json to finish the post installation properly but I guess there might be other workarounds - and perhaps you don't want to run as root if your container has root privileges.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121128
Approved by: https://github.com/drisspg
2024-03-06 20:50:40 +00:00
58ac4a2007 Remove llava from ci_expected_accuracy as it's flaky (#121322)
https://github.com/pytorch/pytorch/pull/121029 added it into the CI but the test is flaky on hud. It alternates between fail_accuracy and fail_to_run

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121322
Approved by: https://github.com/desertfire
2024-03-06 20:47:01 +00:00
23fb37fa41 Revert "[export] Serialize union fields with single entry dict. (#121263)"
This reverts commit 7feabe9b73e6ba7724b62ea91df27049defdf378.

Reverted https://github.com/pytorch/pytorch/pull/121263 on behalf of https://github.com/osalpekar due to A large number of inductor benchmarking jobs failing starting this PR. See for details: 7feabe9b73 ([comment](https://github.com/pytorch/pytorch/pull/121263#issuecomment-1981680049))
2024-03-06 19:58:55 +00:00
76f3663efe Fixed a memory leak when calling from_numpy on a numpy array with an unsupported dtype (#121156)

Fixes #121138.

The lambda function that DECREFs the object is not called when the dtype conversion function throws. This PR moves the conversion before the INCREF, which prevents the memory leak.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121156
Approved by: https://github.com/soulitzer, https://github.com/albanD
2024-03-06 19:37:38 +00:00
360761f7d0 [Torchelasic] Create root log directory by default (#121257)
Summary:
After refactoring in https://github.com/pytorch/pytorch/pull/120691, the default behavior was unintentionally changed from creating a tempdir for logging to the torch Elastic Agent not capturing any logs.

Reverting the behavior to:
- making tempdir when log dir is not specified
- allowing non-empty root log dir
    - Note: in case attempt folder exists, it will be pruned here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/api.py#L294

Differential Revision: D54531851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121257
Approved by: https://github.com/d4l3k
2024-03-06 18:50:38 +00:00
418568d2e3 Add Float8 support to onnx exporter (#121281)
Fixes #106877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121281
Approved by: https://github.com/BowenBao, https://github.com/titaiwangms
2024-03-06 18:46:56 +00:00
cyy
5a2527db22 [Clang-tidy header][22/N] Fix clang-tidy warnings in aten/src/ATEN/*.{cpp,h} (#121102)
This PR continues to fix clang-tidy warnings in aten/src/ATEN/*, following #120763.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121102
Approved by: https://github.com/Skylion007
2024-03-06 18:36:31 +00:00
c5ef4df274 guard on grads being None in compiled optimizers (#121291)
Fixes #115607

We were missing guards when the grads were set to `None`. So if we compiled the optimizer with grads set to their proper values, and then compiled again with the grads set to `None`, we'd continuously run the `None` version, because all of its guards would pass and it was ordered before the correct version in the cache.
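
A hedged repro-style sketch of the scenario (the model and optimizer are illustrative):

```python
# The same compiled step is hit once with populated grads and once with grads set
# to None; the added guards make sure each state gets its own compiled version.
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.Adam(model.parameters())


@torch.compile
def step():
    opt.step()


model(torch.randn(2, 4)).sum().backward()
step()                              # compiled with real .grad tensors
opt.zero_grad(set_to_none=True)
step()                              # .grad is None now; must not reuse the old version
```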

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121291
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-03-06 18:33:23 +00:00
7feabe9b73 [export] Serialize union fields with single entry dict. (#121263)
Summary: remove "$type" and "$value" fields, instead only serialize as {type: value} for union fields directly.

Test Plan: CI

Differential Revision: D54553770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121263
Approved by: https://github.com/tugsbayasgalan
2024-03-06 18:16:16 +00:00
c66d68ba51 [PT2] Add tolist() to FunctionalTensor for torch.export (#121242)
Adding tolist() to FunctionalTensor for torch.export of TorchRec data types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121242
Approved by: https://github.com/ezyang
2024-03-06 18:10:44 +00:00
05c256849b [compiled autograd] support custom ops backed by c++ autograd::Function (#120681)
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports

Limitations:
- Requires the user to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, and possibly add a compiled_args + apply_with_saved implementation. This was the only way I could think of for soundness.
- Will throw if we can't hash the saved_data, i.e. for any unimplemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- Can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function with an identical implementation is called. This case seems extremely unlikely, and the only alternative to hash collisions I can think of is compiling with reflection.
- Tensors not saved via save_variables are not lifted, and are specialized on the TensorImpl*'s hash (treated as a memory address). If needed, we can lift them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
2024-03-06 18:01:56 +00:00
b27d76949b [ROCm] Enable several fake_crossref UTs on ROCm (#121112)
Enabled unit tests:

test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_amp_linalg_norm_subgradients_at_zero_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_linalg_norm_subgradients_at_zero_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_amp_norm_nuc_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_norm_nuc_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_amp_svd_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_svd_cuda_float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121112
Approved by: https://github.com/ezyang
2024-03-06 17:36:47 +00:00
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5b7d4aa412a3988d3d91ca4c5679303.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
8dd4b6a78c Fix venv compatibility issue by updating python_lib_path (#121103)
Referencing sys.executable gives the absolute path of the executable binary for the Python interpreter, which may not be appropriate here. Instead, sys.base_exec_prefix is more suitable, and this change correctly resolves the library when using a venv. I have tested it with a venv created by rye. A small illustration follows the quoted docs below.

https://docs.python.org/3.6/library/sys.html#sys.executable

> A string giving the absolute path of the executable binary for the Python interpreter, on systems where this makes sense. If Python is unable to retrieve the real path to its executable, [sys.executable](https://docs.python.org/3.6/library/sys.html#sys.executable) will be an empty string or None.

https://docs.python.org/3.6/library/sys.html#sys.exec_prefix

> A string giving the site-specific directory prefix where the platform-dependent Python files are installed; by default, this is also '/usr/local'. This can be set at build time with the --exec-prefix argument to the configure script. Specifically, all configuration files (e.g. the pyconfig.h header file) are installed in the directory exec_prefix/lib/pythonX.Y/config, and shared library modules are installed in exec_prefix/lib/pythonX.Y/lib-dynload, where X.Y is the version number of Python, for example 3.2.

https://docs.python.org/3.6/library/sys.html#sys.base_exec_prefix

> Set during Python startup, before site.py is run, to the same value as [exec_prefix](https://docs.python.org/3.6/library/sys.html#sys.exec_prefix). If not running in a [virtual environment](https://docs.python.org/3.6/library/venv.html#venv-def), the values will stay the same; if site.py finds that a virtual environment is in use, the values of [prefix](https://docs.python.org/3.6/library/sys.html#sys.prefix) and [exec_prefix](https://docs.python.org/3.6/library/sys.html#sys.exec_prefix) will be changed to point to the virtual environment, whereas [base_prefix](https://docs.python.org/3.6/library/sys.html#sys.base_prefix) and [base_exec_prefix](https://docs.python.org/3.6/library/sys.html#sys.base_exec_prefix) will remain pointing to the base Python installation (the one which the virtual environment was created from).
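
A small hedged illustration of the difference inside a venv:

```python
# Inside a virtual environment, sys.exec_prefix points at the venv, while
# sys.base_exec_prefix still points at the base installation that ships the
# platform-dependent Python library files.
import sys

print("executable:       ", sys.executable)
print("exec_prefix:      ", sys.exec_prefix)
print("base_exec_prefix: ", sys.base_exec_prefix)
```
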
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121103
Approved by: https://github.com/ezyang
2024-03-06 17:00:46 +00:00
a427d90411 add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU, both `avx512` and `avx2` are supported. It is used to speedup https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-06 16:25:53 +00:00
54d92f2e37 Add jacrev support in torch.compile (#121146)
The changes are simple: a few entries were moved in trace_rules.py, and tests were included to compare the graph generated by jacrev.
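
A hedged sketch of the composition this enables (the function and shapes are illustrative):

```python
# Compile jacrev(f) end to end; the Jacobian of sum(sin(x)) w.r.t. x is cos(x).
import torch
from torch.func import jacrev


def f(x):
    return torch.sin(x).sum()


jac = torch.compile(jacrev(f))
x = torch.randn(3)
print(torch.allclose(jac(x), torch.cos(x)))  # True
```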

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121146
Approved by: https://github.com/zou3519
2024-03-06 16:05:33 +00:00
49d1fd31cf Fuse nodes with sizes (s0*s1*...,) and (s0, s1, s2, ...) (#120077)
Description:
- This PR tries to fuse nodes with compatible sizes, for example `node1: (s0, s1, s2)` and `node2: (s0 * s1 * s2)`. On `main` these two nodes cannot be fused due to their different sizes. With this PR we recompute node2's size, body, etc. using node1's indexing constraints and are thus able to fuse the two nodes.
- this should influence only cpu device

Example:
```python
from unittest.mock import patch
import torch
from torch._inductor.graph import GraphLowering
from torch._inductor import config

# Force multple scheduler nodes creation to fuse them
config.realize_opcount_threshold = 1

@torch.compile(fullgraph=True, dynamic=True)
def fn(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    o1 = x * w1.view(1, 1, 1, -1)
    o2 = x * w2.view(1, 1, 1, -1)
    output = o1 + o2
    return output

in_nodes = []
outputs = []
run_node = GraphLowering.run_node

graph_lowering_obj = None

def run_node_alt(self, n):
    global graph_lowering_obj

    graph_lowering_obj = self
    in_nodes.append(n)
    output = run_node(self, n)
    outputs.append(output)

    return output

x = torch.rand(1, 3, 32, 32)
w1 = torch.randn(32)
w2 = torch.randn(32)

with patch.object(GraphLowering, "run_node", run_node_alt):
    fn(x, w1, w2)

print("graph_lowering_obj.buffers:", graph_lowering_obj.buffers)
print("graph_lowering_obj.scheduler:", graph_lowering_obj.scheduler.nodes)
```

Output on `main`:
```
graph_lowering_obj.buffers: [ComputedBuffer(name='buf0', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg1_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul,
  origins={mul}
)), ComputedBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg4_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul_1,
  origins={mul_1}
)), ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf0, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(buf1, i3 + i1 * s0**2 + i2 * s0)
      tmp2 = tmp0 + tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=add,
  origins={add}
))]
graph_lowering_obj.scheduler: [FusedSchedulerNode(nodes=buf0_buf1), SchedulerNode(name='buf2')]
```
Output on this PR:
```
graph_lowering_obj.buffers: [ComputedBuffer(name='buf0', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg1_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul,
  origins={mul}
)), ComputedBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg4_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul_1,
  origins={mul_1}
)), ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf0, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(buf1, i3 + i1 * s0**2 + i2 * s0)
      tmp2 = tmp0 + tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=add,
  origins={add}
))]
graph_lowering_obj.scheduler: [FusedSchedulerNode(nodes=buf0_buf1_buf2)]
```

Context:
While working on https://github.com/pytorch/pytorch/pull/120411 (upsampling bicubic decomposition), I saw an extra for-loop in the generated C++ code summing up two buffers. Exploring the cause, this happened because the buffer's op count goes beyond `config.realize_opcount_threshold`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120077
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/peterbell10
2024-03-06 12:19:45 +00:00
aa0b0944d5 [dynamo] Re-dispatch torch.Tensor.new into torch.Tensor.new_empty method. (#121075)
Fix: https://github.com/pytorch/xla/issues/6009

This PR adds another case to `TensorVariable.method_new` special case, where it
re-dispatches `new` into `new_empty`.

Since we are using fake tensors, the `new` call doesn't actually get to the corresponding
backend (e.g. XLA). So, things like the following might happen:

```python
@torch.compile(backend="openxla")
def foo(x):
    new_x = x.new(*x.size())

    # new_x.device() == "xla"
    # x.device() == "xla:0"

    return new_x + x

a = torch.arange(10)
foo(a.to(xm.xla_device()))
```

Resulting in the following error:

```python
Traceback (most recent call last):
  ...
  File "torch/_dynamo/utils.py", line 1654, in get_fake_value
    ret_val = wrap_fake_exception(
  File "torch/_dynamo/utils.py", line 1190, in wrap_fake_exception
    return fn()
  File "torch/_dynamo/utils.py", line 1655, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "torch/_dynamo/utils.py", line 1776, in run_node
    raise RuntimeError(make_error_message(e)).with_traceback(
  File "torch/_dynamo/utils.py", line 1758, in run_node
    return node.target(*args, **kwargs)
  File "torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "torch/_subclasses/fake_tensor.py", line 885, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1224, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 955, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1445, in _dispatch_impl
    return self.wrap_meta_outputs_with_default_device_logic(
  File "torch/_subclasses/fake_tensor.py", line 1575, in wrap_meta_outputs_with_default_device_logic
    return tree_map(wrap, r)
  File "torch/utils/_pytree.py", line 900, in tree_map
    return treespec.unflatten(map(func, *flat_args))
  File "torch/utils/_pytree.py", line 736, in unflatten
    leaves = list(leaves)
  File "torch/_subclasses/fake_tensor.py", line 1550, in wrap
    ) = FakeTensor._find_common_device(func, flat_args)
  File "torch/_subclasses/fake_tensor.py", line 625, in _find_common_device
    merge_devices(arg)
  File "torch/_subclasses/fake_tensor.py", line 620, in merge_devices
    raise RuntimeError(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., device='xla', size=(10,), dtype=torch.int64), FakeTensor(..., device='xla:0', size=(10,), dtype=torch.int64)), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices xla, xla:0
```

Using `new_empty`, instead, fixes this error because it uses the device from the source
tensor, instead of inferring from the current dispatch key set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121075
Approved by: https://github.com/jansel
2024-03-06 11:49:27 +00:00
e3bd6efe72 [dynamo][guards-cpp-refactor] Prevent duplication of leaf guards (#121164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121164
Approved by: https://github.com/jansel
ghstack dependencies: #121121, #121147, #121154
2024-03-06 08:36:45 +00:00
b6b2d5b00a [dynamo][guards-cpp-refactor] Pass source name for debug ease (#121154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121154
Approved by: https://github.com/jansel
ghstack dependencies: #121121, #121147
2024-03-06 08:36:45 +00:00
52d89d8491 [dynamo][guards-cpp-refactor] Simplify DictGuardManager by removing KeyValueDictGuardManager (#121147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121147
Approved by: https://github.com/jansel
ghstack dependencies: #121121
2024-03-06 08:36:45 +00:00
af7f55ffc8 [dynamo][guards-cpp-refactor] Add argnames in pybind'ings (#121121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121121
Approved by: https://github.com/jansel
2024-03-06 08:36:45 +00:00
0b9bfcf9bb [non-strict export] support tensor attribute without other args (#121176)
Summary: Without args we have a hard time detecting fake modes. This causes a fake mode mismatch error in non-strict (specifically, `aot_export_module`) when the module contains tensor attributes, because we create a fresh fake mode when we cannot detect one. The fix is to pass the same fake mode throughout.
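
A hedged sketch of the shape of the failing case (the module and attribute are illustrative, and this assumes no-input export is accepted):

```python
# A module holding a plain tensor attribute, exported non-strictly with no inputs,
# so there is no argument from which to detect an existing fake mode.
import torch
from torch.export import export


class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.t = torch.ones(3)  # tensor attribute, not a Parameter or buffer

    def forward(self):
        return self.t + 1


ep = export(M(), args=(), strict=False)
```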

Test Plan: added test

Differential Revision: D54516595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121176
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2024-03-06 08:10:00 +00:00
8087912622 Revert "[XPU][Profiler] Add Logic To The Profiler For Processing XPU-backend Data (#120185)"
This reverts commit 0ab2ec37383e44fa00c520de6e2b40845fccc6f3.

Reverted https://github.com/pytorch/pytorch/pull/120185 on behalf of https://github.com/briancoutinho due to This PR contains a list search in '_parse_kineto_events()' that can lead to very high cost of running this post trace, training jobs getting stuck for mins ([comment](https://github.com/pytorch/pytorch/pull/120185#issuecomment-1980180774))
2024-03-06 06:39:51 +00:00
099ff51d45 torch check the division by zero in batch_norm_update_stats (#120882)
Fixes #120803

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120882
Approved by: https://github.com/CaoE, https://github.com/malfet
2024-03-06 05:40:21 +00:00
2eec0e7c5f [BE] Remove __inline__ from __global__ (#121246)
in layer_norm_kernel.cu since the qualifier seems to be ignored according to:

```
[18/263] Building CUDA object
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o
/home/mkozuki/ghq/github.com/crcrpar/torch-3/aten/src/ATen/native/cuda/layer_norm_kernel.cu(300):
warning #20050-D: inline qualifier ignored for "__global__" function

Remark: The warnings can be suppressed with "-diag-suppress
<warning-number>"

/home/mkozuki/ghq/github.com/crcrpar/torch-3/aten/src/ATen/native/cuda/layer_norm_kernel.cu(300):
warning #20050-D: inline qualifier ignored for "__global__" function

Remark: The warnings can be suppressed with "-diag-suppress
<warning-number>"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121246
Approved by: https://github.com/eqy, https://github.com/malfet
2024-03-06 05:16:52 +00:00
31bfa59970 Capture primitive data type arguments for profiling python_function (#120949)
RECORD_FUNCTION in python_function only captures arguments that are Tensors. However, it is very common for users to use non-tensor arguments in custom ops, for example, the sequence length in a GPT attention custom op. My previous PR tried to capture all non-tensor arguments, but it turned out to be very expensive in some cases.

This PR adds support for primitive (or their container) arguments in RECORD_FUNCTION.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120949
Approved by: https://github.com/soulitzer
2024-03-06 05:09:22 +00:00
5680f565d5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-06 04:50:46 +00:00
f72eb5ae4c __grid__constant is only supported on cuda version >= 11.8 (#121275)
Summary: Update the macros to exclude using __grid__constant when compiling for devices > sm80 with a CUDA version < 11.8.

Test Plan: buck2 build --keep-going --config buck2.log_configured_graph_size=true --flagfile fbcode//mode/dev fbcode//sigrid/predictor/client/python:ig_sigrid_client_pybinding

Differential Revision: D54556796

Co-authored-by: Driss Guessous <drisspg@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121275
Approved by: https://github.com/drisspg
2024-03-06 03:44:59 +00:00
dad1b76584 Introduce EphemeralSource for symbols that should be simplified out (#120948)
Context: view fake-ification should handle closed-over state in ViewFuncs for use in view replay by:
* fake-ifying tensors
* symbolicizing SymInts

This avoids invalid specialization during view replay. However, the symbols / tensors created as intermediates in the view chain should not stick around or be guarded on. This PR introduces an `EphemeralSource` intended to be used as a source for this purpose. It has the following properties:
* Considered first to be simplified out in symbol simplification logic
* Errors if guarded on

Differential Revision: [D54561597](https://our.internmc.facebook.com/intern/diff/D54561597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120948
Approved by: https://github.com/ezyang
2024-03-06 02:30:52 +00:00
d968fc442b [FSDP] restore fully_shard after exit from mock.patch (#121058)
Manually restore fully_shard after \_\_exit\_\_ from the mock.patch ctx. This will fix flaky CIs in trunk:
```
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py
```

This is a workaround to make mock.patch(fully_shard) work with multiple threads:
* thread 1 sets func.\_\_module\_\_[fully_shard] = patched function
* thread 2 reads func.\_\_module\_\_[fully_shard], thinks it is the original, and fails to restore fully_shard during \_\_exit\_\_
* this PR manually restores fully_shard after \_\_exit\_\_

Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121058
Approved by: https://github.com/awgu
2024-03-06 02:14:59 +00:00
eqy
8dafc81ba9 [cuBLAS][cuBLASLt] Fix expected failures for int_mm on sm75 (turing) (#121277)
CC @malfet @atalman @ptrblck @tinglvv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121277
Approved by: https://github.com/malfet
2024-03-06 01:51:01 +00:00
ce6a7d56fc Don't merge qnnpack (#120676)
Summary: The qnnpack library merge fails in some applications. This fix implements a recommendation from the Android build team to prevent library merging for qnnpack.

Test Plan:
1. Measure the binary size impact
1. Release build failed previously; now it should succeed

Differential Revision: D54048156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120676
Approved by: https://github.com/kimishpatel
2024-03-06 01:42:13 +00:00
4b3903379a Add assign argument to torch.Tensor.module_load (#121158)
Make `torch.__future__.get_swap_module_params_on_conversion() == True` account for `assign` argument to `nn.Module.load_state_dict`

Similar to when `torch.__future__.set_swap_module_params_on_conversion()` is `False`, `assign=True` means that we do not incur a `self.copy_(other)` and the properties of `other` will be preserved
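
A hedged usage sketch (the module and state_dict are illustrative):

```python
# With the swap-params future enabled, assign=True hands the loaded tensors to the
# module directly (no self.copy_(other)), so their properties are preserved.
import torch

torch.__future__.set_swap_module_params_on_conversion(True)

m = torch.nn.Linear(2, 2)
sd = {k: v.detach().clone() for k, v in m.state_dict().items()}
m.load_state_dict(sd, assign=True)
```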

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121158
Approved by: https://github.com/albanD
ghstack dependencies: #121157
2024-03-06 01:32:06 +00:00
27389e03f0 [easy] Fixed requires_grad preservation for nn.Module.load_state_dict(assign=True) (#121157)
Always preserve the requires_grad of the param in the module. Documentation is fixed in the PR stacked above.
Also fixes the test case to load a state_dict generated with `keep_vars=False` (the default).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121157
Approved by: https://github.com/albanD
2024-03-06 01:32:06 +00:00
87a533ed1b c10:intrusive_ptr, self assignment (#119275)
Summary:
In C++ books/sources, the self-assignment check is often considered a bad practice, since self-assignment is very unlikely.

See, for example libc++ doesn't have it:
cf94e0082e/libcxx/include/__memory/shared_ptr.h (L651)

How about we remove it?

Test Plan:
This check accounts for about 1% of the cycles assigned to intrusive_ptr::operator=
https://fburl.com/scuba/strobelight_services/9qqnrkdn

This is not a lot in pure cycle counts, but since these are GPU machines it can be substantial.

Differential Revision: D53471639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119275
Approved by: https://github.com/cyyever, https://github.com/ezyang
2024-03-06 01:11:56 +00:00
412c687e2e Fix permuted sum precision issue for lower precision on CPU (#108559)
Fixes #83149
There is a limitation of `TensorIterator` reductions:
The non-permuted input tensor will be coalesced down to a 2-d tensor by `TensorIterator`, whereas the permuted case may become a >2d operation (for example, two reduced dimensions and one non-reduced dimension).
Since the cpu reduction loop of `TensorIterator` only operates on two dimensions at a time, this means the intermediate sums will be truncated to lower precision.
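
A rough repro sketch of the precision gap being addressed (exact magnitudes are machine-dependent; the fp64 sum serves as the reference):

```python
import torch

x = torch.randn(64, 64, 64, dtype=torch.bfloat16)
ref = x.double().sum(dim=(0, 2))

plain = x.sum(dim=(0, 2)).double()
permuted = x.permute(2, 1, 0).sum(dim=(0, 2)).double()  # same reduction, permuted layout

# before the fix, the permuted reduction could drift noticeably further
# from the fp64 reference than the contiguous one on CPU
print((plain - ref).abs().max(), (permuted - ref).abs().max())
```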

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108559
Approved by: https://github.com/mingfeima, https://github.com/peterbell10
2024-03-06 01:01:35 +00:00
34e3f6f3c9 fix segfault in torch.native_channel_shuffle when input is empty (#121199)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

fix https://github.com/pytorch/pytorch/issues/121092

`torch.channel_shuffle` already handled empty inputs correctly, but `torch.native_channel_shuffle` bypassed the `numel == 0` check, which caused a divide-by-zero in the underlying kernel.
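
A minimal repro sketch of the empty-input case (the `native_channel_shuffle` call is the one that previously crashed):

```python
import torch

x = torch.empty(0, 4, 2, 2)            # empty along the batch dim
torch.channel_shuffle(x, 2)             # already handled empty inputs
torch.native_channel_shuffle(x, 2)      # previously divided by zero / segfaulted
```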

* __->__ #121199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121199
Approved by: https://github.com/malfet
2024-03-06 00:46:36 +00:00
8473cd92e4 remove compute capability 3.5 for CUDA 12 (#114930)
CUDA 12 has removed compute capability 3.5. NVCC throws the error: `nvcc fatal   : Unsupported gpu architecture 'compute_35'`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114930
Approved by: https://github.com/malfet
2024-03-06 00:40:57 +00:00
d13ed8503c CI: Add aarch64 docker build and ciflow tags (#120931)
adding workflows for aarch64 linux docker build with ACL installed as system dependency

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120931
Approved by: https://github.com/atalman, https://github.com/malfet
2024-03-06 00:31:22 +00:00
cac36e232e [PyTorch] Split StaticModule out of test_static_runtime (#121028)
I want to use StaticModule in another (internal) test, so splitting it out.

Differential Revision: [D54384817](https://our.internmc.facebook.com/intern/diff/D54384817/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121028
Approved by: https://github.com/suo
2024-03-05 23:14:07 +00:00
f5391dad82 Update docs to point to new sdpa_kernel context manager (#121180)
# Summary

Updates the SDPA docs to fix some small inaccuracies and point to the new sdpa_kernel context manager. The enum-like type SDPBackend, bound from C++, does not render its fields for some reason, so they are listed manually for now.
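
A short usage sketch of the context manager the docs now point to (assumes a CUDA device where the flash attention backend is available):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):  # restrict SDPA to a single backend
    out = F.scaled_dot_product_attention(q, k, v)
```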

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121180
Approved by: https://github.com/mikaylagawarecki
2024-03-05 22:19:48 +00:00
8bb3e0b643 [pytorch] Name the main and autograd threads for better debugging (#121170)
The main thread and the autograd one are latency critical threads. They launch CPU/GPU/Accelerator kernels and if for some reason they get preempted, the rank can become a straggler in a distributed training application. By naming these threads we can debug performance issues that impact the latency sensitive threads.

I used Kineto traces to verify if the thread names were propagated:

<img width="851" alt="Screenshot 2024-03-04 at 3 07 43 PM" src="https://github.com/pytorch/pytorch/assets/23515689/68b4a09c-b8e5-4f14-a5c0-6593f866c03f">

Also:

```
nvidia-smi
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3065920      C   ...me#python#py_version_3_10     1968MiB |
|    1   N/A  N/A   3065926      C   ...me#python#py_version_3_10     1978MiB |
|    2   N/A  N/A   3065930      C   ...me#python#py_version_3_10     2084MiB |
|    3   N/A  N/A   3065936      C   ...me#python#py_version_3_10     2016MiB |
|    4   N/A  N/A   3065939      C   ...me#python#py_version_3_10     1998MiB |
|    5   N/A  N/A   3065943      C   ...me#python#py_version_3_10     2070MiB |
|    6   N/A  N/A   3065948      C   ...me#python#py_version_3_10     2026MiB |
|    7   N/A  N/A   3065952      C   ...me#python#py_version_3_10     2070MiB |
+-----------------------------------------------------------------------------+
[me@myhost ~]$ ps -T -p 3065920
    PID    SPID TTY          TIME CMD
3065920 3065920 pts/14   00:01:04 pt_main_thread
...
3065920 3092181 pts/14   00:00:40 pt_autograd_d0
3065920 3092182 pts/14   00:00:00 pt_autograd_d1
3065920 3092183 pts/14   00:00:00 pt_autograd_d2
3065920 3092184 pts/14   00:00:00 pt_autograd_d3
3065920 3092185 pts/14   00:00:00 pt_autograd_d4
3065920 3092186 pts/14   00:00:00 pt_autograd_d5
3065920 3092187 pts/14   00:00:00 pt_autograd_d6
3065920 3092188 pts/14   00:00:00 pt_autograd_d7
...

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121170
Approved by: https://github.com/albanD
2024-03-05 22:15:39 +00:00
24944f6717 [doc] Fix math display in ChannelShuffle doc (#121247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121247
Approved by: https://github.com/mikaylagawarecki
2024-03-05 21:30:51 +00:00
b3a9d677a3 [ez] Add super() calls in test_custom_ops (#121239)
Some disable issues are getting spammed
Check that test_impl_invalid_devices gets skipped by the disable issue
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121239
Approved by: https://github.com/zou3519
2024-03-05 21:16:06 +00:00
34a28f01dd [Autograd] Improve error for leaf tensors as out argument to fallback (#121089)
Closes  #120988

Currently operators that hit the autograd fallback call `check_inplace`
on all mutated inputs, including out arguments. This leads to a slightly
confusing error message:
```
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```

Compared to functions that don't fallback, which raise
```
RuntimeError: add(): functions with out=... arguments don't support automatic differentiation, but one of the arguments requires grad.
```

This changes the error message to make clear the issue is with the out argument,
but does not tighten the check to outright ban out arguments that require grad.
Instead, I use the same checks from `check_inplace` which allows non-leaf tensors
that require grad to pass without error.
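
For illustration, the out=-with-grad calling pattern in question (torch.add is used only to show the pattern; it takes the non-fallback path and raises the second message quoted above, while fallback ops now raise a similarly explicit message):

```python
import torch

a, b = torch.randn(3), torch.randn(3)
out = torch.zeros(3, requires_grad=True)  # leaf tensor that requires grad

try:
    torch.add(a, b, out=out)
except RuntimeError as e:
    print(e)  # complains about the out= argument requiring grad
```
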
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121089
Approved by: https://github.com/lezcano, https://github.com/soulitzer
ghstack dependencies: #121142
2024-03-05 21:13:27 +00:00
eae9751e82 Fix linalg_eigvals invalid use of composite dispatch key (#121142)
`linalg_eigvals_out` calls into a dispatch stub, so only supports CPU and CUDA
strided tensors but incorrectly claimed to be a composite op. `linalg_eigvals`
also shouldn't defer to the out variant inside a `CompositeImplicitAutograd` op
as not all types support out variants. Instead, I add a new helper
`_linalg_eigvals` which does the same thing in a non-composite operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121142
Approved by: https://github.com/lezcano
2024-03-05 21:13:27 +00:00
393b4ab432 Fixes issue_119785 (#121048)
Fixes #119785

- Removed all sentinel files of `test_causal_variants_.*`.

- The `test_causal_variants_causal_variant_` tests pass after removing the dynamo_skips files.

- The `test_causal_variants_compile_causal_variant` tests fail with `PYTORCH_TEST_WITH_DYNAMO=1`. These tests already call torch.compile, so @skipIfTorchDynamo was added to skip them under `PYTORCH_TEST_WITH_DYNAMO` (see the sketch below).
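
A sketch of the decorator usage described above (class and test names abbreviated from the real test file):

```python
import unittest

from torch.testing._internal.common_utils import skipIfTorchDynamo


class TestAttnBias(unittest.TestCase):
    @skipIfTorchDynamo("This test already calls torch.compile")
    def test_causal_variants_compile_causal_variant(self):
        ...  # body compiles the attention bias variants with torch.compile
```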

**Tests**
```
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test_transformers.py -v -k "test_causal_variants"
================================================================== test session starts ==================================================================
platform linux -- Python 3.10.13, pytest-7.4.0, pluggy-1.0.0 -- /home/shuqiyang/.conda/envs/pytorch/bin/python
cachedir: .pytest_cache
rootdir: /data/users/shuqiyang/pytorch
configfile: pytest.ini
collected 77250 items / 77218 deselected / 32 selected
Running 32 items in this shard

test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cpu PASSED [0.7745s]                  [  3%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cpu PASSED [0.8020s]                  [  6%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cpu SKIPPED [0.0385s] (Lower righ...) [  9%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cpu PASSED [0.5046s]                  [ 12%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape0_cpu PASSED [0.6483s]                   [ 15%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape1_cpu PASSED [0.8537s]                   [ 18%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape2_cpu PASSED [0.8388s]                   [ 21%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape3_cpu PASSED [0.4859s]                   [ 25%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cpu SKIPPED [0.0084s] (Th...) [ 28%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cpu SKIPPED [0.0086s] (Th...) [ 31%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cpu SKIPPED [0.0081s] (Th...) [ 34%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cpu SKIPPED [0.0085s] (Th...) [ 37%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape0_cpu SKIPPED [0.0082s] (Thi...) [ 40%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape1_cpu SKIPPED [0.0085s] (Thi...) [ 43%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape2_cpu SKIPPED [0.0081s] (Thi...) [ 46%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape3_cpu SKIPPED [0.0085s] (Thi...) [ 50%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda PASSED [9.4185s]                [ 53%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cuda PASSED [0.4273s]                [ 56%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cuda SKIPPED [0.0280s] (Lower ri...) [ 59%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cuda PASSED [8.0999s]                [ 62%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape0_cuda PASSED [0.3785s]                 [ 65%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape1_cuda PASSED [0.3818s]                 [ 68%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape2_cuda PASSED [0.3864s]                 [ 71%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape3_cuda PASSED [0.7668s]                 [ 75%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda SKIPPED [0.0089s] (...) [ 78%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cuda SKIPPED [0.0087s] (...) [ 81%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cuda SKIPPED [0.0087s] (...) [ 84%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cuda SKIPPED [0.0084s] (...) [ 87%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape0_cuda SKIPPED [0.0087s] (T...) [ 90%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape1_cuda SKIPPED [0.0087s] (T...) [ 93%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape2_cuda SKIPPED [0.0084s] (T...) [ 96%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape3_cuda SKIPPED [0.0087s] (T...) [100%]

=================================================== 14 passed, 18 skipped, 77218 deselected in 39.72s ===================================================
```
```
$ pytest test_transformers.py -v -k "test_causal_variants"
================================================================== test session starts ==================================================================
platform linux -- Python 3.10.13, pytest-7.4.0, pluggy-1.0.0 -- /home/shuqiyang/.conda/envs/pytorch/bin/python
cachedir: .pytest_cache
rootdir: /data/users/shuqiyang/pytorch
configfile: pytest.ini
collected 77250 items / 77218 deselected / 32 selected
Running 32 items in this shard

test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cpu PASSED [0.2410s]                  [  3%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cpu PASSED [0.3984s]                  [  6%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cpu SKIPPED [0.0011s] (Lower righ...) [  9%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cpu PASSED [0.0095s]                  [ 12%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape0_cpu PASSED [0.1749s]                   [ 15%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape1_cpu PASSED [0.2138s]                   [ 18%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape2_cpu PASSED [0.2715s]                   [ 21%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape3_cpu PASSED [0.0108s]                   [ 25%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cpu PASSED [0.4864s]          [ 28%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cpu PASSED [0.5346s]          [ 31%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cpu SKIPPED [0.0011s] (Lo...) [ 34%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cpu PASSED [0.1722s]          [ 37%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape0_cpu PASSED [0.2341s]           [ 40%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape1_cpu PASSED [0.4786s]           [ 43%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape2_cpu PASSED [0.4635s]           [ 46%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape3_cpu PASSED [0.0861s]           [ 50%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda PASSED [9.7579s]                [ 53%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cuda PASSED [0.0044s]                [ 56%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cuda SKIPPED [0.0007s] (Lower ri...) [ 59%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cuda PASSED [9.2065s]                [ 62%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape0_cuda PASSED [0.0081s]                 [ 65%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape1_cuda PASSED [0.0063s]                 [ 68%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape2_cuda PASSED [0.0059s]                 [ 71%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape3_cuda PASSED [0.0055s]                 [ 75%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda PASSED [0.1200s]        [ 78%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cuda PASSED [0.1032s]        [ 81%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cuda SKIPPED [0.0010s] (...) [ 84%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cuda PASSED [0.1151s]        [ 87%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape0_cuda PASSED [0.0705s]         [ 90%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape1_cuda PASSED [0.0713s]         [ 93%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape2_cuda PASSED [0.0696s]         [ 96%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape3_cuda PASSED [0.1516s]         [100%]

=================================================== 28 passed, 4 skipped, 77218 deselected in 39.23s ====================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121048
Approved by: https://github.com/zou3519
2024-03-05 20:19:02 +00:00
8ccf8b2c47 Avoid COW input materialize in more forward ops (#121070)
Affected operators are: addr, cdist, sparse.sampled_addm, sparse.mm,
matrix_exp, softmax, cross_entropy

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121070
Approved by: https://github.com/ezyang
2024-03-05 19:47:24 +00:00
81dbc487c7 ci: add "typing_extensions" package to ci requirements list (#121136)
this is required for torchgen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121136
Approved by: https://github.com/malfet, https://github.com/atalman
2024-03-05 18:26:01 +00:00
3239f86a3d [ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)
According to the [cuBLAS API Reference](https://docs.nvidia.com/cuda/cublas/index.html#cublassetworkspace), the recommended workspace size for Hopper is 32 MiB, and 4 MiB for the other architectures. This PR increases the workspace size accordingly. I am not aware of the recommended workspace size for HIP, so it is kept unchanged.
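
For reference, a minimal sketch of overriding the workspace size from user code; this assumes the `CUBLASLT_WORKSPACE_SIZE` environment variable (interpreted in KiB) is honored, which may vary by build:

```python
import os

# 32 MiB = 32768 KiB; set before the first cuBLASLt-backed matmul runs
os.environ["CUBLASLT_WORKSPACE_SIZE"] = "32768"

import torch  # noqa: E402
```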

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120925
Approved by: https://github.com/eqy, https://github.com/malfet
2024-03-05 18:13:05 +00:00
8aeb247a3d [export] Remove WrapperModule. (#121042)
Summary: WrapperModule seems like a good idea but may introduce surprising behavior for users. For example, it never registers enclosed modules as submodules, so it's unclear what the state dict for the exported program should look like: some would argue every state should be included in the state dict, while others want to keep them as constants.

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D54326331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121042
Approved by: https://github.com/angelayi
2024-03-05 18:10:22 +00:00
0e604becc5 [NJT] support chunk on batch dim (#119713)
- support chunk op on batch dim
- support empty_like op
- add tests for the like ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119713
Approved by: https://github.com/jbschlosser
2024-03-05 17:57:50 +00:00
ae4c85960f Add Deberta pass (#121206)
Adding DebertaForQuestionAnswering to inductor benchmark pass, as it did not show up before

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121206
Approved by: https://github.com/desertfire
2024-03-05 17:56:25 +00:00
5abf7972d1 [DCP][state_dict] Implement pin_memory and shared_memory copy for _offload_state_dict_to_cpu (#120378)
**Summary**
This PR extends `_offload_state_dict_to_cpu` to accept a `cpu_offload_state_dict` argument. If `cpu_offload_state_dict` is not None, `_offload_state_dict_to_cpu` will use `copy_` to copy the GPU data to the CPU tensors. This allows users to pass a pin_memory or share_memory version of `cpu_offload_state_dict`.

This PR also adds `_create_cpu_state_dict` to allow users to easily create a pin_memory or share_memory cpu state_dict.

**Performance improvement**
```
# The micro-benchmark has a source state_dict with 150 tensors, and each tensor is 50MB.
# The micro-benchmark is run on a H100 machine with PCIe 5

cpu_state_dict_2 = _create_cpu_state_dict(state_dict, pin_memory=True)
cpu_state_dict_3 = _create_cpu_state_dict(state_dict, share_memory=True)

# GPU->CPU memory: 4.6556 seconds
cpu_state_dict = _offload_state_dict_to_cpu(state_dict)

# GPU->pin memory: 0.1566 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)

# GPU->shared memory: 0.5509 seconds (variation is quite large)
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_3)

# GPU->pin memory->shared memory: 0.2550 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)
_offload_state_dict_to_cpu(cpu_state_dict_2, cpu_offload_state_dict=cpu_state_dict_3)
```

Differential Revision: [D54045845](https://our.internmc.facebook.com/intern/diff/D54045845/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120378
Approved by: https://github.com/LucasLLC
2024-03-05 17:48:15 +00:00
cyy
6ecd65886a Remove unnecessary const_casts (#121225)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121225
Approved by: https://github.com/soulitzer
2024-03-05 17:34:24 +00:00
85c807b3fd [export] Ensure optional fields always have default value. (#121163)
Summary: Add additional check to make sure we can always unset an optional field.

Test Plan: CI

Differential Revision: D54504243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121163
Approved by: https://github.com/tugsbayasgalan
2024-03-05 17:16:49 +00:00
35004b8ab4 [dynamo] Fix handling of invalid args (#121110)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121110
Approved by: https://github.com/yanboliang
ghstack dependencies: #121106
2024-03-05 17:16:04 +00:00
4f19b5f7ef [dynamo] Remove extra guard for tensor constant attrs (#121106)
Also deletes some unused code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121106
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-03-05 17:16:04 +00:00
e4352182bd Disable remote cache test on ROCM (#121210)
Fixes #121194
Fixes #121166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121210
Approved by: https://github.com/aakhundov
2024-03-05 16:35:40 +00:00
f25a25fde5 Fix lintrunner-noclang (#121205)
Fix lintrunner-noclang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121205
Approved by: https://github.com/Skylion007
2024-03-05 16:18:36 +00:00
fbf36d01a0 Update Triton (#119457)
Fix pytorch nightly compilation for cuda linking

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119457
Approved by: https://github.com/lezcano
2024-03-05 15:04:12 +00:00
59d9f1e227 Spectral norm value test (#121068)
The spectral norm implementation has extensive tests, but there doesn't appear to be any check that the spectral norm (i.e., the top singular value) is actually computed correctly. There should be at least one such test case.

This adds one such test case for the parametrizations.py implementation of spectral norm.
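
A sketch of the kind of value check being added (after normalization the parametrized weight should have spectral norm ≈ 1; the tolerance allows for power-iteration error):

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

m = spectral_norm(nn.Linear(16, 16), n_power_iterations=50)
for _ in range(5):       # a few accesses let the power iteration converge
    w = m.weight         # parametrized weight, normalized by the estimated sigma
top_singular_value = torch.linalg.matrix_norm(w, ord=2)
assert torch.allclose(top_singular_value, torch.tensor(1.0), atol=1e-2)
```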

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121068
Approved by: https://github.com/soulitzer
2024-03-05 14:46:31 +00:00
d621e3e3b8 Add exhaustive module and optimizer tests for torch.load(state_dict, weights_only=True) (#121049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121049
Approved by: https://github.com/janeyx99
2024-03-05 14:27:50 +00:00
42821d462a [ATen][Native][CUDA] Decrease max_threads in ctc_loss (#120746)
There are some changes in CUDA 12.4 that require a smaller number of threads per block for double precision in `ctc_loss`. This PR addresses that change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120746
Approved by: https://github.com/ptrblck, https://github.com/janeyx99
2024-03-05 14:14:41 +00:00
12191f4b3e Fix make triton command on release branch (#121169)
Fixes #120044

Should fix the build-from-source instructions on the release branch here: https://github.com/pytorch/pytorch#from-source

Note that the /test/ channel is used for the release here to make sure it works before the actual release is completed.

Test main:
```
make triton
pip3 uninstall -y triton
WARNING: Skipping triton as it is not installed.
Looking in indexes: https://download.pytorch.org/whl/nightly/
Collecting pytorch-triton==3.0.0+a9bc1a3647
  Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.0.0%2Ba9bc1a3647-cp310-cp310-linux_x86_64.whl (239.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 239.0/239.0 MB 8.7 MB/s eta 0:00:00
Requirement already satisfied: filelock in /home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages (from pytorch-triton==3.0.0+a9bc1a3647) (3.13.1)
Installing collected packages: pytorch-triton
  Attempting uninstall: pytorch-triton
    Found existing installation: pytorch-triton 2.2.0
    Uninstalling pytorch-triton-2.2.0:
      Successfully uninstalled pytorch-triton-2.2.0
Successfully installed pytorch-triton-3.0.0+a9bc1a3647
```

Test release/2.2:
```
make triton
pip3 uninstall -y triton
WARNING: Skipping triton as it is not installed.
Looking in indexes: https://download.pytorch.org/whl/test/
Collecting pytorch-triton==2.2.0
  Using cached https://download.pytorch.org/whl/test/pytorch_triton-2.2.0-cp310-cp310-linux_x86_64.whl (183.1 MB)
Requirement already satisfied: filelock in /home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages (from pytorch-triton==2.2.0) (3.13.1)
Installing collected packages: pytorch-triton
  Attempting uninstall: pytorch-triton
    Found existing installation: pytorch-triton 3.0.0+a9bc1a3647
    Uninstalling pytorch-triton-3.0.0+a9bc1a3647:
      Successfully uninstalled pytorch-triton-3.0.0+a9bc1a3647
Successfully installed pytorch-triton-2.2.0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121169
Approved by: https://github.com/seemethere
2024-03-05 13:53:53 +00:00
ee557d8f61 skip detectron2_fcos_r_50_fpn in dynamic shape test (#120697)
As reported in https://github.com/pytorch/pytorch/issues/119434, `detectron2_fcos_r_50_fpn` fails dynamic shape testing; this PR skips dynamic batch size testing for that model.

* Error message:
```
  File "/home/jiayisun/pytorch/benchmarks/dynamo/common.py", line 3877, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```

* Root cause:
The benchmark code only annotates an input dim as dynamic when its size equals the batch size: c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If no dim equal to the batch size is found, the error above is thrown.
However, the inputs of `detectron2_fcos_r_50_fpn` are as follows:

```
([{'file_name': '/home/jiayisun/benchmark/torchbenchmark/data/.data/coco2017-minimal/coco/val2017/000000001268.jpg', 'height': 427, 'width': 640, 'image_id': 1268, 'image': tensor([[[147., 124.,  82.,  ...,   3.,   4.,   5.],
         [125., 104.,  65.,  ...,   3.,   3.,   4.],
         [ 87.,  68.,  34.,  ...,   2.,   2.,   2.],
         ...,
         [ 47.,  45.,  41.,  ...,  45.,  45.,  45.],
         [ 46.,  44.,  40.,  ...,  44.,  45.,  46.],
         [ 46.,  44.,  40.,  ...,  43.,  45.,  46.]],

        [[154., 129.,  84.,  ...,   3.,   4.,   5.],
         [133., 110.,  69.,  ...,   3.,   3.,   4.],
         [ 95.,  76.,  43.,  ...,   2.,   2.,   2.],
         ...,
         [ 44.,  42.,  38.,  ...,  34.,  37.,  39.],
         [ 43.,  41.,  37.,  ...,  35.,  39.,  41.],
         [ 43.,  41.,  37.,  ...,  35.,  40.,  43.]],

        [[171., 140.,  85.,  ...,   3.,   4.,   5.],
         [147., 120.,  71.,  ...,   3.,   3.,   4.],
         [103.,  83.,  47.,  ...,   2.,   2.,   2.],
         ...,
         [ 46.,  44.,  40.,  ...,  16.,  20.,  22.],
         [ 45.,  43.,  39.,  ...,  17.,  22.,  26.],
         [ 45.,  43.,  39.,  ...,  18.,  24.,  28.]]])}, ... ],)
```

None of the input dims equals the batch size, so we need to skip dynamic batch size testing for this model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120697
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/desertfire
2024-03-05 12:12:18 +00:00
c4a1570864 Temporarily increased compile time limit of #GPUs to 120. (#121076)
Fixes #115331.

This is a temporary fix that increases the compile-time limit on the number of GPUs to 120 until #119639 can be merged. Changing the parameter to 128 leads to annoying errors, as some checks would be tautological (`int8_t` is always < 128).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121076
Approved by: https://github.com/albanD
2024-03-05 11:39:14 +00:00
de8af28083 [FSDP][StateDict] Allow FULL_STATE_DICT option for 2D (#120837)
Fixes #120722

TL;DR for the issue:
As users are expected to use get_model_state_dict to do state_dict retrieval, I think it's fine to remove the warning and RuntimeError.
More context in #120722.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120837
Approved by: https://github.com/Skylion007
2024-03-05 10:03:44 +00:00
cyy
507611f9ae [CUDACachingAllocator] Turn Allocator::allocate into non-const (#120969)
Ideally, the method should be non-const since it changes the allocator state. Some const_casts are also removed along the way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120969
Approved by: https://github.com/albanD
2024-03-05 09:53:05 +00:00
46c9d646dd [Dynamo] Fix inspect.getattr_static doesn't work well for torch.utils._cxx_pytree.PyTreeSpec (#120812)
Fixes #118793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120812
Approved by: https://github.com/zou3519
2024-03-05 09:05:26 +00:00
311cc564f6 Fix README Typo (#120892)
Fixes a README typo so that the prompt is consistent with VSCode 1.87.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120892
Approved by: https://github.com/albanD, https://github.com/drisspg
2024-03-05 09:05:21 +00:00
a7e93c341f [hoo] Add with_effects to handle side effectful ops (#120296)
Proposal: https://docs.google.com/document/d/179QyhicGzTXJ5jvTAoAosP_Nzgf3PpgZwU_E3VV9PlM/edit#heading=h.bnm38nu3yfno
Implementation discussion: https://docs.google.com/document/d/179QyhicGzTXJ5jvTAoAosP_Nzgf3PpgZwU_E3VV9PlM/edit#heading=h.bj61609o1buq

Result with print:
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %with_effects : [num_users=1] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%arg0_1, aten.print.default, moo), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 0), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %arg1_1), kwargs = {})
    return [getitem, add]
```

Follow ups:
* Add handling to auto_functionalize
* Add support for tokens on the export side
* Add support for tokens on the inductor side

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120296
Approved by: https://github.com/zou3519
2024-03-05 08:58:32 +00:00
29976519a1 Make configs hash part of remote cache key (#121152)
Summary:
While testing I noticed that if we generate different configs, we fail to use the remote cache, so let's include the configs in the cache key.

Not sure how to write a deterministic test for this.

Test Plan: existing tests

Differential Revision: D54500957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121152
Approved by: https://github.com/aakhundov
2024-03-05 08:01:24 +00:00
43416e3059 Correctly read the cache key for remote cache (#121151)
Summary: While investigating why we were calling put each time, I noticed that the memcache backend returns a list instead of a direct result, which means we were correctly fetching the cached result but not using it.

Test Plan: The test should now work as expected

Differential Revision: D54500851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121151
Approved by: https://github.com/aakhundov
2024-03-05 07:33:20 +00:00
9e16622397 Move JK check to on-demand (#121182)
Summary: Some tests are failing due to checking JK during forking. Let's move the JK check to on-demand.

Differential Revision: D54518293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121182
Approved by: https://github.com/aakhundov
2024-03-05 07:03:25 +00:00
9ccff0aff9 Remove ids_of_folded_args from test_triton_kernel_equal_to_1_arg (#121192)
Summary: Due to the Triton pin update in https://github.com/pytorch/pytorch/pull/119457, `test_triton_kernel_equal_to_1_arg` started to break, as `ids_of_folded_args` has vanished from the upstream Triton codebase.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 6 tests in 6.790s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121192
Approved by: https://github.com/oulgen, https://github.com/bertmaher
2024-03-05 06:35:04 +00:00
4b49bc19e8 [export][reland] Disable exported_program.__call__ (#120019)
Summary: Reland of D53075378 / https://github.com/pytorch/pytorch/pull/119466

Test Plan: CI

Differential Revision: D53827930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120019
Approved by: https://github.com/ydwu4
2024-03-05 05:29:46 +00:00
6ddf5cf85e [AOTI] Update cpp wrapper codegen to use v2 C shim (#120714)
Summary: To use the torchgen-ed v2 C shim interface, cpp wrapper codegen needs to update its rule for generating the right parameter and function call. Because changing the emitted code will cause a FC breakage, we add a flag to control the behavior.

Differential Revision: [D54258086](https://our.internmc.facebook.com/intern/diff/D54258086)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120714
Approved by: https://github.com/chenyang78
ghstack dependencies: #120513
2024-03-05 04:32:32 +00:00
bd19d6d822 [AOTI] Use torchgen to generate C shim functions (#120513)
Summary: The current C shim layer manually implements a C interface for a handful of ops. Obviously that's not scalable if we want to extend it to cover all aten ops. This new torchgen script automatically generates C shim interfaces for CPU and CUDA backends. The interface follows the same parameter passing rules as the current C shim layer, such as

* Use plain C data types to pass parameters
* Use AtenTensorHandle to pass at::Tensor
* Use pointer type to pass optional parameter
* Use pointer+length to pass list
* Use device_type+device_index to pass device
* When a parameter is a pointer of pointer, e.g. AtenTensorHandle**, the script generates either a list of optional values or an optional list of values

https://gist.github.com/desertfire/83701532b126c6d34dae6ba68a1b074a is an example of the generated torch/csrc/inductor/aoti_torch/generated/c_shim_cuda.cpp file. The current version doesn't generate C shim wrappers for all aten ops, and probably generates more wrappers than needed on the other hand, but it should serve as a good basis.

This PR by itself won't change AOTI codegen and thus won't introduce any FC breakage. The actual wrapper codegen changes will come in another PR with some version control flag to avoid FC breakage.

Differential Revision: [D54258087](https://our.internmc.facebook.com/intern/diff/D54258087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120513
Approved by: https://github.com/jansel
2024-03-05 04:28:44 +00:00
ffe45a8188 [ATen-vulkan] Implement global shader registry (#121088)
Differential Revision: D54447700

## Context

This changeset updates Vulkan SPIR-V codegen to introduce a global SPIR-V shader registry and register shaders dynamically at static initialization time. This change makes it possible to define and link custom shader libraries to the ATen-Vulkan runtime.

Before:

* `gen_vulkan_spv.py` generated two files, `spv.h` and `spv.cpp` which would contain the definition and initialization of Vulkan shader registry variables.

After:

* Introduce the `ShaderRegistry` class in `api/`, which encapsulates functionality of the `ShaderRegistry` class previously defined in the generated `spv.h` file
* Introduce a global shader registry (defined as a static variable in the `api::shader_registry()` function)
* Define a `ShaderRegisterInit` class (taking inspiration from `TorchLibraryInit`) that allows for dynamic shader registration
* `gen_vulkan_spv.py` now only generates `spv.cpp`, which defines a static `ShaderRegisterInit` instance that triggers registration of the compiled shaders to the global shader registry.

Benefits:

* Cleaner code base; we no longer have `ShaderRegistry` defined in a generated file, and don't need a separate implementation file (`impl/Registry.*`) to handle shader lookup. All that logic now lives under `api/ShaderRegistry.*`
* Makes it possible to compile and link separate shader libraries, providing similar flexibility as defining and linking custom ATen operators

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121088
Approved by: https://github.com/manuelcandales, https://github.com/jorgep31415
2024-03-05 03:56:57 +00:00
c3c618c750 Update torchbench pin (#121029)
Fixes https://github.com/pytorch/pytorch/issues/117280 after bumping the HF version in https://github.com/pytorch/benchmark/pull/2179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121029
Approved by: https://github.com/desertfire
2024-03-05 03:21:32 +00:00
a15c02562a Fix dynamo failure (#121167)
Summary: Title

Test Plan: CI

Differential Revision: D54509198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121167
Approved by: https://github.com/izaitsevfb
2024-03-05 03:19:59 +00:00
3381f282c3 Revert "Update Triton (#119457)"
This reverts commit d49864f6a526d3def25f8da2fa9b8815b3347b9d.

Reverted https://github.com/pytorch/pytorch/pull/119457 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing test_triton_kernels in trunk d49864f6a5 ([comment](https://github.com/pytorch/pytorch/pull/119457#issuecomment-1977792634))
2024-03-05 01:46:44 +00:00
9deaa2e812 [BE]: FURB187 Use inplace reverse on lists: faster, more readable. (#121140)
Use the `reverse()` method, as it's faster and in-place.
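
A tiny illustration of the pattern the FURB187 rule prefers:

```python
# preferred: in-place reversal, no new list allocated
items = [1, 2, 3]
items.reverse()

# flagged pattern: builds a brand-new reversed list
# items = items[::-1]
```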

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121140
Approved by: https://github.com/albanD
2024-03-05 01:36:17 +00:00
ec4146c535 [inductor] skip foreach kernel for benchmark fusion (#121168)
Benchmark fusion currently does not support foreach kernels. If we don't explicitly skip foreach kernels, we end up with exceptions in `codegen_node_schedule`, because individual nodes in a foreach kernel may have incompatible shapes from a pointwise/reduction perspective.

cc Manman Ren (@manman-ren), who reported the issue when turning on benchmark fusion for BertForMaskedLM.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121168
Approved by: https://github.com/Chillee
2024-03-05 01:27:55 +00:00
bcf35c6ae6 [tensorboard] Handle bfloat16 type in add_histogram (#120087)
Summary:
add_histogram fails for this data type. Updating conversion code to handle it.

Stack trace for the failure -

```
[trainer0]Traceback (most recent call last):
[trainer0]  File "<torch_package_0>.tensorboard/logging/summary_v2.py", line 203, in unscriptable_record_summary
[trainer0]    unscriptable_histogram(name, t, step, ranks)
[trainer0]  File "<torch_package_0>.tensorboard/logging/fx_v1.py", line 146, in unscriptable_histogram
[trainer0]    Adhoc.writer().add_histogram(tag, x, step.int())
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/writer.py", line 40, in wrapper
[trainer0]    resp = super_method(*args, **kwargs)
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/writer_oss.py", line 526, in add_histogram
[trainer0]    histogram(tag, values, bins, max_bins=max_bins), global_step, walltime
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/summary.py", line 482, in histogram
[trainer0]    values = make_np(values)
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/_convert_np.py", line 23, in make_np
[trainer0]    return _prepare_pytorch(x)
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/_convert_np.py", line 30, in _prepare_pytorch
[trainer0]    x = x.detach().cpu().numpy()
[trainer0]TypeError: Got unsupported ScalarType BFloat16
```
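
A minimal sketch of the kind of conversion the fix performs (assumption: bfloat16 is upcast to float32 before calling `.numpy()`, since NumPy has no bfloat16 dtype):

```python
import torch

x = torch.randn(8, dtype=torch.bfloat16)
# x.numpy() raises "Got unsupported ScalarType BFloat16";
# upcast first, then convert
arr = x.detach().cpu().to(torch.float32).numpy()
```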

Test Plan: Updated unit test that was failing before but passes after this change.

Reviewed By: hamzajzmati, jcarreiro

Differential Revision: D53841197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120087
Approved by: https://github.com/jcarreiro, https://github.com/yanboliang
2024-03-05 00:27:21 +00:00
a3a8137484 [onnxrt, dynamo] Fix run with inputs on mix devices (#121159)
`onnxrt` previously assumed all tensors are on the same device; this PR fixes that by setting the device for each tensor individually.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121159
Approved by: https://github.com/thiagocrepaldi
2024-03-04 23:39:33 +00:00
83c312990f Add missing newline to repro and some utility thing in repro (#121051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121051
Approved by: https://github.com/ezyang, https://github.com/shunting314, https://github.com/eellison
2024-03-04 22:52:54 +00:00
eba28a6f91 [VK-API][Op Redesign][3/n] Expose new Context and Resource APIs (#121060)
Summary: For use in the next diff.

Test Plan: sc

Differential Revision: D54397862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121060
Approved by: https://github.com/SS-JIA
2024-03-04 22:26:07 +00:00
70c23a51ac Revert "[ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)"
This reverts commit 0a38a6ac8046e4d3f9cfaba86b7ec6517038646f.

Reverted https://github.com/pytorch/pytorch/pull/120925 on behalf of https://github.com/clee2000 due to broke inductor models and caused accuracy regression on nightly dashboard 0a38a6ac80 https://github.com/pytorch/pytorch/actions/runs/8118465367/job/22193590228 ([comment](https://github.com/pytorch/pytorch/pull/120925#issuecomment-1977556485))
2024-03-04 22:13:23 +00:00
df3c8b8390 [fake_impls] Fix seed/offset device for attention kernels (#120839)
1) Fix fake_impls to return the correct device for these attention
   kernels.
2) Remove special-casing and test file xfails
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120839
Approved by: https://github.com/drisspg
2024-03-04 22:02:32 +00:00
6a5c7d5f95 [ATen-vulkan] Enable deferred descriptor pool initialization (#121134)
Differential Revision: D54487619

## Context

Allow the descriptor pool of an `api::Context` object to be initialized in a deferred fashion, instead of forcing initialization upon construction. This mode of operation will be used in the ExecuTorch Vulkan delegate, where the exact number of descriptor sets can be determined once the graph is built, instead of needing to "guess" an adequate amount.

## Implementation Details

* Check `config.descriptorPoolMaxSets > 0` to determine whether the descriptor pool should be initialized
* Introduce a `DescriptorPool::init()` function to trigger initialization
* Introduce safeguards against using an uninitialized descriptor pool

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121134
Approved by: https://github.com/manuelcandales
2024-03-04 21:37:32 +00:00
0c07c0c15f Revert "add int4 packed gemm support on CPU device (#117475)"
This reverts commit 30befa592e0675cc694f87a4f6fb80894709e719.

Reverted https://github.com/pytorch/pytorch/pull/117475 on behalf of https://github.com/izaitsevfb due to fails meta-internal tests ([comment](https://github.com/pytorch/pytorch/pull/117475#issuecomment-1977474686))
2024-03-04 21:20:57 +00:00
74b19fa8b9 fix fsdp device mesh depenency issue (#121061)
as reported in https://github.com/pytorch/torchtrain/pull/103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121061
Approved by: https://github.com/awgu
2024-03-04 21:20:09 +00:00
7a065e3b23 improve the constantLR doc (#120852)
Fixes #120716
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120852
Approved by: https://github.com/janeyx99
2024-03-04 21:15:27 +00:00
cb812c9832 Add windows constraint to mkl package in wheel (#121014)
Follow up on: https://github.com/pytorch/pytorch/pull/102604
Address this comment: https://github.com/pytorch/pytorch/pull/102604#discussion_r1419944305

Wheel metadata for all wheels published to PyPI must match; otherwise poetry install will fail, see this comment:
https://github.com/pytorch/pytorch/issues/88049#issuecomment-1302555269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121014
Approved by: https://github.com/malfet
2024-03-04 20:54:26 +00:00
4cdc2d7096 [dynamo] Remove expected dynamo test failures (#120836)
Fixes some of the tests in #120643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120836
Approved by: https://github.com/zou3519
2024-03-04 20:41:49 +00:00
a98c17edc7 Revert "add int8 packed gemm support on CPU device (#118056)"
This reverts commit f84375ca5db623a6a53cbce2864d27dfad626228.

Reverted https://github.com/pytorch/pytorch/pull/118056 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/118056#issuecomment-1977368720))
2024-03-04 20:09:40 +00:00
9ff65d56a5 Revert "delete useless cast_outputs call in unary_op_impl_float_out (#120486)"
This reverts commit d053dcfa69a52e6b9f9f2ba997b6bffbc9b29bb5.

Reverted https://github.com/pytorch/pytorch/pull/120486 on behalf of https://github.com/izaitsevfb due to Fails meta internal tests ([comment](https://github.com/pytorch/pytorch/pull/120486#issuecomment-1977343125))
2024-03-04 19:52:23 +00:00
26431db939 [ONNX] Perform implicit casting of constants for the onnx::where operator (#118733) (#120619)
This PR fixes the problem of the `Where` operator being bound to different types when the dtype is not explicitly set. It does so by extending implicit casting to the onnx::Where operator, and includes the corresponding unit test.

Fixes #118733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120619
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2024-03-04 19:27:30 +00:00
58047205ed Delete unnecessary code (#120365)
Summary: Title

Test Plan: CI

Differential Revision: D53828357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120365
Approved by: https://github.com/Skylion007
2024-03-04 18:02:58 +00:00
2e6c08a14b Update flash_attention kernel from 2.3.6 to 2.5.5 (#118935)
# Summary
Updates FlashAttention kernel code from tag [2.3.6](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.3.6) to [2.5.5](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.5).

The usual changes were then re-applied on top of the updated kernel: changing how the dropout state is saved for backward, and removing the head_dim_pad, since it would make the kernel mutate in place, which interacts badly with functionalization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118935
Approved by: https://github.com/cpuhrsch
2024-03-04 17:36:22 +00:00
d49864f6a5 Update Triton (#119457)
Fix pytorch nightly compilation for cuda linking

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119457
Approved by: https://github.com/bertmaher
2024-03-04 17:04:59 +00:00
6566b3db67 Add an autotune cache for inductor generated kernels (#120963)
Summary: Inductor currently has a best config cache for kernels that it generates. This is a local cache done via writing to the file system. This diff takes this local cache to remote by reusing the existing triton caching mechanism built via Memcache internally and Redis externally.

Test Plan:
Tested locally using `TORCH_INDUCTOR_AUTOTUNE_REMOTE_CACHE=1`

Look at scuba to verify the local testing: https://fburl.com/scuba/triton_remote_cache/z6pypznk

The plan is to land this diff with this turned off and gradually introduce this.

Differential Revision: D54398076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120963
Approved by: https://github.com/jansel
2024-03-04 16:58:37 +00:00
3ef0befdc9 Better error messages for impl_abstract_pystub (#120959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120959
Approved by: https://github.com/drisspg
2024-03-04 15:24:36 +00:00
ce2903080c Add sparse compressed fake tensor support (#120920)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120920
Approved by: https://github.com/ezyang
2024-03-04 14:38:45 +00:00
c06499981d Add a decomposition for torch.put, 2. (#120179)
As in the title. It is an updated copy of https://github.com/pytorch/pytorch/pull/115306 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120179
Approved by: https://github.com/lezcano, https://github.com/peterbell10, https://github.com/jgong5
2024-03-04 14:37:30 +00:00
8ba49d0e53 Fix compilation error: load_fp32_from_fp16’ was not declared in this scope for ppc64le (#120307)
This patch adds the missing implementation of `load_fp32_from_fp16` for half precision, fixing the error: `load_fp32_from_fp16` was not declared in this scope.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120307
Approved by: https://github.com/jgong5
2024-03-04 11:08:39 +00:00
27ac73073b Fix hipification issue (#121107)
Differential Revision: D54470055

```
buck-out/v2/gen/fbcode/713b128926d8b21f/caffe2/__ATen-hip__/buck-headers/ATen/native/hip/MemoryAccess.cuh:201:61: error: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'int' [-Werror,-Wsign-compare]
    return ((threadIdx.x  + thread_work_elem*num_threads()) < remaining);
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ^ ~~~~~~~~~
```

```
buck-out/v2/gen/fbcode/713b128926d8b21f/caffe2/__ATen-hip__/buck-headers/ATen/native/hip/MemoryAccess.cuh:223:15: error: unused variable 'to' [-Werror,-Wunused-variable]
    scalar_t *to = reinterpret_cast<scalar_t *>(data[0]) + block_work_size() * idx;
              ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121107
Approved by: https://github.com/chenyang78
2024-03-04 09:41:21 +00:00
2e50566722 [dtensor] change distribute_module input/output_fn to accept module (#120895)
This is a BC-breaking change to distribute_module. The underlying rationale is that sometimes, inside the input_fn/output_fn, users want access to the current module for some attributes. This might not be very common, but in some cases access to the module is worth having.

An outstanding use case we want to support is float8: to make float8 work with the TP API, the input_fn/output_fn of TP parallel styles need access to the module, where the module might encapsulate a `dynamic_linear.emulate` attribute that is useful for input/output casting.

Since this is needed for fp8 and DTensor is still under prototype release, the change is worth making, and it's better to make it early.

For now this is a soft BC break: BC is still maintained, but deprecation messages are thrown (see the sketch below).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120895
Approved by: https://github.com/tianyu-l
2024-03-04 07:22:32 +00:00
3045b16488 Do not use warm_pool() if TorchTnT is used (#121047)
Summary: This diff is needed to avoid QPS drop when parallel compilation is used with TorchTNT.

Test Plan:
On TNT
* https://www.internalfb.com/mast/job/torchx-ldm_train-hxjhl0k1wjz93
On PyPer
* f537224855

Differential Revision: D54430900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121047
Approved by: https://github.com/yanboliang
2024-03-04 06:14:11 +00:00
cyy
4b494d0750 Fix comparison of integer expressions of different signedness (#121066)
Fixes these warnings
```
src/aten/src/ATen/native/cuda/ForeachReduceOp.cu:190:19: warning: comparison of integer expressions of different signedness: ‘int’ and ‘const size_t’ {aka ‘const long unsigned int’} [-Wsign-compare]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121066
Approved by: https://github.com/tringwald, https://github.com/Skylion007
2024-03-04 02:14:10 +00:00
c83dfc8854 [PT2][Inductor] Fix missing "example_value" for nodes introduced by group batch fusion (#120974)
Summary: Similar to D54140488, we fix more such bugs

Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```

Differential Revision: D54399360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120974
Approved by: https://github.com/jackiexu1992
2024-03-04 02:11:57 +00:00
cead0363a8 [jit][nested strided tensor] support nested tensor in check_trace (#121039)
Summary:
torch.testing.assert_equal doesn't support nested strided tensors because sizes is not implemented.

This adds special handling for nested tensors by checking for them and unbinding them if they are found.

Test Plan: test_trace_with_nested_strided_tensor_output

Differential Revision: D54430238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121039
Approved by: https://github.com/YuqingJ
2024-03-04 01:15:45 +00:00
089f4c0bd9 If data dependent, check if guard_size_oblivious would fix problem and report if so (#121011)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121011
Approved by: https://github.com/lezcano
2024-03-03 23:23:14 +00:00
cyy
13fadea888 [Clang-tidy header][21/N] Fix clang-tidy warnings in aten/src/ATEN/*.{cpp,h} (#120763)
This PR continues to fix clang-tidy warnings in aten/src/ATEN/*, following #120574.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120763
Approved by: https://github.com/Skylion007
2024-03-03 23:18:43 +00:00
4f0481e1d5 [inductor] add decompostition for mm in backward (#120933)
Summary:
1) As a follow-up to D53602514, we found a new way to decompose mm in the backward pass: sum the permuted input and reduce along dim 0. Benchmark result: P1190140001, a 30x speedup.
Some explanation of why the original mm decomposition is slow: for an m x k x n mm, when m is small and k is large, the stride for the lhs is [m, 1], so it needs to access memory k times to load all the data. As a result, the decomposition becomes effective with a permute, since the stride becomes [k, 1].

2) Add another pattern for large k. Benchmark result: P1190596489, a 28x speedup.

3) Fix the "value not found" error in ig ctr. f536115499

Test Plan:
pt2 decompose:

 {F1462894821}
decompose: f536159404
baseline: f536282578
705k vs 725k 4% for ig ctr

Differential Revision: D54294491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120933
Approved by: https://github.com/mengluy0125
2024-03-03 18:46:42 +00:00
b7f2522692 [dynamo][compile-time] Remove unnecessary tree_map_only (#121052)
Reduces the torch.compile(backend="eager") compile time for this code by 1-2 seconds.

~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)

    return x
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121052
Approved by: https://github.com/jansel
ghstack dependencies: #121053
2024-03-03 06:59:43 +00:00
368f242e37 Revert "[PT2D] Make the speedup benchmark works with DDP + CompiledAutograd (#120454)"
This reverts commit 8c2e569928a200893fe971e615b82a2f9ce32630.

Reverted https://github.com/pytorch/pytorch/pull/120454 on behalf of https://github.com/desertfire due to breaks nightly dashboard cudagraphs run ([comment](https://github.com/pytorch/pytorch/pull/120454#issuecomment-1975001824))
2024-03-03 02:58:47 +00:00
0e0a621e0c [dynamo] Minor refactors (#120966)
These are changes I pulled out of the above PRs due to not being
related.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120966
Approved by: https://github.com/yanboliang
2024-03-03 02:20:48 +00:00
8e4301077e [dynamo][comp-time] BuiltinVariableTracker - inspect signature only on failure (#121053)
Reduces the torch.compile(backend="eager") compile time for this code by 1-2 seconds.
~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)

    return x
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121053
Approved by: https://github.com/jansel
2024-03-02 23:03:00 +00:00
7aced61c46 [DCP] deletes legacy formatting test (#120127)
Should no longer be necessary

Differential Revision: [D53791345](https://our.internmc.facebook.com/intern/diff/D53791345/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120127
Approved by: https://github.com/fegin
ghstack dependencies: #119816
2024-03-02 22:04:39 +00:00
7f81563e5e [dynamo][guards-cpp-refactor] Skip type and length check guard for DictGuardManager (#120739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120739
Approved by: https://github.com/jansel
ghstack dependencies: #120673
2024-03-02 13:15:53 +00:00
82d1465d8d [dynamo][guards-cpp-refactor] DICT_CONTAINS guard (#120673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120673
Approved by: https://github.com/jansel
2024-03-02 13:15:53 +00:00
bab4b5a341 [dist][sharded_tensor] Fix ChunkShardingSpec metadata offsets for empty shards (#121002)
ChunkShardingSpec generated metadata where offsets exceed the tensor size.

Example:

Torchrec prepared ShardedTensorMetadata:
```
ShardedTensorMetadata(shards_metadata=[
ShardMetadata(shard_offsets=[0, 0], shard_sizes=[2, 512], placement=rank:0/cuda:0),
ShardMetadata(shard_offsets=[2, 0], shard_sizes=[2, 512], placement=rank:1/cuda:1),
ShardMetadata(shard_offsets=[4, 0], shard_sizes=[2, 512], placement=rank:2/cuda:2),
ShardMetadata(shard_offsets=[6, 0], shard_sizes=[2, 512], placement=rank:3/cuda:3),
ShardMetadata(shard_offsets=[8, 0], shard_sizes=[2, 512], placement=rank:4/cuda:4),
ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:5/cuda:5),
ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:6/cuda:6)
],
size=torch.Size([10, 512]
),
```
When calling ShardedTensor._init_from_local_shards_and_global_metadata(),
the ShardedTensor ShardingSpec builds this metadata:

```
ShardedTensorMetadata(shards_metadata=[
ShardMetadata(shard_offsets=[0, 0], shard_sizes=[2, 512], placement=rank:0/cuda:0),
ShardMetadata(shard_offsets=[2, 0], shard_sizes=[2, 512], placement=rank:1/cuda:1),
ShardMetadata(shard_offsets=[4, 0], shard_sizes=[2, 512], placement=rank:2/cuda:2),
ShardMetadata(shard_offsets=[6, 0], shard_sizes=[2, 512], placement=rank:3/cuda:3),
ShardMetadata(shard_offsets=[8, 0], shard_sizes=[2, 512], placement=rank:4/cuda:4),
ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:5/cuda:5),
ShardMetadata(shard_offsets=[12, 0], shard_sizes=[0, 512], placement=rank:6/cuda:6)
],
size=torch.Size([10, 512]), tensor_properties=TensorProperties(dtype=torch.float16, layout=torch.strided, requires_grad=False, memory_format=torch.contiguous_format, pin_memory=False))
```
The deduced ChunkShardingSpec:
```
ChunkShardingSpec(dim=0, placements=[rank:0/cuda:0, rank:1/cuda:1, rank:2/cuda:2, rank:3/cuda:3, rank:4/cuda:4, rank:5/cuda:5, rank:6/cuda:6])
```

The fix is to limit offsets by dim size.
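
A minimal sketch of that clamping (names are illustrative, not taken from the actual patch):

```
def chunk_offsets(dim_size, chunk_size, num_shards):
    # offsets advance by chunk_size but are capped at dim_size, so trailing
    # empty shards report offset dim_size instead of running past the end
    return [min(rank * chunk_size, dim_size) for rank in range(num_shards)]

# chunk_offsets(10, 2, 7) -> [0, 2, 4, 6, 8, 10, 10]  (matches the expected metadata above)
```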

Differential Revision: [D54419513](https://our.internmc.facebook.com/intern/diff/D54419513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121002
Approved by: https://github.com/wz337
2024-03-02 08:58:48 +00:00
suo
66b20b4297 [export][ez] minor variable rename (#121040)
since `_export()` now takes an `nn.Module` only (which is asserted against at an upper layer), we should change this variable name from `f` to `mod` and remove some unnecessary isinstance checks

Differential Revision: [D54430381](https://our.internmc.facebook.com/intern/diff/D54430381/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121040
Approved by: https://github.com/angelayi
ghstack dependencies: #121037
2024-03-02 08:49:06 +00:00
suo
505637198a [export] cleanup to rewrite steps (#121037)
1. Some underscores for consistency of private functions.
2. remove dead code in `_replace_param_buffer_names`

Differential Revision: [D54429206](https://our.internmc.facebook.com/intern/diff/D54429206/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121037
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-03-02 08:45:50 +00:00
b0cfa96e82 [Torchelastic][Logging] Pluggable logsspecs using python entrypoints and option to specify one by name. (#120942)
Summary:
Expose an option that lets users specify the name of the LogsSpec implementation to use.
- Has to be defined in entrypoints under `torchrun.logs_specs` group.
- Must implement LogsSpec defined in prior PR/diff.

Test Plan: unit test+local tests

Reviewed By: ezyang

Differential Revision: D54180838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120942
Approved by: https://github.com/ezyang
2024-03-02 08:07:52 +00:00
f351a71dbb remove constraints from capture_pre_autograd_graph (#120981)
Differential Revision: D54407296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120981
Approved by: https://github.com/zhxchen17
2024-03-02 07:00:51 +00:00
83d848e1c7 [Quant][Inductor] Enable lowering of dynamic qlinear for X86Inductor (#120605)
**description**
Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`. So, we treat it as dynamic quantization of activation + static quantized linear.
The previous implementation of `onednn.qlinear_pointwise` is for the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we added a variation `onednn.qlinear_pointwise.tensor` to support the case.
This feature is targeting PyTorch 2.3 release.

**Test plan**
```
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu
python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear
```

**Performance before and after lowering `choose_qparam` to Inductor**
Before
- latency for shape (32, 32) = 0.151 ms
  latency for shape (128, 128) = 0.153 ms
  latency for shape (1024, 1024) = 0.247 ms

After
- latency for shape (32, 32) = 0.049 ms
- latency for shape (128, 128) = 0.052 ms
- latency for shape (1024, 1024) = 0.133 ms

Test method: A module with a single Linear layer, dynamic-quantize, lower to X86Inductor
Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-02 05:11:17 +00:00
af5376c444 [dtensor] add support for loss parallel (#119877)
Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross entropy computation when the input is sharded on the class dimension (in a classification problem with many classes). The implementation is via a context manager `loss_parallel`, after enabling which users can directly use `torch.nn.functional.cross_entropy` or `torch.nn.CrossEntropyLoss` without modifying other parts of their code.

Here are the underlying rationales why we are going through these op replacements:

1. `nn.functional.cross_entropy` is the common method OSS users use for things like transformer training. To avoid changing user code, we want users to keep using this function for loss calculation if they already do.
2. `nn.functional.cross_entropy` boils down to `aten.log_softmax` and `aten.nll_loss_forward/backward`, and DTensor already supports those ops (#117723 #119255 #118917 #119256). They perform the computation with the input *replicated* on the class dimension.
3. However, when the input of this loss calculation is **sharded on the class dimension**, running the sharded computation efficiently requires running both `aten.log_softmax` and `aten.nll_loss_forward` with multiple all-reduce collectives **in the middle of** those aten ops. This is not possible if we just override these two ops, so we need some way to **decompose** them into smaller ops so that collectives can run in between.
4. We explored the existing decompositions (#118950). They seem to work, except that `log_softmax_backward` and `nll_loss_backward` combined together in aten are implemented in an inefficient way, which triggers an additional expensive collective. Recently a user also reported similar issues: https://github.com/pytorch/pytorch/issues/119261.
5. Therefore, for now we do our own decomposition inside a context manager, specifically for sequence parallelism. Once we have a better decomposition in core, we can possibly take that instead of reinventing the wheel here.
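
A minimal usage sketch, assuming the context manager is exposed roughly as described above (the exact import path and the DTensor setup are illustrative and omitted):

```
import torch
import torch.nn.functional as F
from torch.distributed.tensor.parallel import loss_parallel

# `logits` is assumed to be a DTensor sharded on the class (last) dimension,
# `labels` a local/replicated tensor of class indices
with loss_parallel():
    loss = F.cross_entropy(logits, labels)
    loss.backward()   # backward also runs under the context manager
```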

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119877
Approved by: https://github.com/wanchaol
2024-03-02 05:06:26 +00:00
c4ed456fc3 [inductor] fix accuracy failure for a few models under freezing (#121054)
Fix https://github.com/pytorch/pytorch/issues/120545. These models fail the accuracy test with freezing because of conv-batchnorm fusion, which causes relatively large numerical churn.

For the failed TIMM models, raising the tolerance to `8 * 1e-2` makes the test pass.

For the failed TB models, the numerical difference is too large. After discussing with @eellison, we decided to skip them with freezing for now.

On the other hand, we should probably dig into why the conv-bn fusion causes such a large numerical difference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121054
Approved by: https://github.com/eellison
2024-03-02 04:53:59 +00:00
f84375ca5d add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
ghstack dependencies: #117475
2024-03-02 04:35:49 +00:00
5258c3645d [ATen-vulkan][EZ] Bug fixes: only create the image view when memory has been bound, invalidate cmd on flush (#121027)
Summary:
## Context

Introduce some simple bug fixes to the Vulkan Compute API that were causing errors on Android.

1. When using deferred allocation for image textures, it is undefined behaviour to create a `vkImageView` for a `vkImage` that has not yet been bound to memory. Fix this by creating the image view only after the `vkImage` has been bound to memory.
2. When flushing the `api::Context`, the command pool is flushed but any current command buffers are not invalidated. This will cause a segmentation fault if the command buffer is not submitted prior to calling `flush()`, because subsequent calls to `submit_*_job()` will use the old command buffer which will have been freed when the command pool is flushed. To fix, invalidate any existing command buffers when calling `flush()`.

Test Plan:
Build the test binary for Android:

```
buck build --target-platforms=ovr_config//platform/android:arm64-fbsource -c ndk.custom_libcxx=false //xplat/caffe2:pt_vulkan_api_test_bin --show-output
```

Push and run the test binary on a local android phone.

Differential Revision: D54425370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121027
Approved by: https://github.com/mcr229, https://github.com/cbilgin
2024-03-02 04:35:46 +00:00
2d9efad38f Add the bound check for flatten with out_dim (#120894)
Fixes #120762

In the example below, the bound is not valid but goes unchecked.
```
a = torch.tensor([1, 2, 3])
a.flatten(start_dim=0, end_dim=1, out_dim='a')
```

The same bound is already checked in this case:

```
a = torch.tensor([1, 2, 3])
a.flatten(start_dim=0, end_dim=1)
```

- Therefore, just apply the same check.

@malfet @janeyx99
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120894
Approved by: https://github.com/malfet, https://github.com/spzala
2024-03-02 03:56:55 +00:00
06fe6ed82b [dynamo bug burndown] update tensor creation to support sequences of tensors (#120872)
Fixes https://github.com/pytorch/pytorch/issues/120645

`_internal_new_from_data` calls `_recursive_build`, but we run into errors such as in the cases below.
```
Failed running call_function <function tensor at 0xDEADBEEF>:
scalar_tensor(): argument (position 1) must be Number, not FakeTensor

# e.g. cases
1. [FakeTensor(..., size=(20, 1), dtype=torch.float64), ..., FakeTensor(..., size=(20, 1), dtype=torch.float64)]
- Here, we call _recursive_build(sizes=[4] ...) which hits the base case `if dim == ndim:` in the 2nd level of recursion.
- So, we try to return `scalar_tensor(FakeTensor)`
2. [[(FakeTensor(..., size=(1,), dtype=torch.int64), FakeTensor(..., size=(), dtype=torch.int64)]]

# side note: when can size = ()? Probably from scalar_tensor.
>>> torch.scalar_tensor(1).shape
torch.Size([])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120872
Approved by: https://github.com/ezyang
2024-03-02 02:22:59 +00:00
a3b81666b1 [Dynamo] Fix guards for code objects (#120909)
By comparing them only by id, and raising an assert if someone calls into `EQUALS_MATCH`
Which render following example compileable:
```python
import torch

@torch.compile()
def foo(x, y):
    code = compile(y, "foo", "exec")
    exec(y)
    return x

print(foo(torch.rand(3), "print('Hello World')"))
```

Fixes https://github.com/pytorch/pytorch/issues/120647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120909
Approved by: https://github.com/jansel
2024-03-02 02:17:17 +00:00
f7a2bae0ac Change TestOpWaitiness to use MultiProcessTestCase (#121046)
The test has been failing sporadically in CI recently and the failures
are not reproducible locally, likely due to a nasty race condition
related to a combination of MultiThreadedTestCase, the use of global state
and finalizers, and the recently introduced test decorator for the native
funcol migration.

Switching the test to use MultiProcessTestCase to provide better
isolation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121046
Approved by: https://github.com/weifengpy
2024-03-02 01:12:14 +00:00
4cf6d1172b [FSDP2] Used ReduceOp.AVG if fp32 reduce-scatter (#120919)
This PR uses `ncclAvg` op (via `ReduceOp.AVG`) if doing fp32 reduce-scatter. This allows the division by world size to happen in the reduce-scatter kernel itself, which seems to save extra memory read/write for dividing. This yields ~1.5% speedup on the Llama-7B workload (and makes per-parameter FSDP faster than flat-parameter FSDP 😅 ).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120919
Approved by: https://github.com/yifuwang, https://github.com/wanchaol
ghstack dependencies: #120238, #120910
2024-03-02 00:39:16 +00:00
85157af784 Fix more xfails for scaled_dot_product_attention (#121032)
Followup to #120928. - should fix #120921 .

I missed one test in #120928 - test_dispatch_symbolic_meta_outplace_all_strides. This wasn't caught because #120921 was open at the time, disabling the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121032
Approved by: https://github.com/drisspg
2024-03-02 00:28:44 +00:00
7c71d7f32b [DTensor] Supported foreach=True for clip_grad_norm_ (#120910)
This PR adds support for `clip_grad_norm_(foreach=True)` by implementing `aten._foreach_norm.Scalar` and `aten._foreach_mul_.Tensor`. `foreach=True` is required to get competitive performance with `DTensor`.

`foreach=True` reduces CPU overhead for Llama-7B from 388 ms to 63 ms. Existing flat-parameter FSDP's `clip_grad_norm_` takes 3 ms on CPU 😢 .
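
For reference, the call this targets is the standard clip_grad_norm_ with foreach enabled (a plain-tensor stand-in is shown below; in the PR's setting the parameters and gradients would be DTensors):

```
import torch
import torch.nn as nn

model = nn.Linear(8, 8)
model(torch.randn(2, 8)).sum().backward()

# foreach=True routes the norm computation through the foreach ops
# (aten._foreach_norm / aten._foreach_mul_) that this PR adds DTensor support for
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, foreach=True)
print(total_norm)
```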

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120910
Approved by: https://github.com/wanchaol, https://github.com/janeyx99
ghstack dependencies: #120238
2024-03-02 00:28:09 +00:00
f0e8e7cf43 [DTensor] Supported foreach=False for clip_grad_norm_ (#120238)
This PR adds `DTensor` support for `aten.linalg_vector_norm.default` and `aten.stack.default` so that we can run `clip_grad_norm_` (with `foreach=False`).

To implement `linalg_vector_norm`, we introduce a `_NormPartial` placement since the reduction op for norm is the norm itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120238
Approved by: https://github.com/wanchaol
2024-03-02 00:25:16 +00:00
30befa592e add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU; both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-02 00:17:34 +00:00
c8e56b4965 [c10d] dump from one and only one thread (PG0's monitor thread) (#120893)
Summary:
We found that when there are multiple PGs in a process and a hardware
failure happens, multiple PGs/threads in the same process compete to dump
the same records at the same time. This affects the reliability of the dumps.

In this PR, we make the change such that only one thread/PG can dump:
PG0's monitor thread. We use a static variable to indicate that something
(e.g., a collective timeout) has triggered the dump locally.

The monitor thread dumps debug info under any one of these 3 conditions:
1. the static variable is set to true by the watchdog thread when it detects a timeout or a pipe dump signal
2. a timeout signal is received from other ranks through TCPStore
3. no heartbeat from the watchdog
Test Plan:
python test/distributed/test_c10d_nccl.py -k
test_timeout_dumps_on_stuck_ranks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120893
Approved by: https://github.com/wconstab
2024-03-02 00:13:13 +00:00
3d7cf8f392 Revert "Limit loop unrolling (#120023)"
This reverts commit 6cc7f9a2e6bedff3109ea066278e9805713da4bb.

Reverted https://github.com/pytorch/pytorch/pull/120023 on behalf of https://github.com/anijain2305 due to breaks llms export ([comment](https://github.com/pytorch/pytorch/pull/120023#issuecomment-1974104633))
2024-03-02 00:04:08 +00:00
d8395830ea [ONNX][dynamo_export] Skip instance_norm decomp for export (#120866)
Otherwise, instance_norm is decomposed into batch_norm with training set to True,
and the downstream exporter has no way to figure out that training is actually not needed.
ONNX does define an InstanceNormalization operator, but due to the decomp
it is unnecessarily exported as batch norm plus glue code.

Depends on https://github.com/microsoft/onnxscript/pull/1284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120866
Approved by: https://github.com/thiagocrepaldi, https://github.com/titaiwangms
2024-03-01 23:51:16 +00:00
581fe26792 [C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745
Approved by: https://github.com/zdevito
2024-03-01 23:45:43 +00:00
0a38a6ac80 [ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)
According to the [cuBLAS API Reference](https://docs.nvidia.com/cuda/cublas/index.html#cublassetworkspace) the recommended workspace size for Hopper is 32 MiB and for the rest architectures 4 MiB. This PR increases the workspace size accordingly. I am not aware of the recommended workspace size for HIP, that is why I am keeping it unchanged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120925
Approved by: https://github.com/eqy, https://github.com/malfet
2024-03-01 23:32:59 +00:00
06b52dd103 TD outside of test job (#118250)
Give TD its own job so that each shard can get the results from this single job artifact; the shards will always be in sync with each other, and we no longer need to worry about consistency issues.

* Move test discovery to its own file that is not dependent on torch so it can be run without building torch
  * Cannot do cpp test discovery before building pytorch
* Move the TD calculation to its own file that creates a json file with the final results
* TD is now job/build env agnostic
* TD will rank all tests, including those that test jobs may not want to run (e.g. it will rank distributed tests along with default tests, even though these tests are never run on the same machine together)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118250
Approved by: https://github.com/huydhn
2024-03-01 23:08:10 +00:00
d08ce51881 [compiled autograd] refactor eager test loading and run custom ops tests (#120679)
TestCustomOp's tests use helper attributes and functions from a util parent class. To support arbitrary test classes, we need to refactor the current approach. Instead of allowlisting certain methods, we can instead copy the whole class and only overwrite the "test_.*" methods.

Compiled autograd fails on ~10/90 of the newly added tests. test_autograd_function_backed_op is the example we discussed in PT-2D meeting about requiring c++ autograd::Function support. I'm addressing this in #120732

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120679
Approved by: https://github.com/jansel, https://github.com/zou3519
2024-03-01 22:48:17 +00:00
8cb4855d1e Release the GIL in serialization when it is safe to do so (#120818)
In particular this ensures we release the GIL when serializing:
- PyBytes objects (this is how we get the pickle object)
- Storage objects

Other string-like objects keep the GIL, which is fine because we only use this for very small strings today (for endianness), so releasing the GIL is not important there.
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120818
Approved by: https://github.com/colesbury
2024-03-01 22:37:26 +00:00
fd2ab1f613 [PT2][Inductor] Change the split cat log to debug (#120823)
Summary: Address the report in https://github.com/pytorch/pytorch/issues/120771.

Test Plan: see signal

Differential Revision: D54323475

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120823
Approved by: https://github.com/jackiexu1992
2024-03-01 22:34:23 +00:00
797d4fbdf4 [export] Log operator set. (#120951)
Summary: as title. We want to count the number of total operator calls, and the distinct set of operators in the exported graph.

Test Plan: CI

Differential Revision: D54390298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120951
Approved by: https://github.com/tugsbayasgalan
2024-03-01 20:58:31 +00:00
d3876f73e7 Preserve metadata for MutableMapping and MutableSequence in pin_memory and collate_fn (#120553)
A user-defined `Mapping` type may contain metadata (e.g., pytorch/tensordict#679, https://github.com/pytorch/pytorch/pull/120195#issue-2141716712). Simply using `type(mapping)({k: v for k, v in mapping.items()})` does not take this metadata into account. This PR uses `copy.copy(mapping)` to create a clone of the original collection and iteratively updates the elements in the cloned collection. This preserves the metadata of the original collection via `copy.copy(...)` rather than relying on the `__init__` method of the user-defined class.
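
A small sketch of the pattern (not the exact collate/pin_memory code):

```
import copy

def map_values(mapping, fn):
    # shallow-copy the user-defined Mapping so any extra metadata it carries
    # survives, then overwrite the items with their transformed values
    clone = copy.copy(mapping)
    for key, value in mapping.items():
        clone[key] = fn(value)
    return clone
```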

Reference:

- pytorch/tensordict#679
- #120195

Closes #120195

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120553
Approved by: https://github.com/vmoens
2024-03-01 20:43:42 +00:00
a7c799fb85 [executorch] Add support for method variants in aten executorch code gen (#121016)
Summary: Title.

Test Plan: The added unittest

Differential Revision: D54423028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121016
Approved by: https://github.com/larryliu0820
2024-03-01 20:33:02 +00:00
7a64eb65e4 Fix Dynamo tests failing with "Failed running call_function <built-in function linalg_norm" (#120993)
When iterating the `ord` value over an array, we share the same torchdynamo context. This makes dynamo treat the `ord` variable as a dynamic shape, causing problems.

In the `vector_norm` decomposition, casting an int-typed `ord` to float fixes this problem.

Fixes https://github.com/pytorch/pytorch/issues/119795
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120993
Approved by: https://github.com/lezcano
2024-03-01 20:27:45 +00:00
39e4d1a535 Make TestEmbeddingNNDeviceTypeCPU::test_EmbeddingBag_per_sample_weights_and_no_offsets_cpu_int32_float32 compatible with TorchDynamo (#120831)
Previously, the test case directly accessed the tensor data via tensor.data, which is not supported on FakeTensor. So we manually copy the tensor as a workaround.
Fixes: https://github.com/pytorch/pytorch/issues/119788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120831
Approved by: https://github.com/janeyx99
2024-03-01 20:27:41 +00:00
e02047add4 [BE][Ez]: Update ruff to 0.3.0 (#121003)
Update ruff to 0.3.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121003
Approved by: https://github.com/malfet
2024-03-01 20:20:55 +00:00
af93849a3a [pt2 export] small fix on non_persistent buffer unlift (#120715)
Summary: Change to get_buffer from the input plain_graph_module instead of the new stateful_gm when restoring non_persistent buffers, since the stateful_gm doesn't contain the buffer yet.

Test Plan:
Added test case.
`buck test caffe2/test:test_export -- test_unlift_nonpersistent_buffer`

Differential Revision: D54216772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120715
Approved by: https://github.com/zhxchen17
2024-03-01 20:20:00 +00:00
19fcf6de1a Add lowering for fraction_max_pool2d (#120460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120460
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-03-01 20:13:20 +00:00
cdb50d0380 remove constraints from aot_compile (#120979)
Differential Revision: D54405986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120979
Approved by: https://github.com/zhxchen17
2024-03-01 20:06:21 +00:00
55ae8fb1f6 Switched m1 runners to the lable macos-m1-stable (#120997)
Switched M1 runners to use the `macos-m1-stable` label, which points to exactly the same M1 machines running macOS 13.2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120997
Approved by: https://github.com/malfet
2024-03-01 19:52:34 +00:00
de3202abea [EZ][BE] Remove Python-2 installation logic (#121015)
Not sure why it's still there in 2024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121015
Approved by: https://github.com/jeffdaily, https://github.com/atalman
2024-03-01 19:39:02 +00:00
b474a523c6 Ban passing in free function into capture_pre_autograd_graph (#120817)
Summary: Today we don't allow free functions to be the tracing callable in torch.export. As part of migrating capture_pre_autograd_graph usages to torch.export, we need to ban free functions in capture_pre_autograd_graph as well.

Test Plan: CI

Differential Revision: D54319597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120817
Approved by: https://github.com/zhxchen17, https://github.com/andrewor14
2024-03-01 19:38:58 +00:00
ce50db22c2 Handle transposition pattern seen in SDPA with unbacked SymInts (#121005)
Fixes https://github.com/pytorch/pytorch/issues/121000

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121005
Approved by: https://github.com/lezcano
2024-03-01 18:58:19 +00:00
11f2e8beac [Dynamo, Compiled] Save some python overhead when calling compiled function with many tangents (#118730)
When a dynamo backend captures the entire forward and backward pass without a graph break, there can be many (from memory, hundreds or thousands for a big model) `contiguous` calls. Here we can save that overhead by checking `is_contiguous` before the `contiguous` call.
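
The pattern amounts to something like this (a sketch, not the exact diff):

```
import torch

def maybe_contiguous(t: torch.Tensor) -> torch.Tensor:
    # skip the call entirely when the tensor is already contiguous,
    # saving per-tangent Python/dispatch overhead
    return t if t.is_contiguous() else t.contiguous()
```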

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118730
Approved by: https://github.com/thiagocrepaldi, https://github.com/ezyang
2024-03-01 18:57:18 +00:00
0b18ed1c47 [FSDP] Added warning about unsupported double backwards (#120926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120926
Approved by: https://github.com/Skylion007
2024-03-01 18:40:30 +00:00
f01a23d01b Don't aggressively rewrite asserts for symbolic expressions (#120564)
Fixes: https://github.com/pytorch/pytorch/issues/118417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120564
Approved by: https://github.com/ezyang
2024-03-01 17:46:36 +00:00
c844b377fa [dynamo] Reorder logs (#116106)
Currently when there is a print/warning in the graph, dynamo graph breaks causing export to fail. However export would like to just skip over these print/warning calls: https://github.com/pytorch/pytorch/issues/113792.

Additionally there's a torch.compile feature request to "reorder prints" so that instead of graph breaking when hitting prints/logging, we can skip over these prints to create larger compiled graphs, and then print the results out after those compiled graphs: https://github.com/pytorch/pytorch/issues/93739. This PR also adds the `reorderable_logging_functions` config for users to register logging functions to be reordered (like `print` or a custom logging function). Printout of the bytecode after reordering the prints looks like the following: P914736600
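
A minimal sketch of registering a logging function via the new config (the exact config surface is assumed from the description above, so treat it as illustrative):

```
import torch
import torch._dynamo.config as dynamo_config

# register print so dynamo reorders it after the compiled region
# instead of graph-breaking on it (assumed config name from the description)
dynamo_config.reorderable_logging_functions.add(print)

@torch.compile(backend="eager")
def fn(x):
    x = x + 1
    print("intermediate:", x)   # expected to run after the compiled graph
    return x * 2

fn(torch.randn(3))
```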

There are some limitations to the printing right now:
* You can only register logging functions, not methods
* Inputs to the logging functions can only be tensors, constants, and format strings
* Inputs to the logging functions which will later be mutated in-place will not be printed correctly

TODO: Add the following tests
* print function with argument of nested data structure;
* print function with argument of nested data structure being updated inside of compile region (this would test if we handle side effect correctly);
* custom defined logging functions with nn.Module or nn.Module attribute arguments;
* custom defined logging functions with submodule input/output as arguments (we need to handle the mapping and fused-out value);
* custom defined logging functions with tensor argument and mutation inside of the function (TBD: this may increase memory usage);

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116106
Approved by: https://github.com/yanboliang
2024-03-01 17:04:24 +00:00
9fc56f8209 Exclude operators that produce unbacked symbols (#120917)
Unbacked symbols vary at runtime which means they are not CUDA
graphable.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120917
Approved by: https://github.com/eellison
2024-03-01 16:56:08 +00:00
ea7149aa22 Replace TTIR string parsing with structured MLIR walk in Triton kernel mutation analysis (#120476)
Summary: Previously, we relied on the `lark`-based parsing of the string TTIR representation dumped by the Triton compiler. However, this has proven to be brittle in the face of changes both in the user-written Triton kernel code and in the Triton compiler code.

In this PR, we add an alternative way of mining the function information from the TTIR based on walking the tree of structured MLIR entities. To this end, we rely on the MLIR bindings exposed by `libtriton` (related PR in Triton: https://github.com/openai/triton/pull/3191).

For now, we introduce gating based on whether `ttir_module.hasattr("walk")`. This will allow switching to the newly introduced TTIR analysis approach only when the new MLIR bindings (including that of `ModuleOp::walk`) become available in the Triton pin. Before then, we'll keep using the old string TTIR parsing-based approach.

Test Plan: The new functionality was tested locally with the latest Triton version compiled with the added new MLIR bindings: all Triton kernel mutation tests in `test_triton_kernels.py` are passing. Here we rely on the CI for regression testing, but it won't cover the new functionality due to gating.

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120476
Approved by: https://github.com/oulgen
2024-03-01 16:20:24 +00:00
8861507ba3 Fix guard for SUPPORTED_NODES (#120798)
The special-case code for handling SUPPORTED_NODES was producing a guard that looked like:
```
"G['torch'].utils._pytree.SUPPORTED_NODES[<class '__main__.CausalLMOutputWithPast'>].type"
```
resulting in a eval error trying to evaluate the guard.

This change adds a new source type (`ClassSource`) which is given a class type (in this case `CausalLMOutputWithPast`) and attempts to fetch it from its defining module.  It then uses that to build the `SUPPORTED_NODES` guards instead of referring to the type directly.

Also added a unit test which fails before this change and passes after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120798
Approved by: https://github.com/anijain2305
2024-03-01 16:03:21 +00:00
b8e6ca6f76 Add sparse compressed meta tensor support (#120707)
As in the title.

Replaces https://github.com/pytorch/pytorch/pull/120498 and https://github.com/pytorch/pytorch/pull/120562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120707
Approved by: https://github.com/ezyang
ghstack dependencies: #120703
2024-03-01 13:28:47 +00:00
70d4d109f2 Make SparseCsr a functionality dispatch key (#120703)
As in the title.

To enable meta and fake tensor support for sparse compressed tensors in compliance with the meta/fake tensor support for sparse COO tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120703
Approved by: https://github.com/ezyang
2024-03-01 13:28:46 +00:00
eee040c939 expose nested header to wheel (#120603)
Expose the nested tensor headers in the PyTorch wheel, helping developers reuse the nested-tensor-related util headers from the wheel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120603
Approved by: https://github.com/jbschlosser, https://github.com/gujinghui
2024-03-01 09:59:45 +00:00
c646030cd2 Support higher order op functionalization in predispatch IR (#115314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115314
Approved by: https://github.com/bdhirsh
2024-03-01 09:13:47 +00:00
82b356193d Move VariableInfo into its own file to avoid circular dependency (#120732)
VariableInfo is used by both `custom_function.h` (in a templated class) and `compiled_autograd.h` (in a class with some templated methods). Another way could have been to make a `compiled_autograd.cpp` and forward declare VariableInfo, but this VariableInfo was also being used in other nodes like PyNode so it felt cleaner to do it this way.

Differential Revision: [D54287007](https://our.internmc.facebook.com/intern/diff/D54287007)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120732
Approved by: https://github.com/jansel
2024-03-01 08:48:13 +00:00
8c2e569928 [PT2D] Make the speedup benchmark works with DDP + CompiledAutograd (#120454)
With DDP + CompiledAutograd, we could not use the same parallelized model to do the test. This PR copies the model.

Differential Revision: [D54094257](https://our.internmc.facebook.com/intern/diff/D54094257/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120454
Approved by: https://github.com/yf225, https://github.com/xmfan
2024-03-01 08:35:22 +00:00
cyy
77ef9d4022 Add verbose parameter to torch.hub.list (#120717)
This PR adds ```verbose``` to ```torch.hub.list``` to let users disable extraneous output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120717
Approved by: https://github.com/ezyang
2024-03-01 07:39:48 +00:00
63b259492a Revert "[dynamo] Reorder logs (#116106)"
This reverts commit c5472628ff9dedff57722941ac1b2a50af880197.

Reverted https://github.com/pytorch/pytorch/pull/116106 on behalf of https://github.com/clee2000 due to landrace with 342e7929b804ec56121e82e92d6a199b549c38b1, which removed the import for warnings.  Should be an easy fix after rebase c5472628ff ([comment](https://github.com/pytorch/pytorch/pull/116106#issuecomment-1972586180))
2024-03-01 06:25:46 +00:00
eqy
86e6497c6f [Inductor][cuDNN] Disable tf32 in test_mutate_view_for_conv_output (#120953)
Another disablement of TF32 to unblock #120642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120953
Approved by: https://github.com/Skylion007
2024-03-01 05:51:29 +00:00
6ed26392b3 Update xfails for scaled_dot_product_attention (#120928)
Update xfails for test_dispatch_meta_outplace and test_dispatch_symbolic_meta_outplace.

These tests are sometimes expected to fail, because we moved the registrations from meta_registrations.py to fake_impls.py. AFAIK, this is okay because fake tensors will still work because we have special handling in fake_impls.py. The purpose of this PR is to update the xfails so they are correctly xfailing the failing tests.

Previously, I set these to xfail only for bfloat16, float16, and float32, but not float64; but this isn't really correct. Explanation below:

Scaled dot product attention (SDPA) has multiple implementations, including efficient_attention, flash_attention, and unfused attention. flash_attention supports fp16, bf16. efficient_attention supports fp16, bf16, fp32. unfused attention supports all dtypes.

efficient_attention and flash_attention implementations will fail the meta tests, but the unfused attention will not. Certain platforms may support none, both, or one of efficient_attention and flash_attention. Unfused attention will pass because it falls back to constituent ops which have registered meta kernels.

So: on CUDA, we have all 3 available: in bf16, fp16, fp32, we'll select one of the fused implementations (where this test will fail).
On ROCM, we don't have efficient_attention: so fp32 will use the unfused implementation, where the test will pass.

Fix in this PR:
* If any fused impl is available, then xfail float16 & bfloat16
* If efficient_attention is available, then also xfail float32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120928
Approved by: https://github.com/drisspg
2024-03-01 05:16:11 +00:00
2a08a51738 Add _assert_scalar and teach Inductor to codegen it (#114148)
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.

So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.

I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
ghstack dependencies: #120800
2024-03-01 05:06:36 +00:00
77aea289ae Add test to check that COW inputs are not materialized (#119507)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119507
Approved by: https://github.com/ezyang
ghstack dependencies: #120455
2024-03-01 05:05:28 +00:00
13a54ce279 Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-03-01 05:05:28 +00:00
d053dcfa69 delete useless cast_outputs call in unary_op_impl_float_out (#120486)
The cast_outputs function is only used for the CPU device, and it is already called in cpu_xxx_vec functions like cpu_kernel_vec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120486
Approved by: https://github.com/ezyang
2024-03-01 04:54:11 +00:00
c5472628ff [dynamo] Reorder logs (#116106)
Currently when there is a print/warning in the graph, dynamo graph breaks causing export to fail. However export would like to just skip over these print/warning calls: https://github.com/pytorch/pytorch/issues/113792.

Additionally there's a torch.compile feature request to "reorder prints" so that instead of graph breaking when hitting prints/logging, we can skip over these prints to create larger compiled graphs, and then print the results out after those compiled graphs: https://github.com/pytorch/pytorch/issues/93739. This PR also adds the `reorderable_logging_functions` config for users to register logging functions to be reordered (like `print` or a custom logging function). Printout of the bytecode after reordering the prints looks like the following: P914736600

There are some limitations to the printing right now:
* You can only register logging functions, not methods
* Inputs to the logging functions can only be tensors, constants, and format strings
* Inputs to the logging functions which will later be mutated in-place will not be printed correctly

TODO: Add the following tests
* print function with argument of nested data structure;
* print function with argument of nested data structure being updated inside of compile region (this would test if we handle side effect correctly);
* custom defined logging functions with nn.Module or nn.Module attribute arguments;
* custom defined logging functions with submodule input/output as arguments (we need to handle the mapping and fused-out value);
* custom defined logging functions with tensor argument and mutation inside of the function (TBD: this may increase memory usage);

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116106
Approved by: https://github.com/yanboliang
2024-03-01 04:48:44 +00:00
02a410ee12 Enable TORCH_TRACE by default in all Tupperware like environments (#120915)
Summary:
This is a reimplemented version of the FB specific code in https://www.internalfb.com/diff/D54230697

The new strategy is that we unconditionally install an FB handler to trace_log logger (and always set level to DEBUG). When the first log message is emitted, we check the JK/filesystem to see if we should actually do logging. If we decide we don't do logging, we remove the handler from trace_log and are done.

build_only[github-export-checks,executorch,pytorch_benchmark,pytorch_quantization,pytorch_distributed,pytorch_distributed_gpu,pytorch_dynamo_inductor,pytorch_functorch,pytorch_fx2trt,pytorch_diff_train_tests_ads,glow_fb_pytorch_tests,training_platform,training_platform_compatibility,training_toolkit_applications,training_toolkit_examples,training_toolkit_model_optimization,dper3_pytorch,xplat_caffe2,pytorch_dev,android-pytorch-instrumentation-tests,smartpytorchgithub_first_try_merge,frl-target-determinator,f6-buck,training_platform_for_github,sigmoid_cpu,sigmoid_gpu,aiplatform_modelprocessing_for_github,accelerators_workloads_models_slimdsnn,ae_aotinductor_benchmark_test,aps_,aps_deterministic_ne_tests,dper_lib_silvertorch,torchrec,torchrec_fb,deeplearning_aot_inductor]

Test Plan:
sandcastle

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//torchrec/inference/tests:test_single_gpu_executor -- --exact 'torchrec/inference/tests:test_single_gpu_executor - TorchDeployGPUTest.NestedModelSingleGPU'
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/dynamic_stats/tests:accumulators_test -- --exact 'dper_lib/silvertorch/modules/dynamic_stats/tests:accumulators_test - test_global_fixed_interval_accumulator (dper_lib.silvertorch.modules.dynamic_stats.tests.accumulators_test.GlobalFixedIntervalUnivalentAcculumatorTest)'
```

Also running a test flow with/without JK enabled

Differential Revision: D54275086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120915
Approved by: https://github.com/yanboliang
2024-03-01 04:47:13 +00:00
518a23bb03 support bool as Scalar Type in TorchScript (#113835)
Fixes #112402
Fixes #75465

From the description in #75465, the bool type should be a subtype of int, and `register_prim_ops.cpp` already supports converting from bool to int or float.
So this patch fixes bool as a Scalar in TorchScript.
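
An illustrative sketch of the kind of script this change targets (whether this exact snippet scripts cleanly depends on overload resolution, so treat it as an assumption rather than the PR's test case):

```
import torch

@torch.jit.script
def add_flag(x: torch.Tensor, flag: bool) -> torch.Tensor:
    # bool used where a Scalar/Number is expected; with this change it is
    # implicitly converted, mirroring Python where bool subtypes int
    return x + flag
```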

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113835
Approved by: https://github.com/davidberard98
2024-03-01 04:20:15 +00:00
2e84d01d05 [executorch hash update] update the pinned executorch hash (#120747)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120747
Approved by: https://github.com/pytorchbot
2024-03-01 04:02:09 +00:00
65d568680c Revert "[Dynamo] Fix inspect.getattr_static doesn't work well for torch.utils._cxx_pytree.PyTreeSpec (#120812)"
This reverts commit 1104e0798c8206e0226f2d68f6bb065645e6276f.

Reverted https://github.com/pytorch/pytorch/pull/120812 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the XLA failure test_simple_model look legit 1104e0798c ([comment](https://github.com/pytorch/pytorch/pull/120812#issuecomment-1972460001))
2024-03-01 03:53:27 +00:00
e49f31ca02 [onnxrt, dynamo] Enable custom ONNX model transforms in onnxrt dynamo backend (#120854)
A global transform list is created. All backend instances call the transform functions in that list sequentially to modify the exported ONNX model before sending the model to the ORT session. For example, `record_onnx_model_transform` below is a no-op transform that only records the ONNX graphs sent to ONNX Runtime.

```python
        recorded_models = []

        def record_onnx_model_transform(onnx_model):
            # Record the ONNX model seen by the transform.
            recorded_models.append(onnx_model)

        from torch.onnx import (
            register_backend_graph_transform,
            unregister_backend_graph_transform,
        )
        # Register so that `onnxrt` backend calls it to modify ONNX model.
        register_backend_graph_transform(record_onnx_model_transform)

        def example_model(x: torch.Tensor):
            y = torch.sigmoid(x)
            z = x + y
            return z

        # During the compilation, the exported ONNX model will be
        # modified by calling `record_onnx_model_transform` before
        # sending the model to `onnxruntime.InferenceSession`.
        compiled_model = torch.compile(
            example_model,
            backend="onnxrt",
            dynamic=True,
        )
        # Now, `recorded_models` should contain one `onnx.ModelProto` representing
        # `example_model(x: torch.Tensor)`.

        # Remove the pass when not needed. If `record_onnx_model_transform` is not
        # removed, it will be applied to all models compiled by `backend="onnxrt"`.
        unregister_backend_graph_transform(record_onnx_model_transform)
```

In the future, we plan to use this mechanism to register all graph transforms, such as graph fusion and general ONNX optimization, for `onnxrt`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120854
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2024-03-01 03:24:17 +00:00
67c97a9aad fix the scale dot attention doc (#120859)
Fixes #120810

The code verifies the broadcast behavior (from the issue),
```
import torch

B = 3
S = 5
L = 7
E = 16
EV = 32
additional_batches = [2, 4]

query_shape = [B] + additional_batches + [L, E]
key_shape = [B] + additional_batches + [S, E]
value_shape = [B] + additional_batches + [S, EV]

query = torch.rand(*query_shape)
key = torch.rand(*key_shape)
value = torch.rand(*value_shape)
mask = torch.zeros((1, 1, S), dtype=torch.bool)
mask[:, :, S // 2 :] = True

# query.to("cuda")
# key.to("cuda")
# value.to("cuda")
# mask.to("cuda")

attention = torch.nn.functional.scaled_dot_product_attention(query, key, value, mask)

print(f"query shape = {query.shape}")
print(f"key shape = {key.shape}")
print(f"value shape = {value.shape}")
print(f"mask shape = {mask.shape}")
print(f"attention shape = {attention.shape}")

#in both CPU and cuda, output shape is:
# query shape = torch.Size([3, 2, 4, 7, 16])
# key shape = torch.Size([3, 2, 4, 5, 16])
# value shape = torch.Size([3, 2, 4, 5, 32])
# mask shape = torch.Size([1, 1, 5])
# attention shape = torch.Size([3, 2, 4, 7, 32])

## test add is broadcasting mask to query@(key.mT)
res = query@(key.mT)
print(res.shape)
res2 = torch.add(res, mask)
print(res2.shape)
```

At code level, in the default backend,
ab38354887/aten/src/ATen/native/transformers/attention.cpp (L735)

the add operation is broadcasting the `attn_mask` to `auto attn = at::matmul(query, key.transpose(-2, -1) * scaling_factor);`

- Changed the doc in [torch/nn/functional.py](https://github.com/pytorch/pytorch/pull/120859/files#diff-c358c214f663ba0c8b9c6846fbe0042fa29494cf02fe4714a17dcd0d268b035b).
- Also fixed a few inconsistencies in the cpp comments.

@mikaylagawarecki

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120859
Approved by: https://github.com/drisspg
2024-03-01 02:54:08 +00:00
b35551f357 Ban reset_to_zero argument to triton.autotune in user defined kernels (#120938)
Fixes #120802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120938
Approved by: https://github.com/chenyang78, https://github.com/jansel
2024-03-01 02:37:24 +00:00
06f8af30fa Change FakeTensor serialization to consider only an _active_ FakeTensor mode (#120848)
Summary: https://github.com/pytorch/pytorch/pull/108186 made some changes related to FakeTensor serialization such that saving and loading a tensor will give us a meta tensor, even if FakeTensor mode is not enabled. This means we can't properly save and load Tensors as part of Fx graph caching. This PR changes the logic to check if there's an _active_ FakeTensor mode.

Test Plan:
* New unit tests
* Validated unit tests introduced in https://github.com/pytorch/pytorch/pull/108186 still pass
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120848
Approved by: https://github.com/eellison, https://github.com/thiagocrepaldi
2024-03-01 02:37:21 +00:00
e3dbd194f4 [dynamo] Support module backwards hooks (#120685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120685
Approved by: https://github.com/yanboliang, https://github.com/xmfan
2024-03-01 02:24:26 +00:00
9b2c35b4fe [dynamo] Fix convolution meta kernel when input channel is 0 (#120944)
Addresses https://github.com/pytorch/pytorch/issues/118797

Adding in special channel handling logic from eager (set output channels to 0 when input channels are 0):
67d3e4f2a2/aten/src/ATen/native/Convolution.cpp (L1400-L1403)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120944
Approved by: https://github.com/zou3519
2024-03-01 01:18:21 +00:00
d534a49767 Reinplace auto_functionalized (#120829)
Fixes https://github.com/pytorch/pytorch/issues/120441

We follow how triton_kernel_wrapper_functional gets re-inplaced:
- If we see auto_functionalized, then first we compute what inputs we
  actually need to clone ("tensors_to_clone") and fixup the graph. This happens in
  `reinplace_and_refine_tensors_to_clone`, which I have refactored out
  of the triton_kernel_wrapper_functional reinplacing code.
- Later on, after the reinplacing pass, we have a decomposition pass for
  auto_functionalized. In that decomposition pass, we make use of the
  "tensor_to_clone" info and only clone those inputs in the
  decomposition.
- We shepherd "tensor_to_clone" from the first step to the second step
  by setting the .meta field on the auto_functionalized node.

Test Plan:
- existing tests
- tested locally by reading the output of TORCH_LOGS="post_grad_graphs"
- added assertExpectedInline tests for the post_grad_graphs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120829
Approved by: https://github.com/oulgen
2024-03-01 00:55:19 +00:00
791f8ef350 [Composable APIs] Add composable API fully_shard deprecation warning (#120929)
`fully_shard`(https://github.com/pytorch/pytorch/blob/main/torch/distributed/_composable/fsdp/fully_shard.py) will be used by new FSDP2 and we want to add a deprecation warning to the existing composable API's `fully_shard`(https://github.com/pytorch/pytorch/blob/main/torch/distributed/_composable/fully_shard.py#L40).

Planned release schedule is as follows https://dev-discuss.pytorch.org/t/release-cadence-for-year-2023-2024/1557:

Minor Version | Release branch cut | Release date | First patch release date | Second patch release date
-- | -- | -- | -- | --
2.3 | Mar 2024 | Apr 2024 | May 2024 | Jun 2024
2.4 | May 2024 | Jul 2024 | Aug 2024 | Sep 2024
2.5 | Aug 2024 | Oct 2024 | Nov 2024 | Dec 2024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120929
Approved by: https://github.com/awgu
2024-03-01 00:55:16 +00:00
fd35aafc26 Teach dynamo about vjp (#119405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119405
Approved by: https://github.com/zou3519
ghstack dependencies: #118407
2024-03-01 00:21:10 +00:00
9d5dea7812 [DCP] Adds storage reader and planner classes for online loading/sharding of models in torch.save format (#119816)
as title

Differential Revision: [D53718041](https://our.internmc.facebook.com/intern/diff/D53718041/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119816
Approved by: https://github.com/fegin
2024-03-01 00:21:05 +00:00
33da8d5c12 Revert "Fix guard for SUPPORTED_NODES (#120798)"
This reverts commit 1b8bb027f676aa8c4260a3f6b9a5c98c37d25dc7.

Reverted https://github.com/pytorch/pytorch/pull/120798 on behalf of https://github.com/kit1980 due to the new test fails internally, see D54343456 ([comment](https://github.com/pytorch/pytorch/pull/120798#issuecomment-1972134227))
2024-02-29 23:19:22 +00:00
7ebfe21724 Fix nll loss dynamo failure (#120805)
Fix for https://github.com/pytorch/pytorch/issues/119791 Part of dynamo bug bash
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120805
Approved by: https://github.com/Skylion007, https://github.com/zou3519, https://github.com/malfet
2024-02-29 22:34:49 +00:00
d03b11ad5b Pass inductor strides forward in ddp optimizer (#120523)
# Note: Returning Fake Tensors on First AOT Autograd Call
#
# Inductor will optimize strides of outputs when it deems it profitable.
# For instance, converting to channels last. When we split the graph here
# into multiple inductor compilations, we need to make sure that the
# output strides of one compilation are appropriately passed to the subsequent
# compilations. However, the mapping from inductor output to dynamo output
# is non-trivial due to aot_autograd's deduping, de-aliasing, mutation, re-writing,
# subclass handling, etc. In order to replay all this logic we set a flag such that
# the first invocation of inductor in aot_autograd will return Fake Tensors with
# appropriate strides. Then, all of aot autograd's runtime logic is replayed.
# This gives us the appropriately strided outputs here, which will reflect runtime strides.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120523
Approved by: https://github.com/yf225, https://github.com/bdhirsh
2024-02-29 22:25:00 +00:00
772db2a3ae Fix handling of torch.return_types in dynamo (#120826)
Handle quasi-namedtuples as a special case in dynamo
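For reference, an example of such a quasi-namedtuple:

```python
# Ops like torch.max(x, dim=...) return a structured result under
# torch.return_types: it has named fields like a namedtuple, but its type is a
# C-level structseq rather than a plain namedtuple subclass, which is why
# dynamo needs a special case.
import torch

out = torch.max(torch.tensor([[1.0, 3.0], [2.0, 0.0]]), dim=1)
print(type(out))                 # <class 'torch.return_types.max'>
print(out.values, out.indices)   # named-field access
```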

Fixes #120651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120826
Approved by: https://github.com/anijain2305
2024-02-29 22:11:35 +00:00
da559c98e3 Fix isin decomp and add python meta registration (#120821)
Fixes #119792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120821
Approved by: https://github.com/malfet, https://github.com/peterbell10
2024-02-29 22:08:50 +00:00
76d3a6bb4a Revert "[C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)"
This reverts commit 381a7ad3f1cd38bf8e814ae9d275f101a2136139.

Reverted https://github.com/pytorch/pytorch/pull/120745 on behalf of https://github.com/kit1980 due to The new test fails internally, see D54343421 ([comment](https://github.com/pytorch/pytorch/pull/120745#issuecomment-1972047106))
2024-02-29 22:06:13 +00:00
e7039e3a0b [dynamo][easy] Dynamo test changes (#120927)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120927
Approved by: https://github.com/yanboliang
ghstack dependencies: #120864, #120730
2024-02-29 22:05:41 +00:00
39c092d242 Skip semi-structured-sparse on windows (#120807)
# Summary

We can see that in this job on the other PR: https://github.com/pytorch/pytorch/actions/runs/8086597674/job/22096699337?pr=120641#step:11:11272

building the SemiStructuredSparse kernel is erroring on the Windows machines, so I think we should land this.

### Details

Introduced in here:  https://github.com/pytorch/pytorch/pull/120434

We don't compile this kernel for Windows, so we should have skipped this test.

There is another PR: https://github.com/pytorch/pytorch/pull/120641
which removes this skip for Windows; if that is green we should land it instead, otherwise we should keep skipping the Windows tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120807
Approved by: https://github.com/alexsamardzic, https://github.com/jcaip
2024-02-29 21:48:52 +00:00
1a1f58ffbe [rocm][cmake] retrieve rocm location from ROCM_SOURCE_DIR env if specified (#120898)
This PR allows us to build PyTorch with a ROCm installation that is not in the default location, i.e. /opt/rocm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120898
Approved by: https://github.com/jianyuh
2024-02-29 21:32:45 +00:00
b2dddcfe27 [FSDP2][DCP][DSD] Add test to ensure FSDP2 model/optim state_dict work after a full training loop (#120871)
This PR adds tests to verify that distributed state dict works properly for FSDP2's model and optimizer state_dict after a full training loop.

We test the combinations of these options on an evenly sharded model.
```
{
    "reshard_after_forward": [True, False],
    "optimizer_class": [torch.optim.Adam],
    "compile_model": [True, False],
},
```

Follow-up: 1. Add a test for an unevenly sharded model. 2. Add a test that includes `torch.optim.AdamW` (there seem to be some gaps currently, still investigating)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120871
Approved by: https://github.com/fegin
2024-02-29 21:24:00 +00:00
67d3e4f2a2 [TorchElastic] Refactoring to support non-default logging strategy (#120691)
Summary:
Pulling the logging parameters out into logging specs that can be overridden (follow-up changes will add a possible override mechanism).

Why?
Right now the logging approach is quite rigid:
- Requires the log directory to exist and not be empty
- Will create a tempdir otherwise
- Creates a subdir for a run
- Creates a subdir for each attempt
- Creates files named stdout.log, stderr.log, error.json

In some instances, users would like to customize this behavior, including the file names, based on context. We already have a mechanism to template the multiplexed teed output prefix.

With these changes, users can create a custom log spec that uses env variables to change the behavior.

Notes:
Made `LaunchConf.logs_specs` an optional field that will be bound to a `DefaultLogsSpecs` instance. There are a large number of clients (code) that use the API directly without going through the torchrun API. For those cases, we have to explicitly pass a LogSpecs implementation if we would like to override it. For regular torchrun users, we can use the pluggable approach proposed in the follow-up change.

Test Plan: CI + unit tests

Differential Revision: D54176265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691
Approved by: https://github.com/ezyang
2024-02-29 20:59:17 +00:00
277bc97709 [FSDP2][ez] Combined communication test files (#120904)
This just combines the unit tests for the collective ops for copy-in/all-gather/copy-out and copy-in/reduce-scatter/view-out with the unit tests for the communication schedule. I mainly wanted to avoid having too many test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120904
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
ghstack dependencies: #120659
2024-02-29 20:36:04 +00:00
0b924d7cde Revert "[inductor] Optimize welford reduction (#120330)"
This reverts commit 7eb7ac815f0247a62b621897cea95ec4ca56d52e.

Reverted https://github.com/pytorch/pytorch/pull/120330 on behalf of https://github.com/kit1980 due to Broke internal tests, see D54230858 ([comment](https://github.com/pytorch/pytorch/pull/120330#issuecomment-1971878323))
2024-02-29 20:12:50 +00:00
0a7666801d SymIntify prod_backward (#120776)
Fixes https://github.com/pytorch/pytorch/issues/120608

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120776
Approved by: https://github.com/albanD
2024-02-29 20:05:22 +00:00
313abcdba2 [c10d] fix the unwanted reason (#120863)
Summary:
Addressing #120849. Currently c10d treats a reason as a failure, hence giving some unwanted false
positive errors. This is a quick fix, but we need to revisit the error
handling logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120863
Approved by: https://github.com/kwen2501
2024-02-29 19:58:11 +00:00
f94933ed42 Refine value ranges on inequalities (#120800)
This is basically done the obvious way. For better or worse, I jammed this into what used to be `_maybe_guard_eq` but now is `_maybe_guard_rel`. I was careful to test all the off by one conditions, and each permutation. Let me know if you think I missed anything. Importantly, this now works for unbacked SymInts.

While testing, I noticed we are silently duck sizing all symbolic variables in `test_dynamic_shapes.py`. This may or may not be covering up bugs.

Along the way, I had to fix a bug in export constraints, where we weren't checking that the final var_to_range was consistent with what the user requested at top level.

After I implemented all this, I realized that applying this to non-unbacked SymInts was duplicative with @ysiraichi's previous work on https://github.com/pytorch/pytorch/pull/97963 . The upside is I now understand what Yukio was trying to do in the original PR, and I think my new logic is simpler and less error prone. In Yukio's earlier diff, Yukio tried very hard to avoid changing what guards we actually issue (since this would cause tests to wobble). Thus, when he refined a range, he also saved the guard that actually caused the range to refine. In this PR, I don't bother saving these guards; instead I just tighten var_to_range directly and rely on generating guards on this to be correct. The key insight is that if I assert `x < y`, it's always safe to emit (potentially) more restrictive range guards, because this won't invalidate our guards, it will just make them a little too strong (but actually, I think we are precise along the way.) If these guards make it unnecessary to test `x < y`, because now the ranges for x and y are disjoint, this is fine, we've subsumed the x < y guard and can just not bother testing it. If I've gotten it right, TV will agree with me.
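A worked sketch of the refinement rule described above (my own illustration, not the PR's code):

```python
# Asserting x < y lets us tighten both ranges; if the refined ranges become
# disjoint, the x < y guard is subsumed by the (stronger) range guards.
from dataclasses import dataclass

@dataclass
class ValueRange:
    lo: int
    hi: int  # inclusive bounds

def refine_lt(x: ValueRange, y: ValueRange) -> tuple[ValueRange, ValueRange]:
    # x < y  implies  x <= max(y) - 1  and  y >= min(x) + 1
    return (
        ValueRange(x.lo, min(x.hi, y.hi - 1)),
        ValueRange(max(y.lo, x.lo + 1), y.hi),
    )

x, y = ValueRange(0, 100), ValueRange(0, 10)
print(refine_lt(x, y))  # x becomes [0, 9], y becomes [1, 10]
```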

In fact, I had a bug in this PR which TV didn't catch, which is that when we have a recorded var_to_guards for a symbol, we unconditionally never generate the range guard for it, even if the var_to_guards is potentially inconsistent with var_to_range (because var_to_range was updated separately). With var_to_guards removed, I don't have to worry about this inconsistency.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120800
Approved by: https://github.com/Skylion007, https://github.com/avikchaudhuri, https://github.com/ysiraichi
2024-02-29 19:41:51 +00:00
81c4c0dda2 [functional collecitve] don't import torchdynamo when running torchdeploy (#120900)
Summary: Importing torchdynamo in `functional_collective_impl.py` seems to break loading of torchdeploy models.

Test Plan: CI

Differential Revision: D54355011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120900
Approved by: https://github.com/fegin
2024-02-29 19:20:54 +00:00
f7a809c96a fix dupe deprecated warning in dynamo export (#120896)
Summary:
When we convert `dynamic_shapes` to `constraints` and pass them to `_dynamo.export`, we shouldn't give a deprecation warning. Such conversion happens when calling `torch.export.export`, for example, but it can also happen when calling `capture_pre_autograd_graph` (which itself has this deprecation warning when `constraints` are passed directly as well).

Since `_log_export_usage` is an indicator of a top-level call (it is `True` by default but set to `False`, or at least passed through, by callers), we can (ab)use it to indicate when to give this deprecation warning.

Test Plan: none

Differential Revision: D54350172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120896
Approved by: https://github.com/BoyuanFeng, https://github.com/zhxchen17
2024-02-29 18:57:42 +00:00
0290fe65bd Test TD (test removal) on crossref (#119426)
The current threshold is to cut the bottom 75% of test files, which results in 13 min of tests getting cut.
test_ops, functorch/test_ops, test_decomp, and other really long-running test files are not getting cut and make the top 25% take really long (still 90+ min).

The original plan was to test on ROCm, but I'm worried about queuing given that cutting 75% of test files only cuts off 13 min. Crossref is rarely referenced by others, and people keep talking about getting rid of it, so it's a good alternative.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119426
Approved by: https://github.com/huydhn
2024-02-29 18:53:43 +00:00
1458f1de66 Revert "Update flash_attention kernel from 2.3.6 to 2.5.5 (#118935)"
This reverts commit 4b7a521856ca5fb0fc28edd18591f77fff5a6ba1.

Reverted https://github.com/pytorch/pytorch/pull/118935 on behalf of https://github.com/atalman due to Significantly increases build time. Optimization is needed ([comment](https://github.com/pytorch/pytorch/pull/118935#issuecomment-1971723284))
2024-02-29 18:42:21 +00:00
96eff4ef70 [inductor max autotune] Detailed autotuning result logs ( machine-readable ) (#119004)
This diff introduces a new separate logging of autotuning results,
with the intention of making the results analyzable, specifically
those for the new experimental Cutlass backend.

Results are logged as text files with one JSON document corresponding to a single benchmark result per line.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119004
Approved by: https://github.com/jansel
ghstack dependencies: #120620
2024-02-29 18:24:13 +00:00
a911eb74ae [dynamo] Graph break when faking named tensors (#120779)
Fixes #120644
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120779
Approved by: https://github.com/zou3519
2024-02-29 18:22:15 +00:00
1104e0798c [Dynamo] Fix inspect.getattr_static doesn't work well for torch.utils._cxx_pytree.PyTreeSpec (#120812)
Fixes #118793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120812
Approved by: https://github.com/zou3519
2024-02-29 18:19:14 +00:00
ca679384c2 [rocm][cmake] correctly check the ROCM_SOURCE_DIR environment (#120858)
The existing use of "if(NOT ENV{ROCM_SOURCE_DIR})" seems to be
not working correctly, e.g.

```
$ cmake --version
cmake version 3.26.4

$ cat CMakeList.txt
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(FOO)

if(NOT ENV{ROCM_SOURCE_DIR})
  message(INFO ": not defined 1")
else()
  message(INFO ": defined 1: $ENV{ROCM_SOURCE_DIR}")
endif()

if("$ENV{ROCM_SOURCE_DIR}" STREQUAL "")
  message(INFO ": not defined 2")
else()
  message(INFO ": defined 2: $ENV{ROCM_SOURCE_DIR}")
endif()
$ ROCM_SOURCE_DIR=/tmp cmake .
INFO: not defined 1
INFO: defined 2: /tmp
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/yangche/tmp/tmp
```

This PR replace it with a STREQUAL check. Note that the choice
of STREQUAL is to avoid cases like:

```
$ ROCM_SOURCE_DIR= cmake .
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120858
Approved by: https://github.com/jianyuh, https://github.com/jeffdaily
2024-02-29 17:49:00 +00:00
9e016debeb [dynamo] Fix inference_mode context variable (#120830)
<idk what im doing>
Fixes #120646

The module for torch.inference_mode should be torch

The input to `create` is a bool (mode?) and `_enter_inference_mode` expects a bool but [BlockStackEntry](50073248ed/torch/_dynamo/symbolic_convert.py (L206)) expects `target_values` to be a list?
[inference_mode](50073248ed/torch/autograd/grad_mode.py (L205))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120830
Approved by: https://github.com/zou3519, https://github.com/anijain2305, https://github.com/tugsbayasgalan
2024-02-29 17:10:06 +00:00
98c4ba683e [EZ][BE] Fix ResourceWarning (#120886)
By closing the file handle

Fixes
```
/Users/nshulga/git/pytorch/pytorch/test/quantization/core/test_docs.py:132: ResourceWarning: unclosed file <_io.TextIOWrapper name='/Users/nshulga/git/pytorch/pytorch/docs/source/quantization.rst' mode='r' encoding='UTF-8'>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120886
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/Skylion007
2024-02-29 17:07:39 +00:00
664dd61b29 Add some more symbolic shapes related files to ciflow/inductor (#120887)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120887
Approved by: https://github.com/janeyx99, https://github.com/malfet
2024-02-29 16:59:32 +00:00
558316b5f4 Emit grid wrapper inlined with the user defined triton kernel (#120824)
Fixes #120801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120824
Approved by: https://github.com/chenyang78, https://github.com/jansel
ghstack dependencies: #120809
2024-02-29 16:17:45 +00:00
84e2accd6c Make triton_meta be part of user defined triton kernel cache (#120809)
Tensors with different shapes will generate different triton meta (divisibility rules), we need this to be part of the cache key.
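An illustrative sketch of the idea (not Inductor's actual caching code):

```python
# Fold the per-shape triton_meta (e.g. divisibility hints) into the kernel
# cache key so kernels specialized for differently divisible shapes do not
# collide in the cache.
import hashlib
import json

def kernel_cache_key(kernel_source: str, triton_meta: dict) -> str:
    payload = json.dumps({"src": kernel_source, "triton_meta": triton_meta}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Same kernel source, different divisibility info -> different cache entries.
k1 = kernel_cache_key("...", {"divisible_by_16": [0, 1]})
k2 = kernel_cache_key("...", {"divisible_by_16": [0]})
assert k1 != k2
```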

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120809
Approved by: https://github.com/chenyang78, https://github.com/jansel
2024-02-29 16:17:45 +00:00
342e7929b8 [export] kill deprecated constraints API (#120860)
Summary:
Previously `export` would take `constraints` built with `dynamic_dim(...)`s. This has been deprecated for a while; one can now pass in a `dynamic_shapes` spec built with `Dim(...)`s.

Here we kill this deprecated API. Eventually this will lead to simplification of the underlying implementation, since the new `Dim`-based specs can map 1-1 with symbolic shapes concepts without going through indirect machinery of `dynamic_dim`-based constraints. It is expected that internal APIs like `_dynamo.export` and `_trace._export_to_torch_ir` will change when that happens.

Leaving `aot_compile` and `capture_pre_autograd_graph` entry points alone for now. This will eventually be updated anyway.

Test Plan: updated tests

Differential Revision: D54339703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120860
Approved by: https://github.com/suo, https://github.com/tugsbayasgalan
2024-02-29 16:15:50 +00:00
3cfed01228 [AOTI] Store OpOverload in ir.ExternKernel (#120629)
Summary: Currently the logic for filling in the default values for optional arguments is scattered in several places. By storing OpOverload in the base ExternKernel class, we can simplify codegen_kwargs; this is also a preparation step for enabling the torchgen-ed C shim. The default value filling logic for FallbackKernel can also be simplified, but that can come later.

Differential Revision: [D54258089](https://our.internmc.facebook.com/intern/diff/D54258089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120629
Approved by: https://github.com/chenyang78
ghstack dependencies: #119987, #120592
2024-02-29 15:51:33 +00:00
fa7241ed79 [AOTI] Change the cpp wrapper codegen for sdpa (#120592)
Summary: Switch codegen for sdpa to always point to v2 in the C shim. Since aoti_torch__scaled_dot_product_flash_attention_v2 was introduced a while ago, there shouldn't be any FC issue in production.

Differential Revision: [D54258090](https://our.internmc.facebook.com/intern/diff/D54258090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120592
Approved by: https://github.com/chenyang78
ghstack dependencies: #119987
2024-02-29 15:49:23 +00:00
52e3c78a43 [AOTI][refactor] Move a few util functions in atoi_torch (#119987)
Summary: Move these util functions from an anonymous namespace to a common header so that later torchgen-ed files can use them.

Differential Revision: [D54258088](https://our.internmc.facebook.com/intern/diff/D54258088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119987
Approved by: https://github.com/chenyang78
2024-02-29 15:46:47 +00:00
5b9e5f854b [profiler] Log process group id instead of backend id (#120475)
Summary:
https://github.com/pytorch/pytorch/pull/104373 introduced backend_id
> an unique ID for the actual backend object, this is also exposed in record_param_comms, so we can correlate these collectives with the right backend object.

However, it is inconvenient to correlate collectives with the backend id. Instead, using the pg id (uid) to correlate directly is a better solution.
This PR changes the ID information exposed in record_param_comms from backend_id to pg_id.

Differential Revision: D53558257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120475
Approved by: https://github.com/aaronenyeshi
2024-02-29 15:04:33 +00:00
576c0482a5 Remove hard numpy dependency from guards.py (#119519)
I'm not sure if this is the ideal behavior / best fix for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119519
Approved by: https://github.com/albanD
2024-02-29 14:37:33 +00:00
5db5049b34 Move TRITON_CONSTRAINT setting to common binary_populate_env.sh, BE - Cleanup unused build scripts (#120744)
1. This moves TRITON_CONSTRAINT to the common binary_populate_env.sh so that it is set for all wheels.
Tested in CI via the ``ciflow/binaries`` label. Please note that we only set this constraint when PYTORCH_EXTRA_INSTALL_REQUIREMENTS is set, and that variable is set for all the wheels that get uploaded to PyPI; hence the triton constraint needs to be set in the same place.
This is done for regular wheels and ROCm wheels separately, since ROCm wheels use a different triton package.

3. Cleanup legacy unused code
Test:
```
git grep setup_linux_system_environment.sh
```

Needs: https://github.com/pytorch/builder/pull/1712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120744
Approved by: https://github.com/huydhn
2024-02-29 14:25:34 +00:00
f988f649be [IntraNodeComm] accept P2P buffer size as constructor argument (#120856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120856
Approved by: https://github.com/wanchaol
ghstack dependencies: #120855
2024-02-29 11:43:52 +00:00
22b5548f5d [IntraNodeComm] refactor all_reduce variants as private methods (#120855)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120855
Approved by: https://github.com/Chillee, https://github.com/wanchaol
2024-02-29 11:43:52 +00:00
96793e0f10 [ROCm] enable scaled_gemm (#117822)
This enables scaled_gemm for ROCm using hipBLASLt. As of ROCm 6.0, HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER is not supported. A workaround is provided, performing the absmax operation on the output buffer, but this results in some loss of accuracy for the absmax result. For this reason the feature should be considered beta/preview.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117822
Approved by: https://github.com/jianyuh, https://github.com/xw285cornell
2024-02-29 10:20:48 +00:00
09aefe1502 Fix ouput typos (#120870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120870
Approved by: https://github.com/clee2000
2024-02-29 08:29:14 +00:00
14c5ebc8a1 [Dynamo] Do not attempt to make nditer spawned arrays writable (#120868)
They are not writable to begin with; making `numpy.nditer`-spawned arrays writable would be too expensive, and the tensor values are copied anyway.

Minimal reproducer:
```python
import numpy as np
import torch

@torch.compile
def f(x):
    return x + 1.0

for x in np.nditer(np.arange(3)):
    print(f(x))
```

Fixes https://github.com/pytorch/pytorch/issues/119787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120868
Approved by: https://github.com/jansel
2024-02-29 07:49:59 +00:00
169c220bf8 [torch.compile] Provide capability to register callback on compile start/stop (#120764)
This is a requirement from Meta internal cases, where ppl wants to register a callback function to detect if a job is stuck during compilation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120764
Approved by: https://github.com/jansel
2024-02-29 07:37:52 +00:00
82cbd9b131 [dynamo][guards-cpp-refactor] PythonLambdaGuardAccessor (#120730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120730
Approved by: https://github.com/jansel
ghstack dependencies: #120864
2024-02-29 07:25:13 +00:00
66d05a8900 [dynamo] Fix source for default dict default_factory (#120864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120864
Approved by: https://github.com/yanboliang, https://github.com/Skylion007, https://github.com/jansel
2024-02-29 07:25:13 +00:00
df1e855313 [fake_impls] fix max_seqlen return values in efficient_attention_forward (#120842)
To match the actual implementation, we should return max_seqlen_q/k, not M and N, in the sparse case.

7e185277cd/aten/src/ATen/native/transformers/cuda/attention.cu (L981-L996)

Note that although the .cu file sets max_seqlen_k = 0 in the sparse case, it actually returns max_seqlen_k or N:

7e185277cd/aten/src/ATen/native/transformers/cuda/attention.cu (L1224-L1231)

Tests - added in the next PR (#102839, which also fixes other parts of the test_fake tests so that we can un-xfail them and actually run the tests)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120842
Approved by: https://github.com/YuqingJ
ghstack dependencies: #120682
2024-02-29 07:12:27 +00:00
d1d50d2e4c [Inductor][cuDNN] Disable tf32 in test_mutate_base_for_conv_output (#120867)
Looks like there is a sum? comparison where TF32 may not provide the necessary accuracy, leading to failures on sm86.

CC @Skylion007 , hopefully this unblocks #120642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120867
Approved by: https://github.com/Skylion007
2024-02-29 06:59:32 +00:00
8a42cff7b1 [DeviceIndex][7/N] Use DeviceIndex in XPU (#120576)
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120576
Approved by: https://github.com/guangyey, https://github.com/Skylion007
2024-02-29 05:54:23 +00:00
4b18ab869f [torch.export] Support is_compiling() flag for non-strict mode (#119602)
Summary: In non-strict mode of torch.export(), we didn't set the `is_compiling()` flag to `True`, which is needed by some models.
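An illustration of the kind of model code this affects (the import path below is an assumption; PyTorch exposes this check in a few places, e.g. torch._utils in this era):

```python
# Model code that branches on is_compiling() needs the flag to be True during
# non-strict export tracing as well, otherwise the eager-only branch is traced.
from torch._utils import is_compiling

def forward(x):
    if is_compiling():
        return x * 2   # path intended for compile/export tracing
    return x + 1       # eager path
```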

Test Plan: Unit tests and manual testing.

Differential Revision: D53624452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119602
Approved by: https://github.com/suo
2024-02-29 05:52:51 +00:00
0a46102b37 Add equal_to_1 to triton_meta for user-written Triton kernels (#120579)
Summary: Previously, we omitted `equal_to_1` from the `triton_meta` part of the `@user_autotune` decorator. For user-written Triton kernels, this could lead to perf regressions, as the kernel in the Inductor codegen is compiled without `equal_to_1` specialization.

Fixes #120478. The repro from the issue, on A100:

Before this PR:

```
Triton matmul:           0.0167 seconds
Triton matmul compiled:  0.0751 seconds
```

After this PR:

```
Triton matmul:           0.0168 seconds
Triton matmul compiled:  0.0072 seconds
```

Test Plan:

```
$ python test/dynamo/test_triton_kernels.py -k  test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 3 tests in 3.545s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120579
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/chenyang78
2024-02-29 05:19:39 +00:00
4407138bf6 [inductor][eazy] fix a typo in test (#120832)
In theory we can test anything, but the test name mentions attention, so we should multiply by the inv_scale rather than divide by it. I guess that was the initial intention of the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120832
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-02-29 05:04:04 +00:00
2d17230212 [inductor] Do not reuse buffers across scopes in mem planning (#120777)
Summary: Previously, in `memory_plan_reuse` we assumed that the generated code is flat, in the sense that it can't have nested scopes. However, with nested control flow codegen-ing, this is no longer the case. This caused bugs where buffers were reused across the visibility boundaries of different nested scopes.

In this PR, we add nested planning states in `memory_plan_reuse` on entering and exiting a scope in the codegen. This restricts buffer reusability to the currently active (topmost) scope / planning state.
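A conceptual sketch of scope-aware reuse (not Inductor's code):

```python
# Keep a stack of free-lists, one per open scope, so a buffer freed inside a
# nested scope is only eligible for reuse within that currently active scope.
class ScopedReusePool:
    def __init__(self):
        self._scopes = [[]]  # stack of free-lists

    def enter_scope(self):
        self._scopes.append([])

    def exit_scope(self):
        self._scopes.pop()  # buffers freed in the scope stop being reusable

    def mark_free(self, buf):
        self._scopes[-1].append(buf)

    def try_reuse(self):
        return self._scopes[-1].pop() if self._scopes[-1] else None
```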

Test Plan:

```
python test/inductor/test_control_flow.py -k test_subgraphs_with_parameters
...
----------------------------------------------------------------------
Ran 27 tests in 149.413s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120777
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #120665
2024-02-29 03:52:02 +00:00
f5b99976ad [C10D] Make _set_pg_timeout work with DeviceMesh PG (#120850)
Fixes #120847

Makes _set_pg_timeout work on the nccl and/or gloo backends, instead of working only on one backend (gloo) in cases where both backends exist for the group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120850
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2024-02-29 03:41:15 +00:00
26d6ddc232 [bug burndown]Fix #119784 (#120846)
Addresses https://github.com/pytorch/pytorch/issues/119784. Interestingly, the tests seem to just pass (yay!). Tested locally that the previously failing set of tests pass using `PYTORCH_TEST_WITH_DYNAMO=1 pytest functorch/test_vmap.py -v`.

Will wait for CI to pass first before bugging people for reviews.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120846
Approved by: https://github.com/Skylion007
2024-02-29 03:30:40 +00:00
fad228c7cc Fix a potential race condition in the test decorators for enabling/disabling native funcol (#120833)
Previously, we parametrized some tests to run with both native and py funcol by flipping a global variable. However, some of these tests are multi-threaded, and the parametrization mechanism could lead to race conditions.

This PR changes the mechanism to use `mock.patch`, which is applied on a per-thread basis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120833
Approved by: https://github.com/wconstab
2024-02-29 03:19:44 +00:00
2c0c70f763 [Dynamo] enumerate imported names for eval_frame.py (#120778)
Fixes https://github.com/pytorch/pytorch/issues/120699 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120778
Approved by: https://github.com/Skylion007
2024-02-29 03:08:43 +00:00
ef9e89984c [pytorch] Support output types that are non tensors (#120804)
Summary:
Per title.
This is needed because some modules return None and non-tensors as output.

Test Plan: sandcastle?

Reviewed By: zhxchen17

Differential Revision: D54311609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120804
Approved by: https://github.com/zhxchen17
2024-02-29 02:49:10 +00:00
0dbef1618f [inductor] Apply fx passes recursively to nested subgraphs (#120665)
Summary: The current machinery of Inductor's `compile_fx` assumes that the incoming fx graph is flat. As a result, everything before `graph.run` is applied to the outermost graph. This assumption was valid before #119759, but now there is control flow bringing (arbitrarily deeply) nested fx subgraphs to `compile_fx`.

In this PR, we start extending the `compile_fx` machinery to deal with nested fx subgraphs. Namely, we recursively apply Inductor's `pre_grad`, `joint_graph`, and `post_grad` passes to the nested subgraphs in the incoming fx graph.

For the recursive application of the `pre_grad` passes (which require example inputs per subgraph), we don't pass example inputs for the nested subgraphs. A few different attempts to infer the latter via fake tensor prop have led to different side effects in the model. Therefore, for the nested subgraphs, we only apply the subset of `pre_grad` passes that doesn't require example inputs.

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 26 tests in 59.252s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120665
Approved by: https://github.com/eellison
2024-02-29 02:34:54 +00:00
db1cc781db Revert "[dynamo] Function => FunctionCtx for placeholder obj (#120577)"
This reverts commit ee01d0807b924874a329be78c6ee880f556645db.

Reverted https://github.com/pytorch/pytorch/pull/120577 on behalf of https://github.com/jansel due to Causing breakages internally ([comment](https://github.com/pytorch/pytorch/pull/120577#issuecomment-1970254363))
2024-02-29 01:56:09 +00:00
b2e4b621cc Reduce create_env log level to DEBUG (#120772)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120772
Approved by: https://github.com/albanD
2024-02-29 01:33:16 +00:00
9e0631cc8a get CommsDebugMode to work with DTensor (#118769)
Tested with Wanchao's repro:
```
from typing import Tuple, List, Dict, cast
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, DTensor, Shard, Placement, Replicate

mesh = init_device_mesh(device_type="cuda", mesh_shape=(2,))
x = torch.randn(4, 8, requires_grad=True)
y = torch.randn(4, 32, requires_grad=True)
x_dtensor = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
y_dtensor = DTensor.from_local(y, mesh, [Shard(0)], run_check=False)
from torch.distributed._tensor.debug import CommDebugMode
comm_mode = CommDebugMode()
with comm_mode:
    z = torch.mm(x_dtensor, y_dtensor)
print(comm_mode.get_comm_counts())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118769
Approved by: https://github.com/wanchaol
2024-02-29 01:11:05 +00:00
381a7ad3f1 [C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745
Approved by: https://github.com/zdevito
ghstack dependencies: #120724, #120270
2024-02-29 01:03:31 +00:00
f85d3a022c [C10D] Fix pointToPoint op Flight Recording (#120270)
Fix and test issues with both coalesced and individual send/recv ops

Considered an alternate approach and then ditched it
 - alternate approach: #119757
 - reason ditched: prefer recording individual collective events inside
   coalescing region instead of just the event at the end of the region,
   which also would not have tensor sizes or opnames without additional
   state variables added

Another approach also ditched
- record events on workEnqueue instead of initWork
- reason ditched: too messy to get input/output shapes tagged on
  recording when recording in workEnqueue.  Adding the info onto the
  Work obj would be possible, but adds to overhead of copying Works
  which we do on every collective. We can get info off the input/output
  tensors directly in initWork, but we don't want to keep refs to those
  tensors alive while the work is Enqueued, so we'd have to specifically
  copy size lists or something.

This PR instead avoids creating a work inside pointToPoint when
coalescing is active. Instead, only at endCoalescing() is a work finally
initialized and enqueued. But it adds a record() call inside
pointToPoint() instead of creating a work, during coalescing. This
record() call picks up tensor shapes and op names.

It ALSO changes initWork to accept a 'record' argument. This defaults to
false, and should only be set to true if the caller ensures the work
will be enqueued by workEnqueue, ensuring its cuda events are live when
used by flight recorder's update_state().

The testing uncovers some odd pre-existing behaviors and leaves them
alone for now. We could change some of these:
- seq starts off at 1, not 0, for the first op (but this is inconsistent)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120270
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #120724
2024-02-29 01:03:31 +00:00
7f4d673885 [C10D] Add record_id to flight recorder (#120724)
In cases where sequence number is shared between events (e.g. coalesced
collectives) we want to ensure a unique (and ordered) ID per record.

Note: the records are already in a list, so their ID could be implicitly
observed.  But (1) it's a ring buffer, so absolute ID is lost once the
buffer rolls over once, (2) users may sort or process or filter their
flight records, so having the ID be an explicit member of an entry is
still useful
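A small sketch of why the explicit id helps (illustration only, not the c10d implementation):

```python
# Positions in a ring buffer are reused once it wraps, so the list index alone
# cannot recover a global, stable ordering; a monotonically increasing
# record_id stored on each entry can.
from collections import deque

class FlightRecorder:
    def __init__(self, capacity: int):
        self.entries = deque(maxlen=capacity)  # ring buffer of recent records
        self._next_id = 0

    def record(self, event: dict) -> None:
        self.entries.append({**event, "record_id": self._next_id})
        self._next_id += 1
```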

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120724
Approved by: https://github.com/zdevito
2024-02-29 01:03:31 +00:00
950b484356 skip three pyhpc models with dynamic shape test (#120599)
As reported in https://github.com/pytorch/pytorch/issues/119434, `pyhpc_isoneutral_mixing`, `pyhpc_equation_of_state` and `pyhpc_turbulent_kinetic_energy` fail with dynamic shape testing; we propose to skip dynamic batch size testing for these 3 models in this PR.

* Error msg is
```
  File "/localdisk/leslie/torch_inductor_community/pytorch/benchmarks/dynamo/common.py", line 3879, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 1048576
```

* Root Cause is
  *  The benchmark code will only annotate an input's dim as dynamic when its size equals the batch size c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If it fails to find any dim equal to the batch size, the above error is thrown.
  * However, for these 3 models, none of the inputs' dims will equal the input batch size because of the [relationship of dim sizes](26b85eadde/torchbenchmark/models/pyhpc_equation_of_state/__init__.py (L12-L16))
  ```
    shape = (
        math.ceil(2 * size ** (1/3)),
        math.ceil(2 * size ** (1/3)),
        math.ceil(0.25 * size ** (1/3)),
    )
  ```
  * Another thing: `pyhpc_isoneutral_mixing` and `pyhpc_equation_of_state` can pass the dynamic batch size accuracy testing because the batch size is set to 4 in accuracy testing (c617e7b407/benchmarks/dynamo/common.py (L3456)) and `math.ceil(2 * size ** (1/3))` happens to equal 4.

* Since the input dim sizes have the above relationship, running these models with dynamic shapes would require annotating `dim[0](s0) = dim[2](s1) * 8`; per the discussion in https://github.com/pytorch/pytorch/issues/117477#issuecomment-1897108756 @avikchaudhuri, this does not look expressible for this case. So, I think we need to skip dynamic batch size testing for these 3 models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120599
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-02-29 00:38:06 +00:00
3179107629 [DDP][PT2D] Ignore gradient sync if the gradient is not defined (#120419)
From the test, accum_grad_hook can still be fired even if the gradient is None. We need to ignore the gradient sync for this case.

Differential Revision: [D54076485](https://our.internmc.facebook.com/intern/diff/D54076485/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120419
Approved by: https://github.com/yf225, https://github.com/XilunWu
2024-02-29 00:27:54 +00:00
ab38354887 Allow str inputs in non-strict tracing (#120536)
Previously, torch.export in non-strict mode was failing on str inputs while creating fake inputs for tracing (fakify()) and while using graph nodes to create constraints. This fixes those 2 stages to allow strs to pass through.

Failing test case:
```
class Foo(torch.nn.Module):
    def forward(self, a, b, mode):
        return torch.div(a, b, rounding_mode=mode)

foo = Foo()
inps = (torch.randn(4, 4), torch.randn(4), "trunc")
exported = export(foo, inps)
with self.assertRaisesRegex(
    RuntimeError, "to be equal to trunc, but got floor"
):
    _ = exported.module()(torch.randn(4, 4), torch.randn(4), "floor")
self.assertTrue(torch.allclose(exported.module()(*inps), foo(*inps)))
```

Before:
```
(pytorch-local) pianpwk@pianpwk-mbp pytorch % python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str
E
======================================================================
ERROR: test_runtime_assert_for_prm_str_non_strict (__main__.NonStrictExportTestExport.test_runtime_assert_for_prm_str_non_strict)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pianpwk/Documents/pytorch/torch/testing/_internal/common_utils.py", line 2744, in wrapper
    method(*args, **kwargs)
  File "/Users/pianpwk/Documents/pytorch/test/export/testing.py", line 40, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/test/export/test_export.py", line 1588, in test_runtime_assert_for_prm_str
    exported = export(foo, inps)
               ^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/test/export/test_export_nonstrict.py", line 16, in mocked_non_strict_export
    return export(*args, **kwargs, strict=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/__init__.py", line 186, in export
    return _export(
           ^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 541, in wrapper
    raise e
  File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 527, in wrapper
    ep = fn(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/exported_program.py", line 83, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 707, in _export
    ) = make_fake_inputs(f, args, kwargs, constraints)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 133, in make_fake_inputs
    fake_args, fake_kwargs = tree_map_with_path(
                             ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 1519, in tree_map_with_path
    return treespec.unflatten(func(*xs) for xs in zip(*all_keypath_leaves))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 734, in unflatten
    leaves = list(leaves)
             ^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 1519, in <genexpr>
    return treespec.unflatten(func(*xs) for xs in zip(*all_keypath_leaves))
                              ^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 134, in <lambda>
    lambda kp, val: fakify(fake_mode, kp, val, t_constraints, sources),
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 68, in fakify
    raise ValueError("Only tensors allowed as input")
ValueError: Only tensors allowed as input

To execute this test, run the following from the base repo dir:
     python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str_non_strict

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.008s

FAILED (errors=1)
```

After:
```
(pytorch-local) pianpwk@pianpwk-mbp pytorch % python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str
.
----------------------------------------------------------------------
Ran 1 test in 0.237s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120536
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17, https://github.com/avikchaudhuri, https://github.com/gmagogsfm
2024-02-28 23:56:30 +00:00
1b8bb027f6 Fix guard for SUPPORTED_NODES (#120798)
The special-case code for handling SUPPORTED_NODES was producing a guard that looked like:
```
"G['torch'].utils._pytree.SUPPORTED_NODES[<class '__main__.CausalLMOutputWithPast'>].type"
```
resulting in an eval error when trying to evaluate the guard.

This change adds a new source type (`ClassSource`) which is given a class type (in this case `CausalLMOutputWithPast`) and attempts to fetch it from its defining module.  It then uses that to build the `SUPPORTED_NODES` guards instead of referring to the type directly.
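A hedged illustration of the idea behind `ClassSource` (not dynamo's actual implementation):

```python
# Resolve a class through its defining module so the guard can reference it by
# name, instead of embedding the type's repr (e.g. "<class '__main__....'>")
# into the guard expression.
import importlib

def resolve_class_from_module(cls: type) -> type:
    mod = importlib.import_module(cls.__module__)
    return getattr(mod, cls.__name__)

class Example:
    pass

assert resolve_class_from_module(Example) is Example
```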

Also added a unit test which fails before this change and passes after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120798
Approved by: https://github.com/anijain2305
2024-02-28 23:34:17 +00:00
aa36821615 [Memory Snapshot] Stop clearing history when changing context (#120436)
Summary:
This change avoids clearing the memory event history when changing the context from `record_memory_history(context=None)` to `record_memory_history(context="python")`.

Now it will continue recording memory events with changing context on the fly. Only `record_memory_history(enabled=None)` will clear the history.

Test Plan:
# Ran on the following local Resnet50 example:

- At iteration=0, record_memory_history(context=None, stacks="python")
- At iteration=3, record_memory_history(context="all", stacks="python")
- After iteration=4, export_memory_snapshot()
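A sketch of the test-plan sequence above (helper names follow the summary; the exact import paths shown are an assumption, e.g. the torch.cuda.memory APIs):

```python
import torch

torch.cuda.memory._record_memory_history(context=None, stacks="python")   # at iteration 0
# ... run 3 training iterations ...
torch.cuda.memory._record_memory_history(context="all", stacks="python")  # at iteration 3
# ... run 2 more iterations ...
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")                # export after iteration 4
```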

## Before:
 - Only collects the last 2 iterations with python call stacks.
![image](https://github.com/pytorch/pytorch/assets/17602366/86154532-9f73-4d10-9194-19e8c96ee4f3)

## After:
 - Collects all 5 iterations, where first 3 iterations have no call stacks, and last 2 iterations have python call stacks.
![image](https://github.com/pytorch/pytorch/assets/17602366/c2c277d6-b400-4da2-85c8-a7f119d409f8)
![image](https://github.com/pytorch/pytorch/assets/17602366/dc9da2f8-41cc-44b0-9c32-ec3cbe79d2c4)

Differential Revision: D54084017

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120436
Approved by: https://github.com/zdevito, https://github.com/leitian
2024-02-28 22:46:26 +00:00
86ff31c4a0 Revert "Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)"
This reverts commit cabc09a5f259f1cc1e3bad1d80b5e5274838bced.

Reverted https://github.com/pytorch/pytorch/pull/120455 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](https://github.com/pytorch/pytorch/pull/120455#issuecomment-1970026100))
2024-02-28 22:30:18 +00:00
dbe0967a0a Revert "Add test to check that COW inputs are not materialized (#119507)"
This reverts commit 2ebf2c88baa4667d55eda92f4c8424db505af781.

Reverted https://github.com/pytorch/pytorch/pull/119507 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](https://github.com/pytorch/pytorch/pull/119507#issuecomment-1970022840))
2024-02-28 22:26:59 +00:00
7e185277cd [cuDNN] bump cuDNN-frontend submodule to 1.1.2 (#120761)
Hopefully resolves additional `CUDNN_STATUS_SUCCESS` failures that we have been seeing on H100 (though curiously not on upstream CI, perhaps due to the different hardware being tested)

Need to confirm the fix on our end before merging

CC @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120761
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia
2024-02-28 22:15:43 +00:00
9c9bde515c Factor out Submod compilers (#120527)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120527
Approved by: https://github.com/kadeng
2024-02-28 22:11:47 +00:00
5b5bcf0470 Test that tlparse understands the structured logs we output (#120658)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120658
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #120712, #120289
2024-02-28 21:58:39 +00:00
d6c202975c Move attention kernels from meta_registrations to fake_impls (#120682)
This PR is mostly just code movement to make the code review easier - AFAIK it should not change any functionality. The final goal is to remove the xfails for some of the test_fake opinfos for these ops. The opinfos are failing because the outputs can have mixed devices - we need to move them to fake_impls first before we can support mixed device returns.

This PR:
* Move the `_meta_registrations.py` implementations to `fake_impls.py`
* Change the function signature from taking explicit named variables to taking `{args, kwargs}` and normalizing them
* Wrap all the returned tensors in FakeTensors

Tests: relying on opinfos. I also checked `test_fake_*` for these tests (by removing x-fails and patching things until they passed) to verify general correctness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120682
Approved by: https://github.com/drisspg
2024-02-28 21:49:13 +00:00
50073248ed add a note wrt torch.nn.functional.scaled_dot_product_attention (#120668)
Follow-up change to https://github.com/pytorch/pytorch/pull/120565

- Added a note in the transformer class pointing out that the mask definition is opposite to that of :attr:`attn_mask` in torch.nn.functional.scaled_dot_product_attention.
@mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120668
Approved by: https://github.com/mikaylagawarecki
2024-02-28 21:16:34 +00:00
e2ee87d48b Fix segfault on mac when running vulkan tests (#120337)
Summary: Vulkan gtests were segfaulting on Mac because the memory for barriers can get destroyed after the local function (CommandBuffer::insert_barrier) where it is created exits. Since we provide this barrier pointer to the Vulkan library, it needs to remain alive even after the function exits, else we get crashes.

Test Plan:
Verify that there is no segfault on Mac with the fix and that the tests can run:

Compile gtests:
buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output

Crash w/o diff
bash-3.2$ buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 85 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 85 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.uniform_buffer_copy
[       OK ] VulkanAPITest.uniform_buffer_copy (88 ms)
[ RUN      ] VulkanAPITest.copy_to_buffer
Segmentation fault: 11

With diff there is no crash:
bash-3.2$ buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 85 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 85 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.uniform_buffer_copy
[       OK ] VulkanAPITest.uniform_buffer_copy (296 ms)
.....
[  FAILED  ] VulkanAPITest.gelu_quint8_self (23 ms)
[----------] 85 tests from VulkanAPITest (1494 ms total)

[----------] Global test environment tear-down
[==========] 85 tests from 1 test suite ran. (1494 ms total)
[  PASSED  ] 72 tests.
[  FAILED  ] 13 tests, listed below:
[  FAILED  ] VulkanAPITest.linear_2d_flat
[  FAILED  ] VulkanAPITest.linear_2d_small
[  FAILED  ] VulkanAPITest.linear_2d_large
[  FAILED  ] VulkanAPITest.linear_3d_flat
[  FAILED  ] VulkanAPITest.linear_3d_small
[  FAILED  ] VulkanAPITest.linear_3d_large
[  FAILED  ] VulkanAPITest.linear_4d_flat
[  FAILED  ] VulkanAPITest.linear_4d_small
[  FAILED  ] VulkanAPITest.linear_4d_large
[  FAILED  ] VulkanAPITest.gelu_qint8
[  FAILED  ] VulkanAPITest.gelu_qint8_self
[  FAILED  ] VulkanAPITest.gelu_quint8
[  FAILED  ] VulkanAPITest.gelu_quint8_self

The above failing tests were failing before as well and are being worked on.

Differential Revision: D54023146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120337
Approved by: https://github.com/SS-JIA
2024-02-28 20:55:47 +00:00
e317e39a02 Fix nonlinearity arg issue in RNN (#120234)
Fixes #114617

This PR fixes the issue with `nonlinearity`, so that it can be passed as an arg or a kwarg.

Alternatively, if making `nonlinearity` kwarg-only is preferred, I can revert to another commit. cc @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120234
Approved by: https://github.com/mikaylagawarecki
2024-02-28 20:53:18 +00:00
8b22fe9594 [FX passes] Set group/batch fusion log to DEBUG level (#120780)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120780
Approved by: https://github.com/jackiexu1992
2024-02-28 20:48:11 +00:00
4903e33e19 Revert "Capture non tensor arguments in record_function (#120017)"
This reverts commit 5c5b71b6eebae76d744261715231093e62f0d090.

Reverted https://github.com/pytorch/pytorch/pull/120017 on behalf of https://github.com/soulitzer due to regresses perf on autograd Function when using profiler ([comment](https://github.com/pytorch/pytorch/pull/120017#issuecomment-1969883792))
2024-02-28 20:43:33 +00:00
01ec8df6d8 [Compiled Autograd] Introduce BackwardState capture (#120382)
This adds support for backwards hooks that are *both*:
1) Interior to the graph; and
2) Dynamically generated (e.g. lambdas)

We do this by creating a BackwardState object that is used to register the hooks in the forward, then populated by dynamo *after* the forwards runs.
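A usage sketch of the pattern this enables (hedged: whether the backward actually runs through compiled autograd depends on configuration and version):

```python
# A dynamically generated (lambda) hook registered on an intermediate tensor
# inside the compiled region; BackwardState is what lets the hook's captured
# state be filled in after the forward runs.
import torch

@torch.compile
def f(x, scale):
    y = x.sin()
    y.register_hook(lambda grad: grad * scale)  # interior + dynamically generated
    return y.cos()

# Invoking the backward for such hooks is assumed to go through compiled autograd.
out = f(torch.randn(3, requires_grad=True), 2.0)
```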

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382
Approved by: https://github.com/xmfan
2024-02-28 20:36:47 +00:00
c016ffed5b [C10D] Fix logic for default group=None in _set_pg_timeout (#120686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120686
Approved by: https://github.com/yifuwang
2024-02-28 20:31:14 +00:00
11de40f82f [flight recorder] record process group configuration (#120262)
Summary: Record process group configuration (i.e. ranks involved in a process group) to facilitate NCCL related debugging.

Differential Revision: D53792087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120262
Approved by: https://github.com/shuqiangzhang
2024-02-28 20:31:08 +00:00
5aa7f8646f [inductor][Gemm] Autotune with matrix_instr_nonkdim for AMDGPU (#120742)
Relanding https://github.com/pytorch/pytorch/pull/120639 + a fix to drop `matrix_instr_nonkdim` that does not align with `BLOCK_M` or `BLOCK_N`

Matrix multiplication with Triton is usually done in a tiled way. For a tile size too large for a single hardware instruction, e.g. a 128x128 matmul tile, it has to be broken down into a sequence of smaller hardware mma instructions. On AMDGPU, matrix_instr_nonkdim controls the shape of the mma instructions; its default value in Triton is 0. This means that by default Triton will decompose a large tiled matmul operation into a sequence of 32x32x8 mma instructions. There are other mma instructions available, such as 16x16x16, which requires matrix_instr_nonkdim=16. This change enables tuning this value for Gemm, which seems to improve its performance by 20% to 2x.

Before:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0410 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0487 ms 84.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0544 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0633 ms 64.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0687 ms 59.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0716 ms 57.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0748 ms 54.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0788 ms 52.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.1014 ms 40.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.1069 ms 38.4%
  SingleProcess AUTOTUNE takes 8.1153 seconds
```

After:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0417 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0470 ms 88.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0488 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0490 ms 85.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0525 ms 79.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0553 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0574 ms 72.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.0634 ms 65.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=2 0.0655 ms 63.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0681 ms 61.2%
  SingleProcess AUTOTUNE takes 11.4076 seconds
```

Before:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2094 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2452 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2763 ms 75.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2836 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2854 ms 73.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2951 ms 71.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2970 ms 70.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.4184 ms 50.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.5097 ms 41.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.5570 ms 37.6%
  SingleProcess AUTOTUNE takes 3.4052 seconds
```

After:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2117 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2429 ms 87.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2485 ms 85.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2526 ms 83.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2537 ms 83.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2554 ms 82.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2623 ms 80.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2695 ms 78.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2758 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.2792 ms 75.8%
  SingleProcess AUTOTUNE takes 11.3538 seconds

```

Before:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5901 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 1.9380 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 1.9943 ms 79.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.0640 ms 77.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.0941 ms 75.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.1272 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.1554 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.2931 ms 69.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 3.7016 ms 43.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 4.6021 ms 34.6%
  SingleProcess AUTOTUNE takes 9.0523 seconds
```

After:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5862 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.6924 ms 93.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.7616 ms 90.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.8159 ms 87.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.9340 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.9352 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0378 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0983 ms 75.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1138 ms 75.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1657 ms 73.2%
  SingleProcess AUTOTUNE takes 8.2225 seconds
```

Before:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 12.0134 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 14.8082 ms 81.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 15.4242 ms 77.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 16.6869 ms 72.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 16.7751 ms 71.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 17.0145 ms 70.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 17.1363 ms 70.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 18.2159 ms 66.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 29.4726 ms 40.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 37.9039 ms 31.7%
  SingleProcess AUTOTUNE takes 11.0074 seconds
```

After:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 11.9554 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 12.9953 ms 92.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 13.7726 ms 86.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 13.9647 ms 85.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 14.9728 ms 79.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 15.3729 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 15.3955 ms 77.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 15.5647 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.0037 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.7432 ms 71.4%
  SingleProcess AUTOTUNE takes 14.9839 seconds
```

Reviewed By: xw285cornell, nmacchioni

Differential Revision: D54203170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120742
Approved by: https://github.com/xw285cornell
2024-02-28 20:27:14 +00:00
b020ee5b05 [PyTorch] Use MaybeOwned when promoting indices/offsets in embedding_bag (#120755)
We're currently doing two unnecessary reference count
operations in the case where promotion doesn't need to happen.

Differential Revision: [D54285999](https://our.internmc.facebook.com/intern/diff/D54285999/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120755
Approved by: https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #120752
2024-02-28 20:13:30 +00:00
98d1529474 [PyTorch] fix mixed int32/int64 indices/offsets for embedding_bag_out (#120752)
This was an oversight in D27482738 (#55189) -- it only patched the regular embedding_bag operator, but static runtime uses the out variant.

Differential Revision: [D54285460](https://our.internmc.facebook.com/intern/diff/D54285460/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120752
Approved by: https://github.com/houseroad
2024-02-28 20:13:30 +00:00
db92558229 [codemod][lowrisk] Fix deprecated use of 0/NULL (#120740)
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.

This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.

Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D54163060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120740
Approved by: https://github.com/Skylion007
2024-02-28 20:13:13 +00:00
491c2b4665 Let torch dynamo inline torch.func.grad (#118407)
When dynamo sees torch.func.grad, it tries to inline all frames related to it.
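
For illustration, a hedged sketch of the kind of program this lets dynamo trace (illustrative, not the PR's test):

```
import torch

def loss(x):
    return torch.sin(x).sum()

@torch.compile
def compiled_grad(x):
    # dynamo now inlines the frames introduced by torch.func.grad
    return torch.func.grad(loss)(x)

print(compiled_grad(torch.randn(3)))
```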

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118407
Approved by: https://github.com/zou3519
2024-02-28 20:05:00 +00:00
5472923998 derived dim (#118729)
With the current `Dim`-based dynamic shapes API for export, one can express that shapes of different input shapes must be equal by reusing the same `Dim`. However, non-trivial relationships between such input shapes cannot be expressed.

Recently we are seeing more and more examples of code that require this additional expressibility, e.g., where a pair of shapes might differ by one, or a shape might be double another (or simply even).

This PR introduces the concept of a "derived" `Dim`, i.e., a linear arithmetic expression over a `Dim`. By using a combination of `Dim`s and derived `Dim`s to specify input shapes, the desired relationships can be expressed naturally. E.g., a pair of shapes might be `dim` and `dim + 1`, or `dim` and `2*dim`, or even `2*dim` and `dim + 1`.

We extend the current infrastructure that translates `Dim`s to deprecated `dynamic_dim`-based constraints to work with derived `Dim`s. As usual, we raise constraint violation errors when shape guards cannot be verified given a dynamic shapes spec; suggest fixes; and raise runtime errors when future inputs violate the spec.

Importantly, some guards that used to cause forced specializations in the constraint solver because they were deemed "too complex" now do not do so, because they can now be specified as constraints. Since this was what motivated the introduction of a `disable_constraint_solver` flag to some internal APIs, we may not need that flag any more.

Note that shapes of placeholders in exported programs can now contain symbolic expressions and not just symbols.
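
A hedged sketch of the derived-`Dim` usage described above (module, shapes, and argument names are illustrative):

```
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        # the elementwise add forces y to have exactly one more row than x
        return x + y[1:]

dim = Dim("dim", min=2, max=64)
ep = export(
    M(),
    (torch.randn(5, 4), torch.randn(6, 4)),
    # y's first dimension is a derived Dim: dim + 1
    dynamic_shapes={"x": {0: dim}, "y": {0: dim + 1}},
)
```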

Differential Revision: D53254587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118729
Approved by: https://github.com/ezyang
2024-02-28 19:48:32 +00:00
9c55aa6ff6 TransformerEncoder/Decoder: add type hints (#120550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120550
Approved by: https://github.com/mikaylagawarecki
2024-02-28 19:36:08 +00:00
4b7a521856 Update flash_attention kernel from 2.3.6 to 2.5.5 (#118935)
# Summary
Updates FlashAttention kernel code from tag [2.3.6](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.3.6) to [2.5.5](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.5).

The usual changes were then re-rolled on top of the modified kernel: changing how dropout is saved for backward, and removing the head_dim_pad, since that would make the kernel mutate in place, which interacts badly with functionalization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118935
Approved by: https://github.com/cpuhrsch
2024-02-28 19:31:15 +00:00
a9d9077f12 Revert "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)"
This reverts commit 7c556428c74a79c6d9c272826344a0828d3f66f5.

Reverted https://github.com/pytorch/pytorch/pull/119639 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54286923 ([comment](https://github.com/pytorch/pytorch/pull/119639#issuecomment-1969634480))
2024-02-28 18:57:09 +00:00
1c67f6cb26 fix decomposition of aten.diag_embed (#120549)
Fixes #117019
Correctly handle inputs where one dim argument is negative and the other is non-negative in the decomposition of `aten.diag_embed`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120549
Approved by: https://github.com/Dalian991, https://github.com/janeyx99
2024-02-28 18:48:01 +00:00
f422467ccb [BE]Delay the call to set_pytorch_distributed_envs_from_justknobs (#120625)
When the default process group is initialized twice, `init_process_group` shows an explicit message indicating that.

However, with `set_pytorch_distributed_envs_from_justknobs` as the very first line in `init_process_group`, the error message becomes implicit and the root cause is hard to understand when testing with the FB code base.

Differential Revision: [D54206202](https://our.internmc.facebook.com/intern/diff/D54206202/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120625
Approved by: https://github.com/wconstab, https://github.com/yifuwang
2024-02-28 18:34:45 +00:00
91190d8087 [quant][pt2e] Relax model_is_exported input (#120720)
Summary: This commit relaxes the `model_is_exported` API to also accept
`torch.nn.Module`s, not just `torch.fx.GraphModule`s, simplifying
downstream uses.
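
A hedged sketch of the relaxed behavior (the import path below is an assumption for illustration; check the PT2E quantization utilities for the exact location):

```
import torch
# Import path assumed for illustration only.
from torch.ao.quantization.pt2e.export_utils import model_is_exported

eager_model = torch.nn.Linear(4, 4)
# Previously this expected a torch.fx.GraphModule; a plain nn.Module is now accepted.
print(model_is_exported(eager_model))  # False for a non-exported eager module
```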

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_model_is_exported

Differential Revision: [D54263935](https://our.internmc.facebook.com/intern/diff/D54263935)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120720
Approved by: https://github.com/tugsbayasgalan
2024-02-28 18:32:03 +00:00
f67c77c497 Update engine.cpp (#120773)
Minor comment fix; `backward` and `grad` are flipped here. See https://pytorch.org/docs/stable/_modules/torch/autograd.html#backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120773
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/soulitzer
2024-02-28 18:23:35 +00:00
0ab2ec3738 [XPU][Profiler] Add Logic To The Profiler For Processing XPU-backend Data (#120185)
This pull request provides an update on recent advancements in the PyTorch profiler with regard to XPU backend support. Following the successful merge of a previous pull request #94502 that established a pathway for the XPU backend within PyTorch, we have now taken steps to enhance the profiler's capabilities for handling and displaying profile data directly related to the XPU backend.

# Motivation

The current pull request builds upon this foundation by refining the profiler's data processing scripts, particularly `profiler_util.py`, to accommodate XPU backend-specific profile data. The aim is to align the handling and presentation of this data with that of the CUDA backend, offering users a consistent experience across different device profiles. This includes generating outputs such as JSON files compatible with Chrome trace tooling, among other formats.

# Principles

1. Minimal Impact: The modifications introduced should support XPU backend data with minimal disruption to the existing profiling scripts.
2. Consistency: Changes should maintain stylistic and functional consistency with existing `CUDA` and `privateuse1` pathways, ensuring no adverse effects on other logic paths.
3. Exclusivity: Ensure that the new XPU pathway does not interfere with or impede other pathways.

# Solutions

### a. Pathway Identification:

Introduction of a `use_xpu` flag within `torch.autograd.profiler.profile` interfaces to distinguish XPU-specific profiling.
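
A hedged usage sketch of the flag (requires an XPU-enabled build; the profiler table layout for XPU is assumed to mirror the CUDA pathway):

```
import torch

# use_xpu mirrors use_cuda; only meaningful on an XPU-enabled build.
with torch.autograd.profiler.profile(use_xpu=True) as prof:
    x = torch.randn(128, 128, device="xpu")
    y = x @ x
print(prof.key_averages().table(row_limit=10))
```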

### b. `use_device` Logic Revision:

With the introduction of the XPU pathway, `use_device` no longer implies a binary relationship with `use_cuda`. Consequently, we have revised related logic to remove implicit assertions and establish independent device distinction.

### c. Kernel List Segregation:

To accommodate the non-binary nature of device pathways, we have enabled kernel lists to identify specific device affiliations through separate list objects.

### d. Formatted Output:

To ensure output consistency, we have employed code duplication and keyword substitution techniques to facilitate the formatting of XPU-related profile data.

# Additional Enhancements

### a. Enumerations in `.pyi` Files:

Added recognition items for `DeviceType` and `ProfilerActivity` specific to XPU.

### b. Correct DeviceType Returns:

Revised `deviceTypeFromActivity` logic to accurately differentiate between device backends, even when they share common flags such as `libkineto::ActivityType::GPU_MEMCPY`.

### c. Bug Fixes in `cuda_corr_map`:

Addressed a corner case where erroneous parent-child event relationships were formed due to shared function event identifiers. The solution involves refining `cuda_corr_map` processing to prevent a function event from being misidentified as both the linker and linkee.

# Further Abstraction

Looking forward, we acknowledge the potential for further abstraction in the codebase. The current changes necessitated by XPU support have highlighted opportunities for reducing redundancy by consolidating naming conventions and utilizing a singular `device` naming system that relies on `DeviceType` attributes or string flags for differentiation. This would involve significant refactoring to replace device-specific flags and variables. This topic needs further discussions about whether we could and when we should deprecate all those flags and variables named with `cuda`.

# Next Pull Request

The next pull request will be contingent on Kineto's adoption of Intel's forthcoming PTI-sdk library, which will enable direct usage of XPU-related tracers. Subsequent modifications to `libkineto_init()` will aim to endow PyTorch running on XPU backends with comprehensive profiling capabilities on XPU devices.

We appreciate your attention to these enhancements and welcome any feedback or questions you may have regarding these developments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120185
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-02-28 17:50:32 +00:00
3e8b56d362 [Inductor] Track constant's original_fqn mapping (#120524)
When compiling a deserialized ExportedProgram, a constant's original_fqn is not populated. The highlighted line below is missing, and a later assertion breaks because original_fqn is absent.

```
constants_info_[0].name = "L__self___w_pre";
constants_info_[0].dtype = static_cast<int32_t>(cached_torch_dtype_float32);
constants_info_[0].offset = 0;
constants_info_[0].data_size = 64;
constants_info_[0].from_folded = false;
constants_info_[0].shape = {4, 4};
constants_info_[0].stride = {4, 1};
// constants_info_[0].original_fqn = "w_pre";   // this line is missing
```

Inductor relies on `dynamo_flat_name_to_original_fqn` to populate the original_fqn field. This field originates from `graph_module.meta["dynamo_flat_name_to_original_fqn"]` and is set during dynamo tracing. However, when compiling a deserialized ExportedProgram, we don't do dynamo tracing, so this field is missing.

As a fix, I maintain AOTI's own mapping for constant tensors' FQNs.

Differential Revision: D54097073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120524
Approved by: https://github.com/chenyang78
2024-02-28 17:36:29 +00:00
702e82da28 [cuDNN][Flash Attention] Minor cleanup for cuDNN SDPA (#120750)
Cleaning up before hopefully starting work on backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120750
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-02-28 17:32:07 +00:00
364faafe75 [DCP] Asserts CPU backend for async_save (#120241)
If a CPU device is not present, collectives will hang in the threaded case due to: https://github.com/pytorch/pytorch/issues/115861

This PR asserts a CPU device is enabled in the pg group backend.
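
A hedged sketch of a process group setup that satisfies the new assertion (the backend string is an assumption about typical usage, not taken from this PR):

```
import torch.distributed as dist

# Register both a CPU-capable backend (gloo) and a CUDA backend (nccl) so that
# collectives used by async_save have a CPU device to run on.
# Run under torchrun so rank/world-size environment variables are set.
dist.init_process_group(backend="cpu:gloo,cuda:nccl")
```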

Differential Revision: [D53952864](https://our.internmc.facebook.com/intern/diff/D53952864/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120241
Approved by: https://github.com/fegin
2024-02-28 17:21:30 +00:00
c8a34a4013 [ez] Smaller weight for some TD heuristics (#120736)
Normalize to different number for the fuzzier heuristics

Could this be done as a weighting elsewhere? Yes, but putting it here since I'm not sure which object would hold it best
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120736
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-02-28 17:07:45 +00:00
dfe7b9d471 Move user defined triton tests to inductor test folder (#120738)
Summary: FBCode CI does not compile torch with CUDA for tests in the dynamo folder; instead of adding a special rule, let's move these tests to the inductor folder.

Test Plan:
```
buck run mode/opt //caffe2/test/inductor/:triton_kernels
```
now works instead of skipping tests

Differential Revision: D54280629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120738
Approved by: https://github.com/aakhundov
2024-02-28 17:03:41 +00:00
df40847486 Add xpu header to include/ATen/xpu (#120786)
# Motivation
Add xpu header file to `include/ATen/xpu` to make them public.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120786
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD
2024-02-28 16:22:14 +00:00
7881b95c73 Don't suppress error codes in lint job, properly activate conda (#120769)
Before:

```
2024-02-28T02:38:24.3757573Z + conda activate /opt/conda/envs/py_3.9
2024-02-28T02:38:24.3757872Z
2024-02-28T02:38:24.3758116Z CondaError: Run 'conda init' before 'conda activate'
```

Now, this would actually fail the job, and I also fix the bug.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120769
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/malfet
2024-02-28 15:17:31 +00:00
facfc0baaf Update _constrain_as_size docs (#120728)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120728
Approved by: https://github.com/Skylion007
2024-02-28 15:03:10 +00:00
82099ab87b [easy] Reword unexpected success error messages and generated github issues now that we have sentinel files (#120766)
It's a bit annoying to have to read through the test name in verbose mode just to see what the test's sentinel file is actually called when encountering an unexpected success. Now that we have sentinel files, we can directly list the file path from root in the error message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120766
Approved by: https://github.com/Skylion007
2024-02-28 11:15:29 +00:00
46e3f670b4 refactor code to share across different devices (#120602)
# Motivation
Refactor utils code to make it possible to share across CUDA, XPU, and other backends.

# Solution
Move `_dummy_type` and `_LazySeedTracker` to torch._utils;

# Additional Context
When upstreaming, refactor these code changes by isolating them into an additional PR to minimize their impact on the CUDA code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120602
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/EikanWang
2024-02-28 09:42:58 +00:00
a11a49af58 Add NCCL work sequence number to work info (#120596)
Summary: Expose sequence number to work info. The number can help applications identify a NCCL work more precisely.

Test Plan:
1. pytest test/distributed/test_c10d_nccl.py::WorkHookTest::test_on_completion_hook_seq
2. pytest test/distributed/test_c10d_nccl.py::WorkHookTest

Differential Revision: D54180050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120596
Approved by: https://github.com/kwen2501
2024-02-28 07:54:37 +00:00
be31e522ce [PT2][Inductor] Fix "example_value" absent for stack nodes (#120655)
Summary:
We observed that stack nodes have a missing example_value in DPA+FIRST, causing issues for further split-cat optimization. Full error log: P1187633689.

pre grad graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GPUFOBWniTeB6s8DAN8z9sHTadpxbr0LAAAz

We found that it was introduced by the new stack nodes in the group batch fusion, so we fix the bug to enable further split-cat optimization.

Test Plan:
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
before fix: P1187633689
```
W0221 13:32:09.334000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: sigmoid_16
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_19
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_16
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_6
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_5
W0221 13:32:09.336000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_4
W0221 13:32:09.517000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_20
W0221 13:32:09.518000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_18
W0221 13:32:09.518000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_17
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_19
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_15
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_14
W0221 13:32:09.522000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_16
W0221 13:32:09.524000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_18
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_12
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_11
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_13
W0221 13:32:09.527000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_17
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_9
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_8
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_10
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_7
```

after fix:
P1189491364
```
W0226 13:19:56.542000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: sigmoid_16
W0226 13:19:56.543000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_16
W0226 13:19:56.703000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_20
W0226 13:19:56.707000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_19
W0226 13:19:56.711000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_18
W0226 13:19:56.713000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_17
```

Differential Revision: D54140488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120655
Approved by: https://github.com/jackiexu1992
2024-02-28 05:35:36 +00:00
12995a5d9d [2/2] Intel GPU Runtime Upstreaming for Generator (#118613)
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Generator](https://github.com/pytorch/pytorch/pull/118528), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.

# Design
Currently, it primarily offers generator-related APIs (a brief usage sketch follows the list), including

- `torch.xpu.default_generators`
- `torch.xpu.get_rng_state`
- `torch.xpu.get_rng_state_all`
- `torch.xpu.initial_seed`
- `torch.xpu.manual_seed`
- `torch.xpu.manual_seed_all`
- `torch.xpu.seed`
- `torch.xpu.seed_all`
- `torch.xpu.set_rng_state`
- `torch.xpu.set_rng_state_all`
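
A brief, hedged usage sketch of the APIs listed above (requires an XPU-enabled build):

```
import torch

torch.xpu.manual_seed(42)            # seed the current XPU device
state = torch.xpu.get_rng_state()    # snapshot the RNG state
print(torch.xpu.initial_seed())      # expected to match the manual seed
torch.xpu.set_rng_state(state)       # restore the snapshot
```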

# Additional Context
Differences from CUDA:
The generator-related frontend Python APIs map 1:1 to CUDA's.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118613
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD
2024-02-28 05:28:11 +00:00
8ba4cb451f Fix an import loop (#119820)
Summary:
We ran into the following import loop when testing aps:

```
Traceback (most recent call last):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/forkserver.py", line 274, in main
    code = _serve_one(child_r, fds,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/forkserver.py", line 313, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 234, in prepare
    _fixup_main_from_name(data['init_main_from_name'])
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 258, in _fixup_main_from_name
    main_content = runpy.run_module(mod_name,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 224, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/icvr/icvr_launcher.py", line 29, in <module>
    class ICVRConfig(AdsComboLauncherConfig):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/common/ads_launcher.py", line 249, in <module>
    class AdsComboLauncherConfig(AdsConfig):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/common/app_config.py", line 16, in <module>
    class AdsConfig(RecTrainAppConfig):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/apf/rec/config_def.py", line 47, in <module>
    class EmbeddingKernelConfig:
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/apf/rec/config_def.py", line 52, in EmbeddingKernelConfig
    cache_algorithm: CacheAlgorithm = CacheAlgorithm.LRU
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torchrec/distributed/types.py", line 501, in <module>
    class ParameterSharding:
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torchrec/distributed/types.py", line 527, in ParameterSharding
    sharding_spec: Optional[ShardingSpec] = None
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py", line 48, in <module>
    class ShardingSpec(ABC):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py", line 55, in ShardingSpec
    tensor_properties: sharded_tensor_meta.TensorProperties,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharded_tensor/__init__.py", line 21, in <module>
    def empty(sharding_spec: shard_spec.ShardingSpec,
ImportError: cannot import name 'ShardingSpec' from partially initialized module 'torch.distributed._shard.sharding_spec.api' (most likely due to a circular import) (/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py)
```

Using future annotations to mitigate.

Test Plan:
```
hg update 1b1b3154616b70fd3325c467db1f7e0f70182a74
CUDA_VISIBLE_DEVICES=1,2 buck2 run @//mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_rep
```

Differential Revision: D53685582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119820
Approved by: https://github.com/fegin
2024-02-28 05:09:16 +00:00
e9a961f66a [dynamo][refactor] Use originating_source for HASATTR (#120723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120723
Approved by: https://github.com/jansel
ghstack dependencies: #120520, #120590, #120721
2024-02-28 05:00:59 +00:00
a774baa501 [audio hash update] update the pinned audio hash (#120748)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120748
Approved by: https://github.com/pytorchbot
2024-02-28 04:47:38 +00:00
184e815c74 Add TORCH_LOGS_FORMAT=short alias (#120757)
Shorthand for `"%(levelname)s:%(name)s:%(message)s"` which is hard to
remember.

I find the default formatter annoying since just the metadata fills up
most of the width of my terminal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120757
Approved by: https://github.com/ezyang
2024-02-28 04:40:48 +00:00
bd5f290505 [vision hash update] update the pinned vision hash (#120749)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120749
Approved by: https://github.com/pytorchbot
2024-02-28 04:36:16 +00:00
bfa71b523d add complex32 to v3_dtypes (#120388)
Fixes [#120290](https://github.com/pytorch/pytorch/issues/120290)
Fixes https://github.com/pytorch/pytorch/issues/73502

Use `v3_dtypes` and `torch._utils._rebuild_tensor_v3` to handle `torch.save` of complex32 tensors.
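
A minimal sketch of the round trip that now works:

```
import torch

x = torch.zeros(4, dtype=torch.complex32)
torch.save(x, "x.pt")
y = torch.load("x.pt")
print(y.dtype)  # torch.complex32
```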

result:
![image](https://github.com/pytorch/pytorch/assets/37650440/18b6cbb3-fb3f-4855-9d48-374014647988)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120388
Approved by: https://github.com/albanD
2024-02-28 02:32:29 +00:00
5a53c0ff23 [dynamo][refactor] Rename LIST_LENGTH to SEQUENCE_LENGTH, separate DICT_LENGTH (#120721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120721
Approved by: https://github.com/jansel
ghstack dependencies: #120520, #120590
2024-02-28 02:19:10 +00:00
1627d9e06d [aot_inductor] added a utility function aoti_torch_print_tensor_handle (#120660)
Added a function to print tensor values for a tensor handle.
It can be injected into the cpp wrapper code to help debug
numerical issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120660
Approved by: https://github.com/desertfire
2024-02-28 02:08:34 +00:00
d21c6eb215 Do not wrap output with input device inside _to_copy (#119868)
Fixing https://github.com/pytorch/pytorch/issues/118790

This diff revert a small part of the code that was introduced in https://github.com/pytorch/pytorch/pull/104689

The PR above added a comment that "In case of dtype promotion, fake tensor converted into tensor",
but it's not always the case that a conversion in dtype causes a fake tensor to become a plain tensor.

When such a conversion does not happen, we get the following error:
```
Creating a new Tensor subclass FakeTensor but the raw Tensor object is already associated to
 a python object of type FakeTensor
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119868
Approved by: https://github.com/ezyang, https://github.com/thiagocrepaldi
2024-02-28 01:51:43 +00:00
33499ec41b [FSDP2][DCP][DSD] Add FSDP2 model state dict unit test with distributed state dict (#120680)
This adds some initial unit tests for FSDP2 model state dict only.

This PR adds two tests:

1. Add a unit test for parity check for FSDP `model.state_dict()` with distributed_state_dict's `get_model_state_dict`.
2. Add a unit test to make sure`StateDictOptions(full_state_dict=True, cpu_offload=True)` in distributed_state_dict work for FSDP2 model state_dict.

Optimizer state dict will be in follow up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120680
Approved by: https://github.com/awgu
2024-02-28 01:40:04 +00:00
1aa9099839 [CLANGTIDY] Enable clang-tidy in torch/csrc/xpu (#120616)
# Motivation
refer to [#118504](https://github.com/pytorch/pytorch/pull/118504), enabling clang-tidy in `torch/csrc/xpu`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120616
Approved by: https://github.com/albanD
2024-02-28 01:35:25 +00:00
1a1fc1047d Add structured trace logs (#120289)
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit

How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implemented as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). `trace_structured` is the main way to emit a trace log. Unusually, there are separate "metadata" and "payload" fields. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long, is emitted after the metadata log line, and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py: the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not to be very stable.

https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
ghstack dependencies: #120712
2024-02-28 01:01:41 +00:00
677e67c399 Update nn.Module._apply to not gate on should_use_set_data when swap_tensors is set (#120659)
This updates the nesting of if statements in `nn.Module._apply` such that if

`torch.__future__.set_swap_module_params_on_conversion(True)`, we always try to swap regardless of whether
- `torch._has_compatible_shallow_copy_type(param, fn(param))`
- `torch.__future__.set_overwrite_module_params_on_conversion` is set

This means that `meta_module.to_empty(device=...)` can now use the swap_tensors path cc @awgu
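
A hedged sketch of the pattern this unlocks (illustrative; see the linked PR for the real tests):

```
import torch

torch.__future__.set_swap_module_params_on_conversion(True)

with torch.device("meta"):
    m = torch.nn.Linear(4, 4)   # parameters allocated on the meta device

# to_empty() can now take the swap_tensors path under the future flag above.
m.to_empty(device="cpu")
print(m.weight.device)  # cpu
```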

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120659
Approved by: https://github.com/albanD
2024-02-28 00:59:34 +00:00
213b3ac3f2 [BE] fail_* variables don't need to be shared across restarts, they're set only once (#120712)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120712
Approved by: https://github.com/yanboliang
2024-02-28 00:48:11 +00:00
2ebf2c88ba Add test to check that COW inputs are not materialized (#119507)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119507
Approved by: https://github.com/ezyang
ghstack dependencies: #120455
2024-02-28 00:37:33 +00:00
cabc09a5f2 Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-02-28 00:37:33 +00:00
cyy
1e9fafc160 [Clang-tidy header][20/N] Fix clang-tidy warnings in aten/src/ATEN/*.{cpp,h} (#120574)
This PR fixes some clang-tidy warnings in aten/src/ATen/*.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120574
Approved by: https://github.com/Skylion007
2024-02-28 00:13:05 +00:00
9c597ff137 use condition_variable and wait_until in nccl dump on timeout (#120544)
Fixes test_c10d_nccl.py -k test_timeout_dumps_timing_enabled_True.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120544
Approved by: https://github.com/atalman
2024-02-28 00:06:08 +00:00
14b258b5bc Fix broken link in README (#120698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120698
Approved by: https://github.com/janeyx99
2024-02-27 23:55:06 +00:00
5929d4e830 [CUDA][cuBLAS] Check if a context is present when grabbing a cuBLAS handle (#120131)
cuBLAS has indicated that certain kernels will transition to using the driver API over the CUDA runtime API, which we've observed to break existing tests (e.g., DataParallel) that use multithreading and may not eagerly grab a context via `cudaSetDevice`.

CC @Aidyn-A @ptrblck

Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120131
Approved by: https://github.com/atalman
2024-02-27 22:45:16 +00:00
f36e00b8ce Revert "[inductor][Gemm] Autotune with matrix_instr_nonkdim for AMDGPU (#120639)"
This reverts commit 78f53a3f731ee67dcffd308519ed48a745640dde.

Reverted https://github.com/pytorch/pytorch/pull/120639 on behalf of https://github.com/izaitsevfb due to breaking ROCm ([comment](https://github.com/pytorch/pytorch/pull/120639#issuecomment-1967585568))
2024-02-27 21:05:57 +00:00
6cc7f9a2e6 Limit loop unrolling (#120023)
Tacotron2 causes massive loop unrolling, resulting in very large graphs (26k nodes) that were causing inductor (and tracing itself) to choke.

The unrolling size is controlled by the environment variable TORCHDYNAMO_MAX_LOOP_UNROLL_NODES which defaults to the arbitrary value 5000.
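
A hedged sketch of adjusting the knob (the assumption here is that the variable must be set before dynamo's config module is imported):

```
import os

# Raise the unrolled-node cap; the value is illustrative.
os.environ["TORCHDYNAMO_MAX_LOOP_UNROLL_NODES"] = "10000"

import torch  # import after setting the env var so the config picks it up
```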

This updates the tacotron2 timings as follows:
eager timing: 3m:23s -> 35s
aot_eager timing: 4m:12s -> 39s
inductor timing: 22m:24s ->1m

For reference the big loop in tacotron2 was this one (model.py[405]):
```
        while len(mel_outputs) < decoder_inputs.size(0) - 1:
            decoder_input = decoder_inputs[len(mel_outputs)]
            mel_output, gate_output, attention_weights = self.decode(decoder_input)
            mel_outputs += [mel_output.squeeze(1)]
            gate_outputs += [gate_output.squeeze(1)]
            alignments += [attention_weights]
```
which gets unrolled and inlined adding about 36 nodes to the graph per iteration.

Fixes #98467
Relates to #102839 which hopefully will result in a better fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120023
Approved by: https://github.com/yanboliang
2024-02-27 20:44:21 +00:00
f3dd2a544c Revert "Add structured trace logs (#120289)"
This reverts commit 9dfaef962cda5f65eec53e5fd6f07b5226ea65cb.

Reverted https://github.com/pytorch/pytorch/pull/120289 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54230697 ([comment](https://github.com/pytorch/pytorch/pull/120289#issuecomment-1967477120))
2024-02-27 19:49:05 +00:00
eqy
65efece3a4 [CUDA][cuBLAS] Bump test_cublas_baddbmm_large_input tolerances (#117889)
Unfortunate that the current `rtol=1e-5` hits a literal 1 / 1000000 mismatch (`rtol=1.04e-5`) on L40.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117889
Approved by: https://github.com/atalman
2024-02-27 19:05:20 +00:00
5b5c167adc [dynamo] Add some helpers to PyCodegen (#120684)
This are used in later PRs in the stack

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120684
Approved by: https://github.com/yanboliang
2024-02-27 18:46:51 +00:00
0c8bb6f70c [dtensor] standardize tuple strategy handling for foreach ops (#120695)
This PR refactors the tuple strategy handling logic, and allow
TupleStrategy to have both input/output specs for each OpStrategy child,
so that we could further enable operators like foreach norm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120695
Approved by: https://github.com/awgu
2024-02-27 18:23:11 +00:00
440a9b212d [profiler] log process group config information in distributedInfo field (#119443)
Summary: Process group config is essential for analyzing collective patterns. We have added this to Execution Trace; now we expose this information in Kineto as well.

Differential Revision: D53557965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119443
Approved by: https://github.com/kwen2501
2024-02-27 18:21:54 +00:00
78f53a3f73 [inductor][Gemm] Autotune with matrix_instr_nonkdim for AMDGPU (#120639)
Summary:
Matrix multiplication with Triton is usually done in a tiled way. A tile size too large for a single hardware instruction to handle, e.g., a 128x128 matmul, has to be broken down into a sequence of smaller hardware mma instructions. On AMDGPU, matrix_instr_nonkdim controls the shape of the mma instructions; its default value in Triton is 32. This means that by default Triton will decompose a large tiled matmul operation into a sequence of 32x32x8 mma instructions. Other mma instructions are available, such as 16x16x16, which requires matrix_instr_nonkdim=16. This change enables tuning the value for Gemm, which seems to improve its performance by 20% - 2x.

Similar changes has been done to the HSTU ragged attention kernel D53386525.

Test Plan:

Before:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0410 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0487 ms 84.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0544 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0633 ms 64.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0687 ms 59.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0716 ms 57.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0748 ms 54.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0788 ms 52.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.1014 ms 40.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.1069 ms 38.4%
  SingleProcess AUTOTUNE takes 8.1153 seconds
```

After:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0417 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0470 ms 88.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0488 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0490 ms 85.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0525 ms 79.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0553 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0574 ms 72.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.0634 ms 65.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=2 0.0655 ms 63.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0681 ms 61.2%
  SingleProcess AUTOTUNE takes 11.4076 seconds
```

Before:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2094 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2452 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2763 ms 75.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2836 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2854 ms 73.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2951 ms 71.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2970 ms 70.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.4184 ms 50.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.5097 ms 41.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.5570 ms 37.6%
  SingleProcess AUTOTUNE takes 3.4052 seconds
```

After:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2117 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2429 ms 87.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2485 ms 85.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2526 ms 83.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2537 ms 83.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2554 ms 82.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2623 ms 80.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2695 ms 78.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2758 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.2792 ms 75.8%
  SingleProcess AUTOTUNE takes 11.3538 seconds

```

Before:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5901 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 1.9380 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 1.9943 ms 79.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.0640 ms 77.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.0941 ms 75.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.1272 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.1554 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.2931 ms 69.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 3.7016 ms 43.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 4.6021 ms 34.6%
  SingleProcess AUTOTUNE takes 9.0523 seconds
```

After:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5862 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.6924 ms 93.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.7616 ms 90.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.8159 ms 87.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.9340 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.9352 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0378 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0983 ms 75.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1138 ms 75.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1657 ms 73.2%
  SingleProcess AUTOTUNE takes 8.2225 seconds
```

Before:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 12.0134 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 14.8082 ms 81.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 15.4242 ms 77.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 16.6869 ms 72.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 16.7751 ms 71.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 17.0145 ms 70.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 17.1363 ms 70.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 18.2159 ms 66.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 29.4726 ms 40.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 37.9039 ms 31.7%
  SingleProcess AUTOTUNE takes 11.0074 seconds
```

After:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 11.9554 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 12.9953 ms 92.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 13.7726 ms 86.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 13.9647 ms 85.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 14.9728 ms 79.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 15.3729 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 15.3955 ms 77.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 15.5647 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.0037 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.7432 ms 71.4%
  SingleProcess AUTOTUNE takes 14.9839 seconds
```

Reviewed By: xw285cornell, nmacchioni

Differential Revision: D54203170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120639
Approved by: https://github.com/xw285cornell, https://github.com/jansel
2024-02-27 18:16:33 +00:00
3f62b05d31 [export] Use forward hooks to capture module signatures. (#120468)
Summary:
When we export in non-strict mode with preserve_module_call_signature turned on, the following assertion error occurs today:
```
child_split[: len(parent_split)] == parent_split
```
This is because we monkey-patch the forward call directly, which breaks attribute propagation in the tracer. Implementing this with a forward hook is better because we don't have to alter the original module structure at all during export.
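A minimal sketch of the forward-hook idea on a toy module; the bookkeeping below is illustrative and not the actual export internals:
```python
# Hedged sketch: record a submodule's call signature with a forward hook
# instead of monkey-patching module.forward. Names here are illustrative.
import torch
import torch.nn as nn

calls = {}

def make_hook(fqn):
    def hook(module, args, output):
        # record inputs/outputs without altering the module structure
        calls.setdefault(fqn, []).append((args, output))
    return hook

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.sub = nn.Linear(4, 4)

    def forward(self, x):
        return self.sub(x)

m = M()
handle = m.sub.register_forward_hook(make_hook("sub"))
m(torch.randn(2, 4))
handle.remove()
print(list(calls.keys()))  # ['sub']
```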

Test Plan: CI

Differential Revision: D54102714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120468
Approved by: https://github.com/ydwu4
2024-02-27 17:44:06 +00:00
ed3c256b61 Add lowering for adaptive_max_pool2d (#120254)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120254
Approved by: https://github.com/lezcano
2024-02-27 16:32:18 +00:00
27bb73fe46 [AOTI] Fix a strict-aliasing warning (#120628)
Summary: This gets rid of an annoying compile time warning, "dereferencing type-punned pointer will break strict-aliasing rules"

Differential Revision: [D54207229](https://our.internmc.facebook.com/intern/diff/D54207229)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120628
Approved by: https://github.com/Skylion007
2024-02-27 15:09:13 +00:00
c29ac05ac0 [inductor] correctly retrieve the "shared" attribute from a Triton binary (#120666)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120666
Approved by: https://github.com/jansel
2024-02-27 13:10:09 +00:00
435063aa89 Decomposition for upsample_linear{1d, 3d} (#114774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114774
Approved by: https://github.com/lezcano, https://github.com/vfdev-5, https://github.com/peterbell10
2024-02-27 11:57:45 +00:00
2ad66e6bf0 Fix test failure: Add torch.cuda._get_device_properties to dynamo trace rules (#120620)
In this PR stack, there were unrelated test failures within test_trace_rules.py. It turned out that torch.cuda._get_device_properties should be registered in _dynamo/trace_rules.py; a test failed because it was not.

This is a small fix which tries to get rid of the test failure by manually registering that function.

Note:
I am not sure whether this is the best way to fix this, as I am neither familiar with the trace rules nor with the introduction of torch.cuda._get_device_properties.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120620
Approved by: https://github.com/Skylion007
2024-02-27 10:46:01 +00:00
e3d64c4d5d [dynamo] Desugar accumulate_grad, fix .grad handling (#120590)
Fixes https://github.com/pytorch/pytorch/issues/118435
Fixes https://github.com/pytorch/pytorch/issues/119906

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120590
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #120520
2024-02-27 10:12:26 +00:00
9db6a849ed [FSDP] Clean missing and unexpected keys (#120600)
Currently, when loading with strict=False, or with strict=True and looking at
the error message, FQNs are garbled with FSDP details such as "_fsdp_wrapped_module".
This makes it tricky for upstream applications to validate which keys are expected
to be missing or unexpected (for example with PEFT, where the state_dict is loaded
non-strictly), and it makes the error message more complicated with FSDP details.

This PR cleans those prefixes by using `clean_tensor_name` in FSDP's existing
post-load_state_dict hooks. Currently, only the full_state_dict implementation is tested; the remaining implementations can be tested as follow-up work.
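A rough sketch of the cleaning idea; `_strip_fsdp_prefixes` below stands in for FSDP's internal `clean_tensor_name` helper, and the prefix list is an assumption:
```python
# Hedged sketch: strip FSDP wrapper prefixes from missing/unexpected keys in a
# post-load_state_dict hook. The prefix list here is illustrative.
_FSDP_PREFIXES = ("_fsdp_wrapped_module.", "_flat_param.")

def _strip_fsdp_prefixes(fqn: str) -> str:
    for prefix in _FSDP_PREFIXES:
        fqn = fqn.replace(prefix, "")
    return fqn

def _clean_incompatible_keys_hook(module, incompatible_keys):
    # incompatible_keys carries .missing_keys and .unexpected_keys lists,
    # which a post hook may rewrite in place
    for keys in (incompatible_keys.missing_keys, incompatible_keys.unexpected_keys):
        keys[:] = [_strip_fsdp_prefixes(k) for k in keys]

# usage sketch: model.register_load_state_dict_post_hook(_clean_incompatible_keys_hook)
```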

Differential Revision: [D54182472](https://our.internmc.facebook.com/intern/diff/D54182472/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120600
Approved by: https://github.com/XilunWu, https://github.com/fegin
2024-02-27 07:43:45 +00:00
b2a318d856 [PyTorch][ExportedProgram] add 'is_lifted_tensor_constant' and 'get_lifted_tensor_constant' utils (#120546)
as title

Differential Revision: [D54149274](https://our.internmc.facebook.com/intern/diff/D54149274/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120546
Approved by: https://github.com/kirklandsign
2024-02-27 07:16:55 +00:00
7c556428c7 Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)
Fixes #115331.

This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary:

- `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`.
- Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1` (see the short illustration after this list).
- Updated the `ArgumentInfo` struct, as it hardcodes the device index as an 8-bit field [^1]. This might be a breaking change; I'm not sure whether users rely on this.
- Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS`

[^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1.
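A short illustration of the bounds checking on the Python side; the error type shown is an assumption, not the literal PyTorch behavior:
```python
# Hedged illustration: small device indices behave as before, while
# out-of-range indices should now be rejected instead of silently wrapping.
import torch

print(torch.device("cuda", 3).index)  # 3 -- valid indices are unchanged
try:
    torch.device("cpu", 10**6)        # far outside the valid index range
except Exception as err:              # exact error type may vary
    print("rejected:", err)
```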
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/huydhn
2024-02-27 07:05:48 +00:00
cbbc309cae [pytree][reland] Require pytree serialized_type_name (#120636)
Relanding https://github.com/pytorch/pytorch/pull/119718 as the diff which prevents breakages of torchrec [D53857843](https://www.internalfb.com/diff/D53857843) has landed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120636
Approved by: https://github.com/avikchaudhuri
2024-02-27 06:53:33 +00:00
12f724c779 [export] preserve constant fqn (#120664)
Summary:
Previously we were renaming constants to `lifted_constant_tensor0` or equivalent. This PR changes things so that the constants retain the same FQN as in the original eager module.

Actually, `symbolic_trace` already is supposed to do this, but the code path is not triggered when used from `make_fx`, since we don't pass an actual `nn.Module` instance to `trace()`, but rather a multiply-wrapped-functionalized-lambda-thing.

So, I reproduced the essential logic outside of make_fx, at the export layer.

Test Plan: added a unit test

Differential Revision: D54221616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120664
Approved by: https://github.com/SherlockNoMad
2024-02-27 06:35:51 +00:00
a358b23a6a Keep test order due to rename_privateuse1_backend is disposable (#120464)
With the change in https://github.com/pytorch/pytorch/pull/120399.
As rename_privateuse1_backend is disposable, running test_external_module_register with a renamed backend may cause problems. Change the test case name so that the correct (ASCII) ordering is preserved.
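For context, a small illustration of the ordering assumption (standard unittest behavior, not PyTorch-specific); the test names below are hypothetical:
```python
# Hedged illustration: unittest collects test methods in ASCII-sorted order,
# so renaming a test changes when it runs relative to the others.
import unittest

class Demo(unittest.TestCase):
    def test_b_external_module_register(self):  # runs second
        pass

    def test_a_rename_backend(self):            # runs first
        pass

print(unittest.TestLoader().getTestCaseNames(Demo))
# ['test_a_rename_backend', 'test_b_external_module_register']
```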
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120464
Approved by: https://github.com/albanD
2024-02-27 05:38:43 +00:00
5a5b654481 [BE]: Enable ruff LOG checks (#120674)
Enable LOG error codes in ruff to find bad usages of the logger: https://docs.astral.sh/ruff/rules/#flake8-logging-log
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120674
Approved by: https://github.com/ezyang
2024-02-27 04:37:20 +00:00
b6139b1e57 [PyTorch][CUDA Caching Allocator] Export sync-stream-and-free-HBM counter in memory_stats for performance debugging (#120050)
Differential Revision: D53734057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120050
Approved by: https://github.com/xw285cornell
2024-02-27 04:34:53 +00:00
a1c641f118 [executorch hash update] update the pinned executorch hash (#120675)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120675
Approved by: https://github.com/pytorchbot
2024-02-27 03:59:16 +00:00
237773132d Restore artifact name in log messages (#120671)
Yuzhen Huang was complaining to me that searching for `__recompile`
no longer works. This is because the glog format uses the filename, not
the logger name, so we lost the artifact name. Add it back.

Looks like:

```
V0226 15:56:04.142000 139828992779264 torch/_dynamo/guards.py:1084] [0/2] __guards: ___check_type_id(L['inputs'], 7626144)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120671
Approved by: https://github.com/Skylion007
2024-02-27 03:37:11 +00:00
ac28571742 [vision hash update] update the pinned vision hash (#119944)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119944
Approved by: https://github.com/pytorchbot
2024-02-27 03:25:51 +00:00
9d423f0e91 [audio hash update] update the pinned audio hash (#120135)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120135
Approved by: https://github.com/pytorchbot
2024-02-27 03:20:00 +00:00
63f874b476 [dynamo][guards-cpp-refactor] DictGetItemGuardAccessor for f_locals (#120593)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120593
Approved by: https://github.com/jansel
2024-02-27 03:13:55 +00:00
27990045ff docker: Only match tags that start with v* (#120670)
To avoid issues where the version could be confused with a ciflow tag.

Example:

```
❯ git describe --tags --always
ciflow/periodic/c3496d50f0bb437c70f27085f71155209277bfd4-47-g4ca24959d1a
❯ git describe --tags --always --match "v[1-9]*.*"
v1.8.0-rc1-36500-g4ca24959d1a
```

Resolves https://github.com/pytorch/pytorch/issues/120392

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120670
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-02-27 02:55:33 +00:00
cf6df886a0 Remove hard numpy dependency from experimental_ops.py (#119520)
Based on similar code in the codebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119520
Approved by: https://github.com/albanD
2024-02-27 02:46:13 +00:00
2de7468d2b Switch to native functional collective by default (#120370)
This enables native functional collectives by default. After this PR:
- The Python APIs remain backward compatible. Users will receive a deprecation warning if they use `(rank, tags)` as process group identifier.
- Collectives will be captured as `_c10d_functional` ops in post-grad fx graphs. The change will not affect end-users, but it will impact `torch-xla`, which has implemented an all-reduce backend based on the existing `c10d_functional` IR. The migration for `torch-xla` use cases is excluded from this PR and will be coordinated separately (see communications in #93173).
- Collectives will be lowered to and codegen'd by new Inductor collective IRs (`ir._CollectiveKernel` and `ir._WaitKernel`). This change will not affect end-users.

Testing performed:
- We have been running a set of representative unit tests with both the new native funcol and the old py funcol in CI. These tests will continue to run with the old py funcol after this PR, so they are covered until they are removed.
- Manually verified with e2e llama model training with DTensor + functional collectives (https://github.com/fairinternal/xlformers/tree/pt2_llm/pt2d#create-your-local-development-env).

Fallback mechanism:
- Introduced a temporary environment variable `TORCH_DISABLE_NATIVE_FUNCOL` that allows users to fall back to the previous implementation. We don't expect the migration to break anything; the mechanism is a safety measure to reduce potential disruption in case the PR causes unforeseen breakages.
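A hedged usage sketch of the new-style call plus the escape hatch above; treat the module path and reduceOp spelling as assumptions rather than a stable API reference:
```python
# Hedged sketch: functional all_reduce with the temporary fallback env var.
import os
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

# Safety valve from this PR: set to "1" to fall back to the old Python funcol.
os.environ.setdefault("TORCH_DISABLE_NATIVE_FUNCOL", "0")

def allreduce_sum(t: torch.Tensor) -> torch.Tensor:
    # Passing a process group is preferred; the legacy (ranks, tag) identifier
    # still works but now emits a deprecation warning.
    return funcol.all_reduce(t, reduceOp="sum", group=dist.group.WORLD)
```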

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120370
Approved by: https://github.com/wconstab, https://github.com/yf225
2024-02-27 01:53:56 +00:00
8a59f49da2 [dynamo][compile-time] Collect guard debug stack info only with logs enabled (#120520)
Reduces backend=eager compile time from 33 to 19 seconds for `MobileBertForQuestionAnswering`. This also helps an internal model where the guards.add function was taking 124 seconds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120520
Approved by: https://github.com/mlazos
2024-02-27 01:51:16 +00:00
2e0e545759 [EZ][BE] Use nested namespace in functorch (#120663)
I should really enable this clang-tidy check rather than doing it by hand
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120663
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2024-02-27 01:45:32 +00:00
b3fe53e1ad [1/2] Intel GPU Runtime Upstreaming for Generator (#118528)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the last runtime component we would like to upstream is `Generator`, which is responsible for pseudo-random number generation. To facilitate the code review, we split the code changes into 2 PRs. This is one of the two PRs and covers the changes under `aten`.

# Design
Following the previous design, `c10::GeneratorImpl` is the device-agnostic abstraction of a random number generator, so we introduce an XPU generator, `XPUGeneratorImpl`, inheriting from `c10::GeneratorImpl`, to manage random states on an Intel GPU device. The Intel GPU runtime `Generator` adopts the same algorithm as the CPU one. The corresponding C++ files are placed in the aten/src/ATen/xpu/ folder and built into `libtorch_xpu.so`.
This PR provides the following APIs:
- `getDefaultXPUGenerator`
- `createXPUGenerator`

# Additional Context
The 2nd PR will cover `python frontend`.

The differences from CUDA:
- The generator-related ATen C++ APIs map 1:1 with CUDA.
- `XPUGeneratorImpl`'s member functions differ slightly from CUDA's.
- The following CUDA-specific counterpart APIs are absent:
  - capture_prologue
  - capture_epilogue
  - philox_cuda_state
  - reset_rnn_state

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118528
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-02-27 01:39:40 +00:00
f064dec7e0 Add torch.ops.aten.print (#120295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120295
Approved by: https://github.com/zou3519
2024-02-27 01:34:59 +00:00
ef9b6d6816 Replace individual detaches with overall torch.no_grad decorator (#120638)
Fixes https://github.com/pytorch/pytorch/issues/120611.

At first, I thought there were too many detaches, but @awgu and I concluded that both `clip_grad_norm_` and `clip_grad_value_` should run under torch.no_grad, similar to the optimizer step. One option is to keep calling `detach`, but doing that on many tensors is slower than entering a no_grad context (I think?), and Andrew had noticed that "the 1st round of detaches takes 10 ms for FSDP2, whereas existing FSDP's clip_grad_norm_ only takes 3 ms total" since there are more tensors in FSDP2.

This change also disables grad mode for the foreach path of `clip_grad_value_`; the first attempt left that out as an oversight. I'm not sure how to add a test case for this, since grad mode is turned back on after the call.
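A minimal sketch of the pattern adopted here, assuming the foreach clamp kernels; this is illustrative, not the actual `clip_grad_value_` implementation:
```python
# Hedged sketch: run the whole clipping routine under no_grad instead of
# detaching every gradient tensor individually.
import torch

@torch.no_grad()
def clip_grad_value_sketch(parameters, clip_value: float) -> None:
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return
    # foreach kernels operate on the raw grads; no per-tensor .detach() needed
    torch._foreach_clamp_min_(grads, -clip_value)
    torch._foreach_clamp_max_(grads, clip_value)
```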

New profile is not much different from the one in the bottom of this stack, but the number of detaches is 0 :D:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (c71bcceb)]$ python playground2.py
STAGE:2024-02-26 13:07:15 211224:211224 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-26 13:07:16 211224:211224 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-26 13:07:16 211224:211224 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       cudaLaunchKernel        70.63%     110.415ms        70.63%     110.415ms       5.811ms       0.000us         0.00%       0.000us       0.000us            19
                               aten::linalg_vector_norm         0.18%     284.000us        26.00%      40.636ms      40.636ms       3.000us         0.99%       3.000us       3.000us             1
                                            aten::clamp         0.09%     148.000us        14.88%      23.261ms      23.261ms       1.000us         0.33%       1.000us       1.000us             1
                                               aten::to         0.75%       1.170ms        14.05%      21.970ms      84.826us       0.000us         0.00%     258.000us       0.996us           259
                                         aten::_to_copy         2.28%       3.562ms        13.31%      20.800ms     161.240us       0.000us         0.00%     258.000us       2.000us           129
                                    aten::_foreach_norm         4.44%       6.935ms        12.72%      19.878ms       9.939ms      19.000us         6.29%      21.000us      10.500us             2
                                              aten::add         0.11%     173.000us        10.97%      17.153ms      17.153ms       1.000us         0.33%       1.000us       1.000us             1
                                            aten::stack         2.99%       4.673ms         9.15%      14.300ms      14.300ms       0.000us         0.00%       6.000us       6.000us             1
                                            aten::copy_         5.49%       8.586ms         8.96%      14.001ms     108.535us     258.000us        85.43%     258.000us       2.000us           129
                                       aten::reciprocal         0.11%     179.000us         8.35%      13.051ms      13.051ms       1.000us         0.33%       1.000us       1.000us             1
                                              aten::cat         0.64%     993.000us         4.42%       6.902ms       6.902ms       6.000us         1.99%       6.000us       6.000us             1
                                            aten::zeros         0.04%      69.000us         4.28%       6.698ms       3.349ms       0.000us         0.00%       2.000us       1.000us             2
                                            aten::zero_         0.04%      66.000us         4.13%       6.462ms       3.231ms       0.000us         0.00%       2.000us       1.000us             2
                                            aten::fill_         0.06%      98.000us         4.09%       6.396ms       3.198ms       2.000us         0.66%       2.000us       1.000us             2
                                    aten::_foreach_mul_         1.50%       2.342ms         3.79%       5.924ms       2.962ms      10.000us         3.31%      10.000us       5.000us             2
                                            aten::empty         3.27%       5.115ms         3.27%       5.115ms      19.826us       0.000us         0.00%       0.000us       0.000us           258
                                    aten::empty_strided         2.07%       3.237ms         2.07%       3.237ms      25.093us       0.000us         0.00%       0.000us       0.000us           129
                             cudaDeviceEnablePeerAccess         1.93%       3.023ms         1.93%       3.023ms       1.512ms       0.000us         0.00%       0.000us       0.000us             2
                                        aten::unsqueeze         1.21%       1.896ms         1.74%       2.725ms      10.645us       0.000us         0.00%       0.000us       0.000us           256
                                        cudaMemcpyAsync         1.01%       1.572ms         1.01%       1.572ms      12.186us       0.000us         0.00%       0.000us       0.000us           129
                                       aten::as_strided         0.54%     839.000us         0.54%     839.000us       3.265us       0.000us         0.00%       0.000us       0.000us           257
                                    cudaStreamWaitEvent         0.34%     539.000us         0.34%     539.000us       2.089us       0.000us         0.00%       0.000us       0.000us           258
                                        cudaEventRecord         0.18%     274.000us         0.18%     274.000us       1.062us       0.000us         0.00%       0.000us       0.000us           258
                                              aten::mul         0.07%     107.000us         0.08%     132.000us     132.000us       1.000us         0.33%       1.000us       1.000us             1
                                  cudaDeviceSynchronize         0.01%      17.000us         0.01%      17.000us       8.500us       0.000us         0.00%       0.000us       0.000us             2
                                cudaDeviceCanAccessPeer         0.00%       7.000us         0.00%       7.000us       3.500us       0.000us         0.00%       0.000us       0.000us             2
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us         0.66%       2.000us       1.000us             2
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      13.000us         4.30%      13.000us       3.250us             4
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         1.99%       6.000us       3.000us             2
                         Memcpy PtoP (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     258.000us        85.43%     258.000us       2.000us           129
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         1.99%       6.000us       3.000us             2
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us         0.99%       3.000us       3.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      10.000us         3.31%      10.000us       2.500us             4
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 156.319ms
Self CUDA time total: 302.000us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120638
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #120623
2024-02-27 01:27:05 +00:00
df72819f91 clip_grad_norm can use fast foreach path for inf norm (#120623)
Now that foreach_norm supports inf, we should not special case it.

For a mere 256 parameters, we get a 30 ms win in CPU time and a decrease from ~800 us to ~300 us in CUDA time. The win only grows with more parameters.
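A minimal sketch of the fast path this enables (single-device case shown; names are illustrative):
```python
# Hedged sketch: _foreach_norm now accepts ord=inf, so the per-tensor
# fallback is no longer needed for the inf-norm case.
import torch
from math import inf

def total_inf_norm(parameters) -> torch.Tensor:
    grads = [p.grad for p in parameters if p.grad is not None]
    per_tensor_norms = torch._foreach_norm(grads, inf)  # fused per device/dtype group
    return torch.stack(per_tensor_norms).max()
```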

New profile:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (bf1c0490|REBASE-i|detached HEAD)]$ python playground2.py
STAGE:2024-02-26 13:14:10 395517:395517 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-26 13:14:11 395517:395517 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-26 13:14:11 395517:395517 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       cudaLaunchKernel        67.01%     102.262ms        67.01%     102.262ms       5.382ms       2.000us         0.66%       2.000us       0.105us            19
                               aten::linalg_vector_norm         0.20%     311.000us        23.44%      35.776ms      35.776ms       3.000us         0.99%       3.000us       3.000us             1
                                               aten::to         0.79%       1.208ms        14.62%      22.311ms      86.143us       0.000us         0.00%     263.000us       1.015us           259
                                            aten::clamp         0.12%     182.000us        13.96%      21.303ms      21.303ms       1.000us         0.33%       1.000us       1.000us             1
                                         aten::_to_copy         2.38%       3.628ms        13.83%      21.103ms     163.589us       0.000us         0.00%     263.000us       2.039us           129
                                    aten::_foreach_norm         4.71%       7.185ms        13.54%      20.659ms      10.329ms      19.000us         6.29%      23.000us      11.500us             2
                                              aten::add         0.14%     211.000us        10.86%      16.580ms      16.580ms       1.000us         0.33%       1.000us       1.000us             1
                                            aten::stack         3.11%       4.744ms         9.59%      14.642ms      14.642ms       0.000us         0.00%       6.000us       6.000us             1
                                            aten::copy_         5.71%       8.721ms         9.27%      14.152ms     109.705us     258.000us        85.43%     263.000us       2.039us           129
                                       aten::reciprocal         0.13%     193.000us         7.93%      12.100ms      12.100ms       1.000us         0.33%       1.000us       1.000us             1
                                              aten::cat         0.67%       1.017ms         4.67%       7.129ms       7.129ms       6.000us         1.99%       6.000us       6.000us             1
                                            aten::zeros         0.05%      79.000us         4.46%       6.800ms       3.400ms       0.000us         0.00%       2.000us       1.000us             2
                                            aten::zero_         0.05%      79.000us         4.28%       6.537ms       3.268ms       0.000us         0.00%       2.000us       1.000us             2
                                            aten::fill_         0.09%     131.000us         4.23%       6.458ms       3.229ms       2.000us         0.66%       2.000us       1.000us             2
                                    aten::_foreach_mul_         1.56%       2.377ms         3.86%       5.896ms       2.948ms      10.000us         3.31%      10.000us       5.000us             2
                                            aten::empty         3.55%       5.414ms         3.55%       5.414ms      20.984us       0.000us         0.00%       0.000us       0.000us           258
                                    aten::empty_strided         2.18%       3.323ms         2.18%       3.323ms      25.760us       0.000us         0.00%       0.000us       0.000us           129
                                           aten::detach         0.85%       1.302ms         2.10%       3.199ms      12.496us       0.000us         0.00%       0.000us       0.000us           256
                             cudaDeviceEnablePeerAccess         2.01%       3.069ms         2.01%       3.069ms       1.534ms       0.000us         0.00%       0.000us       0.000us             2
                                        aten::unsqueeze         1.24%       1.899ms         1.81%       2.769ms      10.816us       0.000us         0.00%       0.000us       0.000us           256
                                                 detach         1.24%       1.897ms         1.24%       1.897ms       7.410us       0.000us         0.00%       0.000us       0.000us           256
                                        cudaMemcpyAsync         1.01%       1.539ms         1.01%       1.539ms      11.930us       0.000us         0.00%       0.000us       0.000us           129
                                       aten::as_strided         0.58%     881.000us         0.58%     881.000us       3.428us       0.000us         0.00%       0.000us       0.000us           257
                                    cudaStreamWaitEvent         0.35%     540.000us         0.35%     540.000us       2.093us       0.000us         0.00%       0.000us       0.000us           258
                                        cudaEventRecord         0.18%     278.000us         0.18%     278.000us       1.078us       5.000us         1.66%       5.000us       0.019us           258
                                              aten::mul         0.08%     125.000us         0.09%     138.000us     138.000us       1.000us         0.33%       1.000us       1.000us             1
                                  cudaDeviceSynchronize         0.01%      13.000us         0.01%      13.000us       6.500us       0.000us         0.00%       0.000us       0.000us             2
                                cudaDeviceCanAccessPeer         0.00%       5.000us         0.00%       5.000us       2.500us       0.000us         0.00%       0.000us       0.000us             2
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us         0.66%       2.000us       1.000us             2
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      13.000us         4.30%      13.000us       3.250us             4
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         1.99%       6.000us       3.000us             2
                         Memcpy PtoP (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     258.000us        85.43%     258.000us       2.000us           129
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         1.99%       6.000us       3.000us             2
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us         0.99%       3.000us       3.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      10.000us         3.31%      10.000us       2.500us             4
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 152.613ms
Self CUDA time total: 302.000us
```

Compared to on main:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5a0a9644)]$ python playground2.py
STAGE:2024-02-26 13:09:56 285045:285045 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-26 13:09:57 285045:285045 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-26 13:09:57 285045:285045 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       cudaLaunchKernel        61.42%     113.375ms        61.42%     113.375ms     424.625us      45.000us         5.66%      45.000us       0.169us           267
                               aten::linalg_vector_norm        14.04%      25.909ms        37.67%      69.534ms     271.617us     514.000us        64.65%     559.000us       2.184us           256
                                               aten::to         0.78%       1.433ms        12.87%      23.751ms      91.703us       0.000us         0.00%     278.000us       1.073us           259
                                         aten::_to_copy         2.02%       3.730ms        12.09%      22.318ms     173.008us       0.000us         0.00%     278.000us       2.155us           129
                                            aten::clamp         0.09%     174.000us        11.43%      21.103ms      21.103ms       1.000us         0.13%       1.000us       1.000us             1
                                              aten::add         0.11%     205.000us         9.08%      16.768ms      16.768ms       1.000us         0.13%       1.000us       1.000us             1
                                            aten::copy_         4.94%       9.112ms         8.15%      15.043ms     116.612us     258.000us        32.45%     278.000us       2.155us           129
                                            aten::stack         2.76%       5.091ms         7.97%      14.719ms      14.719ms       0.000us         0.00%       6.000us       6.000us             1
                                       aten::reciprocal         0.11%     194.000us         7.01%      12.933ms      12.933ms       1.000us         0.13%       1.000us       1.000us             1
                                              aten::max         0.09%     165.000us         6.43%      11.868ms      11.868ms       3.000us         0.38%       3.000us       3.000us             1
                                           aten::detach         1.58%       2.911ms         4.12%       7.596ms      14.836us       0.000us         0.00%       0.000us       0.000us           512
                                              aten::cat         0.56%       1.042ms         3.73%       6.882ms       6.882ms       6.000us         0.75%       6.000us       6.000us             1
                                    aten::_foreach_mul_         1.36%       2.503ms         3.33%       6.145ms       3.072ms      10.000us         1.26%      10.000us       5.000us             2
                                                 detach         2.54%       4.685ms         2.54%       4.685ms       9.150us       0.000us         0.00%       0.000us       0.000us           512
                                    aten::empty_strided         1.92%       3.545ms         1.92%       3.545ms      27.481us       0.000us         0.00%       0.000us       0.000us           129
                             cudaDeviceEnablePeerAccess         1.64%       3.022ms         1.64%       3.022ms       1.511ms       0.000us         0.00%       0.000us       0.000us             2
                                        aten::unsqueeze         1.03%       1.892ms         1.49%       2.746ms      10.727us       0.000us         0.00%       0.000us       0.000us           256
                                       aten::as_strided         1.35%       2.494ms         1.35%       2.494ms       4.862us       0.000us         0.00%       0.000us       0.000us           513
                                        cudaMemcpyAsync         1.01%       1.868ms         1.01%       1.868ms      14.481us       4.000us         0.50%       4.000us       0.031us           129
                                    cudaStreamWaitEvent         0.41%     760.000us         0.41%     760.000us       2.946us       8.000us         1.01%       8.000us       0.031us           258
                                        cudaEventRecord         0.15%     276.000us         0.15%     276.000us       1.070us       8.000us         1.01%       8.000us       0.031us           258
                                              aten::mul         0.08%     139.000us         0.08%     153.000us     153.000us       1.000us         0.13%       1.000us       1.000us             1
                                            aten::empty         0.02%      35.000us         0.02%      35.000us      35.000us       0.000us         0.00%       0.000us       0.000us             1
                                  cudaDeviceSynchronize         0.01%      14.000us         0.01%      14.000us       7.000us       0.000us         0.00%       0.000us       0.000us             2
                                cudaDeviceCanAccessPeer         0.00%       5.000us         0.00%       5.000us       2.500us       0.000us         0.00%       0.000us       0.000us             2
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us     514.000us        64.65%     514.000us       2.008us           256
                         Memcpy PtoP (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     258.000us        32.45%     258.000us       2.000us           129
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.75%       6.000us       3.000us             2
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us         0.38%       3.000us       3.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.13%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.13%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.13%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.13%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      10.000us         1.26%      10.000us       2.500us             4
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 184.579ms
Self CUDA time total: 795.000us
```

For script:
```
import torch
from math import inf
from torch.nn.utils import clip_grad_norm_

params = [torch.rand(32, 16, device="cuda:3")*5 for _ in range(128)] + [torch.rand(32, 16, device="cuda:4")*-7 for _ in range(128)]
for p in params:
    p.grad = torch.rand_like(p)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    total_norm = clip_grad_norm_(params, 10.0, norm_type=inf)
    torch.cuda.synchronize()

print(p.key_averages().table(sort_by="cpu_time_total"))
print(total_norm)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120623
Approved by: https://github.com/Skylion007, https://github.com/mikaylagawarecki
2024-02-27 01:27:05 +00:00
b01bd1f7a1 Revert "Add torch.ops.aten.print (#120295)"
This reverts commit 3b944113c837e1111510487f4525aa07039462fe.

Reverted https://github.com/pytorch/pytorch/pull/120295 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54123688 ([comment](https://github.com/pytorch/pytorch/pull/120295#issuecomment-1965618191))
2024-02-27 01:18:48 +00:00
17560eb472 Revert "[Dynamo] Remove deadcode: unwrapping script_if_tracing (#120444)"
This reverts commit 4d2073bc3faa7f2014c4fb2f568e68fe195b6f99.

Reverted https://github.com/pytorch/pytorch/pull/120444 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54192376 ([comment](https://github.com/pytorch/pytorch/pull/120444#issuecomment-1965600268))
2024-02-27 00:58:00 +00:00
e874376f6a Mark test_reference_numerics_extremal__refs_frexp_cuda as xfail on windows (#120640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120640
Approved by: https://github.com/clee2000
2024-02-27 00:35:55 +00:00
d341b66e96 Revert [dynamo] support group=None when rewriting collectives (#12018) (#120677)
This reverts commit 298c686d3f7bc26399481b8830e71c4f02ce629c.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120677
Approved by: https://github.com/yifuwang, https://github.com/huydhn
2024-02-27 00:33:35 +00:00
fdae9363b3 [meta registration] efficient_attention_forward fix for NT inputs (#120594)
When cu_seqlens_q is provided, we should use the user-specified max_seqlen_q instead of inferring it as query.size(1):

1c7b0e7cd1/aten/src/ATen/native/transformers/cuda/attention.cu (L989)

This wasn't caught because the value is rounded to ceil(max_seqlen / 32) * 32 in the opinfos, and the opinfo inputs were small enough that this value was 32 in either case.
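A hedged sketch of the corrected selection logic; the argument names mirror the commit text, not necessarily the exact kernel signature:
```python
# Hedged sketch: prefer the caller-provided max_seqlen_q whenever
# cu_seqlens_q is given (ragged/NT input); otherwise infer from the query.
def effective_max_seqlen_q(query, cu_seqlens_q, max_seqlen_q):
    if cu_seqlens_q is not None:
        return max_seqlen_q   # user-specified for ragged batches
    return query.size(1)      # dense input: just the sequence dimension
```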

Differential Revision: [D54179733](https://our.internmc.facebook.com/intern/diff/D54179733)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120594
Approved by: https://github.com/drisspg
2024-02-27 00:10:37 +00:00
9dfaef962c Add structured trace logs (#120289)
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit

How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py contains some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implemented as a logger named torch.__trace, which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). There's a teensy bit of FB specific code to automatically enable trace logging if a /logs directory exists. `trace_structured` is the main way to emit a trace log (a minimal usage sketch follows this list). Unusually, there's a separate "metadata" and "payload" field. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long, is emitted after the metadata log line, and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not to be very stable.
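For illustration, here is a minimal, hypothetical sketch of the metadata/payload split described in the bullet above; the function body, the `compile_id` field, and the example metadata are illustrative stand-ins rather than the actual torch._logging implementation:

```python
import json
import logging

# Illustrative: a logger with the name described above, detached from the root hierarchy.
trace_log = logging.getLogger("torch.__trace")
trace_log.propagate = False
trace_log.addHandler(logging.StreamHandler())
trace_log.setLevel(logging.DEBUG)

def trace_structured_sketch(name, metadata_fn=lambda: {}, payload_fn=lambda: None):
    """Emit one short JSON metadata line, then an optional multi-line payload."""
    metadata = {name: metadata_fn(), "compile_id": "0/0"}  # hypothetical contextual field
    trace_log.debug(json.dumps(metadata))
    payload = payload_fn()
    if payload is not None:
        trace_log.debug(payload)  # may span multiple lines

# Usage: small JSON metadata, long payload (e.g. a printed FX graph).
trace_structured_sketch(
    "dynamo_output_graph",
    metadata_fn=lambda: {"sizes": {"x": [8, 16]}},
    payload_fn=lambda: "graph():\n    %x : [num_users=1] = placeholder[target=x]\n    ...",
)
```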

https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.

Testing that the fbcode detection works at https://www.internalfb.com/mlhub/pipelines/runs/fblearner/534553450 (Meta-only)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
2024-02-27 00:04:23 +00:00
ecb3f33a1a [dynamo] fix segfault in _debug_get_cache_entry_list (#120635)
Fix https://github.com/pytorch/pytorch/issues/120607.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120635
Approved by: https://github.com/jansel
2024-02-26 23:31:09 +00:00
64660b51f6 Add the hyperlink to the transformer doc (#120565)
Fixes #120488

- The shape for forward pass is clearly stated in the main [transformer class](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)

- Boolean mask for _key_padding_mask is also explained in the main transformer class.

Therefore, add the hyperlink to the transformer class explicitly so the user can refer back to the main class. Also, correct several symbols in the transformer doc from normal text style to math style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120565
Approved by: https://github.com/mikaylagawarecki
2024-02-26 23:11:58 +00:00
Kai
c59b14163b Implement aten::upsample_linear1d on mps (#115031)
Related to #77764

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115031
Approved by: https://github.com/malfet
2024-02-26 23:04:52 +00:00
30625ae582 Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we both have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) and have our own logic retry it as well.

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-26 22:21:14 +00:00
41adec3c59 Revert "Switch to native functional collective by default (#120370)"
This reverts commit 1f1bc0e6acc3613339b1001a7c9fcd1dfe7b6580.

Reverted https://github.com/pytorch/pytorch/pull/120370 on behalf of https://github.com/yifuwang due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/120370#issuecomment-1965362938))
2024-02-26 21:55:13 +00:00
7b1cc140aa Use lxml in scripts/compile_tests when it is available (#120633)
It's around 30x (300s -> 10s) faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120633
Approved by: https://github.com/oulgen
2024-02-26 21:35:22 +00:00
5a0a964444 [Dynamo] Fix guards for script_if_tracing or lru_cache fn with default args (#120390)
Fixes #120387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120390
Approved by: https://github.com/anijain2305
2024-02-26 19:40:14 +00:00
55b5908427 [PT2][Inductor]Add unbind node normalization (#120253)
Summary: Normalize unbind nodes for the followup split_cat pattern detection and node removals

Test Plan:
```
buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/f42297c2-2595-40a2-b270-5cec026f2fe4
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17451448578242323
Network: Up: 132KiB  Down: 88KiB  (reSessionID-fc725143-317a-42a9-bc7e-0bbab6ef9e5c)
Jobs completed: 27. Time elapsed: 3:09.2s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 11. Fail 0. Fatal 0. Skip 0. Build failure 0

```
buck2 test mode/opt mode/inplace caffe2/test/inductor/fb:test_split_cat_fx_passes_aten_fb
```

Differential Revision: D53964593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120253
Approved by: https://github.com/jackiexu1992
2024-02-26 19:13:26 +00:00
274b362442 [FSDP] Removed .detach in clip_grad_norm_ (#120612)
This seems unnecessary under `no_grad()` context. The unit tests still pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120612
Approved by: https://github.com/Skylion007
ghstack dependencies: #120231
2024-02-26 19:03:00 +00:00
fd3cf88f27 Rewrite docs about why we guard on dynamic dims (#120566)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120566
Approved by: https://github.com/desertfire
2024-02-26 18:58:30 +00:00
759204253f [export] Change runtime asserts to using assert_scalar (#119608)
By changing runtime symbolic asserts to use assert_scalar, the asserts can call into `expect_true` and modify the shape env so that we can run through the traced graph module with fake tensors. With assert_async, the asserts only get hit at runtime, which means that if we run the graph module with fake tensors, the asserts will not affect the shape env, so later data-dependent calls to the fake tensors may result in GuardOnDataDependentSymNode errors.

https://github.com/pytorch/pytorch/issues/119587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119608
Approved by: https://github.com/ezyang
2024-02-26 17:56:12 +00:00
2fb32a5f07 Enable fake tensor caching in fbcode by default (#118555)
Summary: Enabled by default in OSS; this switches the default to "on" in fbcode too.

Test Plan: Ran torchbench benchmarks in fbcode

Differential Revision: [D53771626](https://our.internmc.facebook.com/intern/diff/D53771626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118555
Approved by: https://github.com/eellison
2024-02-26 17:35:23 +00:00
ee01d0807b [dynamo] Function => FunctionCtx for placeholder obj (#120577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120577
Approved by: https://github.com/yanboliang
2024-02-26 17:16:31 +00:00
7eb7ac815f [inductor] Optimize welford reduction (#120330)
This does two things,
1) Short circuit `welford_reduce` on the first iteration to ignore the accumulator (big win for small `rnumel`)
2) Replace division with multiplication by reciprocal

Currently this is not enough to match two pass reduction with bfloat16 but it is still a significant improvement.
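For context, a small pure-Python sketch of a Welford-style reduction with the two tweaks described above (first-iteration short circuit and multiply-by-reciprocal); this is illustrative only, not the generated kernel code:

```python
def welford_combine(mean, m2, weight, value):
    # One Welford update step: fold a new value into (mean, M2, weight).
    new_weight = weight + 1.0
    delta = value - mean
    new_mean = mean + delta * (1.0 / new_weight)   # reciprocal instead of division
    new_m2 = m2 + delta * (value - new_mean)
    return new_mean, new_m2, new_weight

def welford_reduce(values):
    mean = m2 = weight = 0.0
    for i, v in enumerate(values):
        if i == 0:
            # Short-circuit the first iteration: no need to combine with the
            # zero-initialized accumulator.
            mean, m2, weight = v, 0.0, 1.0
        else:
            mean, m2, weight = welford_combine(mean, m2, weight, v)
    var = m2 / weight if weight else 0.0
    return mean, var

print(welford_reduce([1.0, 2.0, 3.0, 4.0]))  # (2.5, 1.25)
```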

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120330
Approved by: https://github.com/lezcano
2024-02-26 17:01:47 +00:00
c39bbd6def Numbers based TD (#119901)
Convert from a list/bucket based TD system to just a numbers based TD system.  Looks like a massive change but a decent amount of it is tests and removing code.

The main file of interest is interface.py, which GitHub collapses by default due to its size.

The test files pretty much got rewritten entirely since a lot of the old tests are no longer relevant.

Other notable changes:
* Use Frozenset to make TestRun hashable
* Adds tools/test/heuristics/__init__.py to ensure that unittest can discover the tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119901
Approved by: https://github.com/osalpekar, https://github.com/huydhn
2024-02-26 17:01:19 +00:00
86063b4d03 Add torch._print to dynamo trace_rules (#120533)
Fixes #114831

Before:
```
(pytorch10) angelayi@devgpu022 ~/local/pytorch [main] $  python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated
F
======================================================================
FAIL: test_torch_name_rule_map_updated (__main__.TraceRuleTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 2739, in wrapper
    method(*args, **kwargs)
  File "/data/users/angelayi/pytorch/test/dynamo/test_trace_rules.py", line 328, in test_torch_name_rule_map_updated
    self._check_set_equality(
  File "/data/users/angelayi/pytorch/test/dynamo/test_trace_rules.py", line 302, in _check_set_equality
    self.assertTrue(len(x) == 0, msg1)
AssertionError: False is not true : New torch objects: {<built-in method _print of type object at 0x7ff477e40ee0>} were not added to `trace_rules.torch_c_binding_in_graph_functions` or `test_trace_rules.ignored_c_binding_in_graph_function_names`. Refer the instruction in `torch/_dynamo/trace_rules.py` for more details.

To execute this test, run the following from the base repo dir:
     python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.184s

FAILED (failures=1)
```
After change:
```
(pytorch10) angelayi@devgpu022 ~/local/pytorch [main] $  python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated
.
----------------------------------------------------------------------
Ran 1 test in 0.209s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120533
Approved by: https://github.com/clee2000, https://github.com/yanboliang, https://github.com/huydhn, https://github.com/Skylion007
2024-02-26 16:52:59 +00:00
8a32a07856 Revert "Add meta device support to sparse compressed tensors (#120498)"
This reverts commit 5d71ba688563ef491bb28d47c493ec6fc7791da2.

Reverted https://github.com/pytorch/pytorch/pull/120498 on behalf of https://github.com/zou3519 due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/120498#issuecomment-1964491999))
2024-02-26 15:59:36 +00:00
b381a4372b make GPT2ForSequenceClassification pass inference accuracy check (#120537)
We need a higher tolerance for GPT2ForSequenceClassification since if I change --bfloat16 in
```
time python benchmarks/dynamo/huggingface.py --accuracy --inference --bfloat16 --backend inductor --disable-cudagraphs --only GPT2ForSequenceClassification
```
to --float16 or --float32 it will pass the accuracy check.

Adding --freezing can also make the test pass for this model. I think that may be due to a different fusion output being generated (depending on whether constant propagation happens, which is controlled by freezing), causing some small numerical difference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120537
Approved by: https://github.com/jansel
2024-02-26 11:02:57 +00:00
f4cf25bb24 Fix a bug where nn.functional._AllGather.backward produces wrong gradients (#120582)
Summary:
Fixes #120386

`_AllGather.backward` assumes that `_ReduceScatter` would always in-place update the output buffer. However, when the output buffer is non-contiguous, `_ReduceScatter` would allocate and return a different buffer, causing the gradient to be thrown away.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120582
Approved by: https://github.com/XilunWu
2024-02-26 09:58:27 +00:00
c617e7b407 Add resnet50/mobilenet_v2_quantized_qat in into deterministic_algorithms exclusive list (#120384)
After PR https://github.com/pytorch/pytorch/pull/120026, 2 `Torchbench` test cases, `resnet50_quantized_qat` and `mobilenet_v2_quantized_qat`, pass the performance test but fail the accuracy test. The failure msg is:  `mobilenet_v2_quantized_qat, RuntimeError: quantized_resize_cpu_ does not have a deterministic implementation but you set 'torch.use_deterministic_algorithms(True)'. `

- `torch.use_deterministic_algorithms(True)` is only set for the accuracy test. fff9d98e58/benchmarks/dynamo/common.py (L3480)
- However, `quantized_resize_cpu_` only supports nondeterministic algorithms because the resized output memory may be uninitialized. fff9d98e58/aten/src/ATen/native/quantized/cpu/TensorOperators.cpp (L85-L87)

This PR adds these 2 models to the deterministic_algorithms exclusion list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120384
Approved by: https://github.com/desertfire, https://github.com/jgong5
2024-02-26 05:05:43 +00:00
a299db2983 [dynamo][guards-cpp-refactor] NO_HASATTR guard (#120469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120469
Approved by: https://github.com/jansel
2024-02-26 04:37:40 +00:00
1c7b0e7cd1 [inductor][cpp] disable masked load for non-fp data types (#120558)
Fix https://github.com/pytorch/pytorch/issues/120377. We disable the masked load for non-fp data types for now. The complete support of masks will be added in https://github.com/pytorch/pytorch/pull/119654.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120558
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-02-26 04:12:22 +00:00
ea20885d95 [executorch hash update] update the pinned executorch hash (#120264)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120264
Approved by: https://github.com/pytorchbot
2024-02-26 03:55:32 +00:00
c18623b7ed [dynamo] Reland 120147 - - Use EQUALS_MATCH guard for mod.training (#120578)
To fix Memory leak discovered in https://github.com/pytorch/pytorch/issues/112090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120578
Approved by: https://github.com/jansel
2024-02-26 03:49:47 +00:00
685d862c45 Add SparsePrivateUse1 in backend_to_string, layout_from_backend and check_base_legacy_new. (#119263)
1) Use items stored in torch._tensor_classes to check items passed from the Python side;
2) Add SparsePrivateUse1 to backend_to_string, layout_from_backend and check_base_legacy_new;
3) Use a more general API to get the Python module name in the get_storage_obj and get_name functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119263
Approved by: https://github.com/ezyang
2024-02-26 01:54:30 +00:00
4328e772bf [dynamo][guards-cpp-refactor] DICT_VERSION guard (#120416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120416
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342, #120344, #120359
2024-02-25 23:24:24 +00:00
c269e48af0 [dynamo][guards-cpp-refactor] DictGuardManager (#120359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120359
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342, #120344
2024-02-25 23:24:24 +00:00
775a4388d9 [dynamo][guards-cpp-refactor] WEAKREF_ALIVE guard (#120344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120344
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342
2024-02-25 23:24:04 +00:00
5d71ba6885 Add meta device support to sparse compressed tensors (#120498)
As in the title.

Unblocks https://github.com/pytorch/pytorch/pull/117907#discussion_r1499251745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120498
Approved by: https://github.com/ezyang
2024-02-25 16:50:17 +00:00
834c7a1d3e [dynamo][refactor] Move some helper functions to global scope (#120426)
This is to prepare for guard C++ refactor work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120426
Approved by: https://github.com/ezyang
2024-02-25 04:38:20 +00:00
5c7b761f8e Fix default world_size when running on 1 or 0 GPU (#119372)
The mentioned distributed tests would fail if the number of available GPUs isn't sufficient; we need to correct the default world size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119372
Approved by: https://github.com/eqy, https://github.com/fegin
2024-02-25 04:14:34 +00:00
cyy
81f0b2c14e [Clang-tidy header][19/N] Enable clang-tidy on torch/csrc/autograd/profiler_legacy.* (#120552)
This PR enables clang-tidy on torch/csrc/autograd/profiler_legacy.* and cleans some path rules of clang-tidy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120552
Approved by: https://github.com/Skylion007
2024-02-25 03:29:40 +00:00
298c686d3f [dynamo] support group=None when rewriting collectives (#120118)
Resolves case 2 in #120082.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120118
Approved by: https://github.com/wconstab
ghstack dependencies: #120370
2024-02-25 03:12:10 +00:00
3e382456c1 Fix compiler check (#120492)
Fixes #119304

1. Add a try/catch to handle the compiler version check.
2. Retry querying the compiler version info.
3. Return False if the compiler info can't be obtained after two attempts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120492
Approved by: https://github.com/ezyang
2024-02-25 02:41:20 +00:00
0f20cc1e0e Don't use size on TensorVariable when doing out resize test (#120567)
Fixes https://github.com/pytorch/pytorch/issues/120482
Fixes https://github.com/pytorch/pytorch/issues/120511

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120567
Approved by: https://github.com/Skylion007
2024-02-25 02:21:34 +00:00
54c1cf8d8a add distributed checkpoint support for custom device (#120201)
Fixes #120200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120201
Approved by: https://github.com/fegin, https://github.com/wz337
2024-02-24 19:14:29 +00:00
56203fc407 Add profiling for backward (#120540)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120540
Approved by: https://github.com/anijain2305
2024-02-24 16:53:28 +00:00
a17979faa6 [dynamo] add stronger test for dynamo memory leaks (#120459)
This issue was raised by a regression of https://github.com/pytorch/pytorch/issues/112090 caused by https://github.com/pytorch/pytorch/pull/120147.

Make the memory leak test stronger by using weakref to check for model deletion instead of measuring CUDA memory allocation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120459
Approved by: https://github.com/jansel
2024-02-24 16:30:20 +00:00
a62d9184d5 [ET-VK] Move graph runtime from PT directory to ET directory (#120528)
Summary:
## Context

Move Vulkan graph runtime from PyTorch directory to ExecuTorch directory to improve development logistics:

* ExecuTorch delegate changes will no longer require export to PyTorch directory
* Makes it much easier to enable OSS build for Vulkan delegate

Test Plan:
```
LD_LIBRARY_PATH=/home/ssjia/fbsource/third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/executorch/backends/vulkan/test:vulkan_compute_api_test_bin

buck2 run fbcode//executorch/backends/vulkan/test:test_vulkan_delegate
```

Differential Revision: D54133350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120528
Approved by: https://github.com/manuelcandales
2024-02-24 15:00:21 +00:00
1f1bc0e6ac Switch to native functional collective by default (#120370)
This enables native functional collectives by default. After this PR:
- The Python APIs remain backward compatible. Users will receive a deprecation warning if they use `(rank, tags)` as process group identifier.
- Collectives will be captured as `_c10d_functional` ops in post-grad fx graphs. The change will not affect end-users, but it will impact `torch-xla` which has implemented an all-reduce backend based on the existing `c10d_functional` IR. This excludes the migration for `torch-xla` use cases, which will be coordinated separately (see communications in #93173).
- Collectives will be lowered to and codegen'd by new Inductor collective IRs (`ir._CollectiveKernel` and `ir._WaitKernel`). This change will not affect end-users.

Testing performed:
- We have been running a set of representative unit tests with both the new native funcol and the old py funcol in CI. These test will continue to run with the old py funcol after this PR, so they are covered until they are removed.
- Manually verified with e2e llama model training with DTensor + functional collectives (https://github.com/fairinternal/xlformers/tree/pt2_llm/pt2d#create-your-local-development-env).

Fallback mechanism:
- Introduced a temporary environment variable `TORCH_DISABLE_NATIVE_FUNCOL` that allows users to fall back to the previous implementation. We don't expect the migration to break anything; the mechanism is a safety measure to reduce potential disruption in case the PR causes unforeseen breakages.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120370
Approved by: https://github.com/wconstab, https://github.com/yf225
2024-02-24 09:38:26 +00:00
33938cfddd [BE][Ez] Update ruff to 0.2.2 (#120517)
Updates ruff to 0.2.2. This updates the config and handles some of the new rules that have come out of preview.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120517
Approved by: https://github.com/albanD
2024-02-24 07:13:53 +00:00
79f059987e Update find_test_dir() to check for skip files relative to the local path first. (#120521)
The search code to find the dynamo skip files wasn't working properly when used with pytest and multiple files:
```
pytest a.py b.py
```
because pytest would point `__main__` at itself instead of the individual file. (This worked fine when only running a single file test)

Change the scanning code to look for the skip directory relative to its own file first.

While in there, add/update some comments and log a warning when the directory isn't found (instead of hard-crashing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120521
Approved by: https://github.com/oulgen
2024-02-24 03:29:25 +00:00
15add24bf2 fix: set codegen in _SplitterBase partitioner (#120361)
For graphs with a single output, torch.export / torch.compile expects the graph module to return a single torch.Tensor instead of a tuple.
However, after applying the `_SplitterBase` partitioner to such a graph module (obtained from torch.export/torch.compile), the resulting graph module returns a tuple of tensors, in this case `(output,)`.

This PR adds codegen to the graphs produced by `_SplitterBase` partitioner. Setting this will ensure pytree unflatten nodes will be added automatically to handle unflattening of the output to return single outputs directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120361
Approved by: https://github.com/angelayi
2024-02-24 02:27:20 +00:00
3eefe96297 Update scripts/compile_tests/update_failures.py (#120529)
In order to unbreak this script, I have only tested with
```
./scripts/compile_tests/update_failures.py 97918e8c37e649dc8782bb1822ae954bca904d0f
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120529
Approved by: https://github.com/zou3519
2024-02-23 22:15:44 +00:00
b7df3bba62 add decomposition for frexp (#119217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119217
Approved by: https://github.com/peterbell10
ghstack dependencies: #119284, #120027
2024-02-23 21:52:42 +00:00
f7e79299c7 register torch.return_types in torch.fx._pytree (#120027)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120027
Approved by: https://github.com/lezcano, https://github.com/zou3519, https://github.com/XuehaiPan
ghstack dependencies: #119284
2024-02-23 21:52:42 +00:00
c3496d50f0 Fix torch.return_types init signature (#119284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119284
Approved by: https://github.com/peterbell10, https://github.com/XuehaiPan
2024-02-23 21:52:34 +00:00
623632a401 More informative stacklevel for autograd function warning (#120512)
Internal xref:
https://fb.workplace.com/groups/1405155842844877/posts/8064897663537295

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120512
Approved by: https://github.com/albanD
2024-02-23 21:48:55 +00:00
4d2073bc3f [Dynamo] Remove deadcode: unwrapping script_if_tracing (#120444)
After the consolidated ```trace_rules.lookup```, we already unwrap at
2240018c03/torch/_dynamo/variables/builder.py (L712)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120444
Approved by: https://github.com/anijain2305
2024-02-23 21:22:09 +00:00
8e20385447 [c10d] fix the macro definition of NCCL_COMM_DUMP (#120502)
Summary:
Only if both macros are defined should we emit the comm dump;
otherwise, use the original definition.

The previous implementation was missing the function definition when IS_NCCL_EXP is defined but NCCL_COMM_DUMP is not defined.

Test Plan:
Build and unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120502
Approved by: https://github.com/dsjohns2, https://github.com/Skylion007
2024-02-23 20:59:39 +00:00
7cd623aa89 Remove monkey-patch for torch.utils._rebuild_tensor (#120446)
Not needed after #108186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120446
Approved by: https://github.com/titaiwangms, https://github.com/BowenBao
2024-02-23 20:42:50 +00:00
ed0ea2f30b add export to torch.jit.__all__ (#120432)
I use pyright in VS Code. When I use `@torch.jit.export`, I always see an annoying error saying `export` is not exported.

![image](https://github.com/pytorch/pytorch/assets/9496702/f7b0e17f-6497-4f9a-87dd-55dc627156c3)

Adding it to `__all__` should fix it.

I have seen #92240 and #101678, and I am not sure why `export` is not there. cc @ringohoffman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120432
Approved by: https://github.com/eellison
2024-02-23 20:37:09 +00:00
e29eb39e04 [EZ] Fix typo in gcc version detection (#120489)
It should be `FATAL_ERROR` rather than `FATAL`

I wish cmakelint would have detected it

Also, downgrade this check to 9.3, as all our binary builds are using 9.3 at the moment (will update in a followup PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120489
Approved by: https://github.com/DanilBaibak, https://github.com/Skylion007
2024-02-23 20:31:21 +00:00
007606e520 [dynamo][guards-cpp-refactor] TENSOR_MATCH guard (#120342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120342
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096
2024-02-23 20:10:09 +00:00
4b65d192f0 [dynamo][guards-cpp-refactor] DYNAMIC_INDICES guard (#120096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120096
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093
2024-02-23 20:10:09 +00:00
a92ce46dc3 [dynamo][guards-cpp-refactor] GlobalWeakRefGuardAccessor (#120093)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120093
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123
2024-02-23 20:10:01 +00:00
bb331b1eb5 [dynamo][guards-cpp-refactor] LENGTH_CHECK guard (#120123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120123
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119
2024-02-23 20:09:52 +00:00
2eac593ffd [dynamo][guards-cpp-refactor] TUPLE_ITERATOR_LEN guard (#120119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120119
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091
2024-02-23 20:09:43 +00:00
da95421f05 [dynamo][guards-cpp-refactor] TupleIteratorGetItemAccessor (#120091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120091
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089
2024-02-23 20:09:34 +00:00
39f0a5ecc9 [c10d] simplify the dump timeout logic and unify the async call (#120331)
Summary:
The current dump timeout logic is a bit cumbersome as it needs two times: 1. the
timeout, 2. the wake-up time. In theory, the caller just needs to wait
for at most the timeout value for the dump and then declare the dump to be
either successful or not. We also unify the async calls using std::async
instead of a customized async launch function for each operation.
Test Plan:
Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120331
Approved by: https://github.com/wconstab
2024-02-23 19:46:40 +00:00
8646872ff7 Make balance_gradient preserved in export (#120332)
Summary: We can only avoid decomposing CompositeImplicit custom ops that are functional. From the looks of the implementation, this op looks functional, so the fix is just fixing the schema.

Test Plan: CI

Differential Revision: D54019265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120332
Approved by: https://github.com/zhxchen17
2024-02-23 19:14:08 +00:00
182ed1e32c Use a dtype property in torch inductor nodes (#119227)
I usually forget to do `x.get_dtype()` and I type `x.dtype`. Similarly for `layout, device, sizes`. What do you think about making them properties?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119227
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-02-23 18:40:03 +00:00
d54121d13f Increase bazel CUDA tests timeout to 480s (#120443)
One of the bazel CUDA tests, `//:modules_test`, frequently times out in trunk, so I'm increasing the timeout value to 480s (https://bazel.build/reference/test-encyclopedia) to see if it helps fix the issue. Bazel CPU tests already use this value.

Here is an example timeout https://github.com/pytorch/pytorch/actions/runs/8009308009/job/21877698886#step:13:3316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120443
Approved by: https://github.com/clee2000
2024-02-23 18:32:35 +00:00
6b35415a54 Create a sentinel file for each dynamo test skips (Part 2) (#120501)
[no ci]

tested on https://github.com/pytorch/pytorch/pull/120451
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120501
Approved by: https://github.com/clee2000
ghstack dependencies: #120500
2024-02-23 18:25:30 +00:00
cffdd642a9 Create a sentinel file for each dynamo test skips (Part 1) (#120500)
[no ci]

tested on https://github.com/pytorch/pytorch/pull/120451
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120500
Approved by: https://github.com/clee2000
2024-02-23 18:25:30 +00:00
2120f65174 [AT-VK][EZ] Move ops to dedicated folder (#120364)
These ops are at the level of the OperatorRegistry from the previous change. All ExecuTorch ops will go here.
```
ATen/native/vulkan/graph/ops
```
They are not to be confused with the general ATen ops from `native_functions.yaml` that will continue to exist. All PyTorch ops are here.
```
ATen/native/vulkan/ops
```

To help think around this split, note that we can actually implement the latter ATen ops with the former OperatorRegistry ops, since it's currently a subset.

Differential Revision: [D54030933](https://our.internmc.facebook.com/intern/diff/D54030933/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120364
Approved by: https://github.com/SS-JIA
ghstack dependencies: #120362, #120363
2024-02-23 18:11:09 +00:00
6d920dd3c6 [ET-VK][Op Redesign][2/n] Introduce OperatorRegistry (#120363)
TSIA

Differential Revision: [D53982439](https://our.internmc.facebook.com/intern/diff/D53982439/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120363
Approved by: https://github.com/SS-JIA
ghstack dependencies: #120362
2024-02-23 18:07:59 +00:00
3e2ac1f094 [AT-VK][EZ] Define OpNode constructor (#120362)
Instead of using `emplace_back()`. This will be useful throughout the rest of the stack.

Differential Revision: [D53982443](https://our.internmc.facebook.com/intern/diff/D53982443/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120362
Approved by: https://github.com/SS-JIA
2024-02-23 18:05:17 +00:00
232f09e0ea Add copy of scripts for setting up s390x workers (#120417)
This PR contains the scripts used to produce a self-hosted s390x worker.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120417
Approved by: https://github.com/malfet
2024-02-23 17:01:44 +00:00
3b944113c8 Add torch.ops.aten.print (#120295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120295
Approved by: https://github.com/zou3519
2024-02-23 17:01:22 +00:00
cyy
97918e8c37 [Clang-tidy header][18/N] Enable clang-tidy on headers in torch/csrc/cuda (#118504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118504
Approved by: https://github.com/albanD
2024-02-23 16:47:33 +00:00
2892d2f31b Revert "[inductor] Optimize welford reduction (#120330)"
This reverts commit 4c6ba16f825ca7b99133efca95da0b7112add66b.

Reverted https://github.com/pytorch/pytorch/pull/120330 on behalf of https://github.com/jeffdaily due to broke ROCm CI while ROCm was in unstable status ([comment](https://github.com/pytorch/pytorch/pull/120330#issuecomment-1961623739))
2024-02-23 16:24:52 +00:00
2c85c9e77e [Memory Snapshot] Add Total memory used after allocation in Trace View (#120339)
Summary: Being able to see max allocated helps improve user experience with memory snapshots.

Test Plan:
Before:
![image](https://github.com/pytorch/pytorch/assets/17602366/534001fa-2fbe-4fc5-bd48-cd82f3277941)

After:
![image](https://github.com/pytorch/pytorch/assets/17602366/f8b9a7bc-3a34-4e38-82cb-f766e54b3fd2)

Reviewed By: zdevito

Differential Revision: D53953648

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120339
Approved by: https://github.com/zdevito
2024-02-23 16:17:14 +00:00
d9db9e62e3 Describe special case in avgpool (#120335)
Fixes #116420

AvgPool1d, AvgPool2d and AvgPool3d now include in their descriptions the special case when `ceil_mode` is True and the last window starts outside the tensor.
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120335
Approved by: https://github.com/mikaylagawarecki
2024-02-23 15:29:54 +00:00
cef9f70f4b Move torchbench model configuration into a YAML file. (#120299)
This PR moves other aspects of torchbench's model configuration (e.g. batch size,
tolerance requirements, etc.) into a new YAML file: `torchbench.yaml`. It also merges the
recently added `torchbench_skip_models.yaml` file inside the `skip` key.

This is an effort so that external consumers are able to easily replicate the performance
results and coverage results from the PyTorch HUD.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120299
Approved by: https://github.com/jansel
2024-02-23 14:00:14 +00:00
54bac042e7 Fix error in examples of torch.linalg.lu_factor (#120484)
Found an error in the doc of `torch.linalg.lu_factor` related to `torch.linalg.lu_solve`; also fix a Sphinx issue along the way.
```Python traceback
TypeError: linalg_lu_solve(): argument 'LU' (position 1) must be Tensor, not torch.return_types.linalg_lu_factor
```
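For reference, the intended usage unpacks the returned named tuple before calling `lu_solve` (a minimal sketch):

```python
import torch

A = torch.randn(3, 3)
B = torch.randn(3, 2)
LU, pivots = torch.linalg.lu_factor(A)      # unpack instead of passing the named tuple
X = torch.linalg.lu_solve(LU, pivots, B)
print(torch.allclose(A @ X, B, atol=1e-5))  # True, up to numerical error
```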
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120484
Approved by: https://github.com/lezcano
2024-02-23 13:19:04 +00:00
b96ea097ee [aotinductor] rename CppWrapperCodeGen and CudaWrapperCodeGen (#120391)
make WrapperCodeGen subclass names consistent with the
file names:

CppWrapperCodeGen -> CppWrapperCpu
CudaWrapperCodeGen -> CppWrapperCuda

Differential Revision: [D54074938](https://our.internmc.facebook.com/intern/diff/D54074938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120391
Approved by: https://github.com/aakhundov
2024-02-23 10:41:50 +00:00
72fec96e59 fix no shard state dict loading (#120367)
Summary: fix no shard state dict loading

Test Plan: CI tests

Differential Revision: D51058607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120367
Approved by: https://github.com/fegin
2024-02-23 07:25:43 +00:00
9e9eaf0032 [CUDA] Workaround register spilling issue in mem-efficient SDP kernels on sm60 (#120445)
We're seeing that a newer version of CUDA introduces register spilling behavior for a few kernels on Pascal---this PR works around them for this specific version.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120445
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-02-23 06:06:37 +00:00
edf1c4e552 [Dynamo] Handle guard_size_oblivious in user code (#120379)
Fixes https://github.com/pytorch/pytorch/issues/120083

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120379
Approved by: https://github.com/yanboliang
2024-02-23 05:38:57 +00:00
a5548c6886 Create a sentinel file for each dynamo test failure (#120355)
Created via
```
import os
current_dir = os.path.dirname(os.path.abspath(__file__))
directory = os.path.join(current_dir, 'dynamo_expected_failures')
for name in dynamo_expected_failures:
    path = os.path.join(directory, name)
    with open(path, 'w') as fp:
        pass
```

Differential Revision: [D54036062](https://our.internmc.facebook.com/intern/diff/D54036062)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120355
Approved by: https://github.com/aorenste, https://github.com/yanboliang
2024-02-23 05:22:11 +00:00
2240018c03 Construct c10::Half from float16_t on ARMv8 (#120425)
By hiding float32 constructors and exposing float16 ones, this allows the compiler to do implicit conversions as needed and, in safe cases, to optimize out unneeded upcasts to fp32; see the example [below](https://godbolt.org/z/5TKnY4cos)
```cpp
#include <arm_neon.h>

#ifndef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
#error Ieeee
#endif

float16_t sum1(float16_t x, float16_t y) {
    return x + y;
}

float16_t sum2(float16_t x, float16_t y) {
    return static_cast<float>(x) + static_cast<float>(y);
}
```
Both sum variants are compiled to a scalar fp16 add if built for a platform that supports fp16 arithmetic
```
sum1(half, half):                            // @sum1(half, half)
        fadd    h0, h0, h1
        ret
sum2(half, half):                            // @sum2(half, half)
        fadd    h0, h0, h1
        ret
```

Fixes a build error after #119483 in some aarch64 configurations that are defined as supporting FP16 but don't define _Float16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120425
Approved by: https://github.com/mikekgfb, https://github.com/atalman, https://github.com/snadampal
2024-02-23 04:22:45 +00:00
eqy
3f6be7696b [cuDNN][cuDNN RNNv8 API] Fix math type behavior in cuDNN RNN (#120277)
Adds back `CUDNN_TENSOR_OP_MATH` which was erroneously dropped by #115719

CC @malfet @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120277
Approved by: https://github.com/drisspg
2024-02-23 04:11:14 +00:00
36c1cc962a Update cutlass from 3.3.0 to 3.4.1 (#120434)
### COPY OF https://github.com/pytorch/pytorch/pull/120010

### Update
I have rolled the two blocking changes into this PR, I also imported this to fbcode to verify that nothing is breaking:
D53870253

This copy was generated by merging in all the internal only changes into one merged atomic commit and re-exporting to github

### Current Status
- [PR](https://github.com/pytorch/pytorch/pull/118935) aims to update the flash attention kernels to a more recent version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120434
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-02-23 03:57:26 +00:00
cyy
f609f2050f [structural binding][6/N] Replace std::tie with structural binding (#120353)
This PR follows https://github.com/pytorch/pytorch/pull/119774, it is a continued work to clean up std::tie.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120353
Approved by: https://github.com/albanD
2024-02-23 03:38:40 +00:00
3426c6f559 update the tensor.scatter_ doc (#120169)
Fixes #119543

- doc fixed so that `reduce` is documented as a kwarg (see below for details)
- doc added another interface `(int dim, Tensor index, Number value, *, str reduce)` where
the full signature in the pyi file after build is
```
def scatter_(self, dim: _int, index: Tensor, value: Union[Number, _complex], *, reduce: str) -> Tensor:
```
This can be further verified in
02fb043522/aten/src/ATen/native/native_functions.yaml (L8014)

Therefore, the value can be int, bool, float, or complex type.

Besides the issue mentioned in #119543, `reduce` should be a kwarg, as shown below
```
 * (int dim, Tensor index, Tensor src)
 * (int dim, Tensor index, Tensor src, *, str reduce)
 * (int dim, Tensor index, Number value)
 * (int dim, Tensor index, Number value, *, str reduce)
 ```
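For illustration, a minimal use of the scalar-value overload with the keyword-only `reduce` (assuming the overload behaves as listed above):

```python
import torch

x = torch.ones(3, 5)
index = torch.tensor([[0, 1, 2]])          # index has the same number of dims as x
x.scatter_(1, index, 2.0, reduce="add")    # reduce must be passed as a keyword argument
print(x[0])                                # tensor([3., 3., 3., 1., 1.])
```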

The test case for a scalar value is already implemented in

70bc3b3be4/test/test_scatter_gather_ops.py (L86)

so no additional test case is required.

@mikaylagawarecki  @janeyx99

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120169
Approved by: https://github.com/mikaylagawarecki
2024-02-23 02:51:55 +00:00
bb6f50929b Fix lint after https://github.com/pytorch/pytorch/pull/105590 (#120461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120461
Approved by: https://github.com/Skylion007
2024-02-23 02:45:23 +00:00
2b0168aeb0 [c10d] update the work progress of PG periodically (#120438)
Summary:
Previously, I added lastEnqueuedSeq_ and lastCompletedSeq_ to store the states of PG progress,
but they were logged only when a timeout was detected.

We found this is not enough, since the 'straggler' itself might not detect
the timeout, and hence there would be no log from the 'straggler'.

In this PR, we log these states periodically so that it is
much easier for us to identify the straggler by checking which rank
has the smallest lastEnqueuedSeq_.
Test Plan:
Log adding, build success

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120438
Approved by: https://github.com/wconstab, https://github.com/XilunWu, https://github.com/kwen2501
2024-02-23 01:40:43 +00:00
8f4ffd3d8a [HigherOrderOp] makes control flow operators respect global decomp table (#120412)
A follow-up to @zou3519's comment on https://github.com/pytorch/pytorch/pull/120366. We create a helper method for this purpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120412
Approved by: https://github.com/zou3519
2024-02-23 00:10:20 +00:00
156954d6a2 [Inductor] Add support for NEON ISA in the Inductor C++ backend (#105590)
Fixes #104729

As suggested in the [blog](https://dev-discuss.pytorch.org/t/torchinductor-update-5-cpu-backend-backend-performance-update-and-deep-dive-on-key-optimizations/1117#:~:text=It%20can%20be,sub%2Dclasses.), I subclassed the `VecISA` class and implemented a NEON version of the `vec_reduce_all()` function, to go along with the existing AVX2 and AVX512 versions. Any operation that calls `vec_reduce_all()` will also take the NEON path and benefit from its vectorization.

The `vec_reduce_all()` is invoked by Softmax and other operations like norms. Using the fast path results in 30% time savings for Softmax as compared to the previously taken slow path.

  | Slow path | Fast path (NEON intrinsics)
-- | -- | --
Softmax (100 passes, 1024 dimension) | 623.706ms | 452.011ms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105590
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-02-22 23:55:35 +00:00
4c6ba16f82 [inductor] Optimize welford reduction (#120330)
This does two things,
1) Short circuit `welford_reduce` on the first iteration to ignore the accumulator (big win for small `rnumel`)
2) Replace division with multiplication by reciprocal

Currently this is not enough to match two pass reduction with bfloat16 but it is still a significant improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120330
Approved by: https://github.com/lezcano
2024-02-22 23:54:24 +00:00
722afe6171 Revert "[dynamo] Use EQUALS_MATCH guard for mod.training (#120147)"
This reverts commit b642a18e8056287b0e5768f631dd03e0326a8b11.

Reverted https://github.com/pytorch/pytorch/pull/120147 on behalf of https://github.com/williamwen42 due to memory leak, see https://github.com/pytorch/pytorch/issues/112090 ([comment](https://github.com/pytorch/pytorch/pull/120147#issuecomment-1960522018))
2024-02-22 23:46:55 +00:00
3588e7f265 Ignore .numpy() under FakeTensorMode() (#120261)
Fixes #120259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120261
Approved by: https://github.com/jansel
2024-02-22 22:49:20 +00:00
f9eb66e16d [BE][EZ] Flatten preprocessor hierarchy (#120422)
Instead of
```cpp
#if defined(foo)
#else
#if defined(bar)
#else
#endif
#endif
```
use
```cpp
#if defined(foo)
#elif defined(bar)
#else
#endif
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120422
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/Skylion007
2024-02-22 22:38:08 +00:00
1c7ba330b2 [BE][optim] Simplify _init_group. (#120055)
This version is more concise and avoids second lookup in case `momentum_buffer` is in the `state`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120055
Approved by: https://github.com/janeyx99
2024-02-22 22:15:01 +00:00
5603d95375 [DeviceMesh] Ensure mesh tensor is a cpu tensor (#120046)
More discussion in the last comment in https://github.com/pytorch/pytorch/pull/118614

In general, users won't pass a CUDA tensor to DeviceMesh, as the mesh tensor is just a way to construct a mesh and doesn't require CUDA compute. Taking a suggestion from @awgu, we enforce that the tensor is a CPU tensor if it is not already, so that we can prevent a device sync.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120046
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2024-02-22 22:03:13 +00:00
c11bd724fe [ROCm] replace ROCmLoops.cuh with hipified CUDALoops.cuh (#120101)
The intent of this change was to minimize code differences between CUDA and ROCm while maintaining or improving performance.

Verified new performance using pytorch/benchmarks/operator_benchmark.

```
python -u -m pt.unary_test --tag-filter all --device cuda
python -u -m pt.binary_test --tag-filter all --device cuda
```

On MI200 this improved performance on average 3%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120101
Approved by: https://github.com/albanD
2024-02-22 21:57:36 +00:00
77692736d1 Use privateuseone during external module register test (#120399)
Fixes #120397

Use privateuseone instead of xpu in test_external_module_register.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120399
Approved by: https://github.com/albanD, https://github.com/malfet
2024-02-22 21:32:59 +00:00
edd03f975f highlight readme code block (#120228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120228
Approved by: https://github.com/mikaylagawarecki
2024-02-22 21:23:08 +00:00
1eae8950b9 [Dynamic] Fix dynamic shape size inspection bug (#120341)
Fixes #120198

Differential Revision: [D54035984](https://our.internmc.facebook.com/intern/diff/D54035984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120341
Approved by: https://github.com/ezyang
2024-02-22 21:08:28 +00:00
11e4a9266d Temporarily support ranks + tag as pg identifier in native funcol (#120226)
As communicated in https://github.com/pytorch/pytorch/issues/93173#issuecomment-1907095208, although we are dropping `(ranks, tag)` as group identifier in funcols, there will be a grace period for migration. This PR adds temporary `(ranks, tag)` support in native funcols. It also helps us decouple the py funcol -> native funcol transition from the API change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120226
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #120042, #120043, #120070
2024-02-22 20:24:16 +00:00
5a3e19578f Make tests using CommDebugMode work for both legacy and native funcol (#120070)
We have many tests that use CommDebugMode to verify the occurrence of collectives. These tests do so by querying comm_counts with legacy funcol ops as key. For the purpose of native funcol migration, we need these tests to work for both legacy and native funcol. To avoid the need to modify all tests to accommodate the two implementations, we make CommDebugMode translate native funcol ops into legacy funcol ops until the migration finishes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120070
Approved by: https://github.com/wconstab, https://github.com/wanchaol
ghstack dependencies: #120042, #120043
2024-02-22 20:24:15 +00:00
a4c5f48b11 Prepare test_dtensor.py for native funcol migration (#120043)
This file contains representative tests that we would like to run with both funcol impls during the migration period. Marking them as `@run_with_both_funcol_impls`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120043
Approved by: https://github.com/wanchaol
ghstack dependencies: #120042
2024-02-22 20:24:15 +00:00
1c9fc720ae Change the .clone() in native funcol's all_reduce to use at::MemoryFormat::Contiguous (#120042)
Summary:
While I think it probably makes more sense to only require `all_reduce` input to be non-overlapping and dense, today `ProcessGroupNCCL` requires it to be contiguous. This is also what the `all_reduce` in non-native funcol does.
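A small standalone illustration of the contiguity point above (no process group needed):

```python
import torch

x = torch.randn(4, 4).t()    # a transposed view: non-overlapping and dense, but not contiguous
print(x.is_contiguous())      # False
y = x.clone(memory_format=torch.contiguous_format)
print(y.is_contiguous())      # True: matches what ProcessGroupNCCL's all_reduce expects
```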

Also marking a test affected by this with `@run_with_both_funcol_impls`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120042
Approved by: https://github.com/wanchaol
2024-02-22 20:24:15 +00:00
7b8f6736d1 [cond] make sure subgraphs in cond are decomposed according to current decomp table (#120366)
Fixes https://github.com/pytorch/pytorch/issues/120160. The issue is because previously cond doesn't pass in the global decomposition table in ProxyMode. This PR adds the current_decomposition_table to the recursive make_fx call.

Test Plan:
see added tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120366
Approved by: https://github.com/aakhundov, https://github.com/jansel
2024-02-22 20:06:46 +00:00
680cfec295 Fix the default value of side in torch.searchsorted (#120066)
Fixes #119999, currently the [doc](https://pytorch.org/docs/stable/generated/torch.searchsorted.html#torch.searchsorted) shows the default value of `side = "left"`
<img width="600" alt="Screenshot 2024-02-16 at 10 36 08 AM" src="https://github.com/pytorch/pytorch/assets/7495155/e7d159aa-4985-4f50-9d81-6e71c3116c0d">
while the [implementation ](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml#L11247) gives the default value of `side = c10::nullopt`.

- fix the [torch doc](https://github.com/pytorch/pytorch/blob/main/torch/_torch_docs.py#L13782) such that the default value of side is None.

- fix the [comment in cpp](4dc75f9084/aten/src/ATen/native/Bucketization.cpp (L19)) such that the default value of side is None.
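A quick illustration of the default matching `side='left'` (expected outputs shown as comments, assuming the documented semantics):

```python
import torch

seq = torch.tensor([1, 3, 5, 7, 9])
vals = torch.tensor([3, 6, 9])

print(torch.searchsorted(seq, vals))                # tensor([1, 3, 4]): default behaves like side='left'
print(torch.searchsorted(seq, vals, side="right"))  # tensor([2, 3, 5])
```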

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120066
Approved by: https://github.com/malfet
2024-02-22 19:35:17 +00:00
c37d07a1bc [FSDP2] Removed super().__setattr__ call (#120340)
`nn.Module.__setattr__` does not actually call `super().__setattr__()`. If we make this call in our fast path, then we will inadvertently set the parameter as an actual attribute on the module, not just as an entry in the `_parameters` dict. This can lead to a bug where after replacing the parameters on the module (e.g. via `to_empty()` from meta device), we now have both an actual attribute (old) and a new entry in `_parameters` (new). Trying to access the parameter would give the old one since Python only resolves `__getattr__` if normal attribute lookup fails.
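A minimal sketch of the shadowing behavior described above, using plain `nn.Module` semantics (not the FSDP2 code itself):

```python
import torch
import torch.nn as nn

m = nn.Linear(2, 2)
old = m.weight
# Simulate the bug: also set the parameter as a *real* instance attribute,
# bypassing nn.Module.__setattr__'s _parameters bookkeeping.
object.__setattr__(m, "weight", old)
# Swap in a new parameter via the _parameters dict (a to_empty()-style replacement).
m._parameters["weight"] = nn.Parameter(torch.zeros(2, 2))
# Normal attribute lookup finds the stale instance attribute first, so
# __getattr__ (which consults _parameters) never runs.
print(m.weight is old)  # True: the old parameter shadows the new entry
```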

The bug was exercised in the following PR. I wanted to land this bug fix separately.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120340
Approved by: https://github.com/yifuwang
ghstack dependencies: #120231
2024-02-22 19:33:57 +00:00
2ba798df60 [inductor] decompose memory bound mm (#120047)
Summary:
Decompose memory bound mm/bmm.
Linear decomposition result:  D53502768
BMM decomposition result: D53148650
 We should only decompose when:
1) bmm: b is large and m, n, k are relatively small;
2) mm/addmm: m is large and n, k are relatively small, e.g. the mm for the input gradient in a linear backward should not be decomposed since m is small and n is large.
We need to conduct more experiments to see if we can find a better strategy for decomposition. I have tried a linear regression model (see the bento results), which does not fit well. For the short term, we use heuristics to determine when to decompose.
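For intuition, the kind of decomposition this refers to rewrites a matmul as a broadcasted multiply plus a sum reduction; this is a generic sketch, not the exact pattern matched by the Inductor pass:

```python
import torch

a = torch.randn(1024, 16)   # m is large
b = torch.randn(16, 8)      # k and n are small: the mm is memory bound, not compute bound

mm = a @ b
decomposed = (a.unsqueeze(2) * b.unsqueeze(0)).sum(dim=1)   # (m, k, 1) * (1, k, n), reduced over k

print(torch.allclose(mm, decomposed, atol=1e-5))  # True
```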

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm
```

COFFEE APS mc0:
baseline: aps-lsf-0124-bf16-267ccb7a0d
decompose: aps-lsf-0124-bf16-4e3824db40

FIRST AFOC pyper mc1

Differential Revision: D53602514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120047
Approved by: https://github.com/mengluy0125
2024-02-22 19:29:51 +00:00
ce807c17c0 modify comment of SparseTensor coalesce (#120221)
Found that the comment of coalesce is incorrect; modify it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120221
Approved by: https://github.com/mikaylagawarecki
2024-02-22 19:24:53 +00:00
bb72bfe2ac Add code example for torch.stack() (#120304)
Fixes #120303
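For reference, the kind of example this adds (illustrative; not necessarily the exact snippet in the PR):

```python
import torch

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])

print(torch.stack((a, b)))          # shape (2, 3): stacks along a new leading dimension
print(torch.stack((a, b), dim=1))   # shape (3, 2)
```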

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120304
Approved by: https://github.com/albanD
2024-02-22 18:30:30 +00:00
ca64f7cbb8 Fix rendering in the doc of PackedSequence (#120385)
Correct a typo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120385
Approved by: https://github.com/albanD
2024-02-22 18:29:12 +00:00
a77226aa49 [inductor] improve kernel metadata logging (#120274)
Log a few more fields
- num_atomic_add: perf of kernels using atomic_add is usually data dependent. Our benchmarking code generates all indices as 0, which will result in worse perf than reality.
- kernel_args_num_gb: an estimate of the amount of reads/writes for kernel args. In-place args will be double counted. If we have a good estimate, this should be a lower bound on the memory access that the GPU performs. Sometimes the GPU will do more memory access since a single buffer may be accessed multiple times (e.g. for softmax when the input tensor is quite large; cache only helps a bit here). With this logged, and if we augment the metadata with the amount of memory the GPU actually accessed, it would be nice to dig into kernels where the GPU accesses more memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120274
Approved by: https://github.com/jansel
ghstack dependencies: #120266
2024-02-22 18:28:05 +00:00
b88621040a [profiler] Add kineto init delay when used in daemon mode (#120276)
Fixes #112389

## About

PyTorch (Kineto) profiler registers with the profiling daemon Dynolog to enable on-demand profiling. The user should only need to set the env variable `KINETO_USE_DAEMON`. To enable this, we need to initialize the kineto library early rather than lazily on a PyTorch profiler call. This initialization happens in a static initializer.
- Kineto init function basically registers a callback using the CUDA CUPTI library https://github.com/pytorch/kineto/blob/main/libkineto/src/init.cpp#L130-L148
- However, the above needs the dynamic linking to libcupti.so to have taken place.
- I understand now that static initializers of compilation units run before the dynamic linking, leading to the segfault in #112389

![image](https://github.com/pytorch/pytorch/assets/6922212/29c9e79b-8080-4198-aaae-8a5696dccaec)

## Workaround
We add a delay in the initialization that can be configured using the env variable 'KINETO_DAEMON_INIT_DELAY_S'. This may not be the best solution, but it could help resolve the issue.

## Testing
Tested this out with [linear_model_example.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py)
First export the daemon env variable

### Without any delay
```
>$ python3 linear_model_example.py

INFO:2024-02-21 19:34:50 2366287:2366287 init.cpp:131] Registering daemon config loader, cpuOnly =  1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclientb8f91363-d8d6-47a7-9103-197661e28397 status = initialized
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
99 1385.468505859375
```

### With 5 seconds delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=5 python3 linear_model_example.py

cpu
99 284.82305908203125
10099 8.817167282104492
INFO:2024-02-21 19:34:26 2359155:2359214 init.cpp:131] Registering daemon config loader, cpuOnly =  1
ERROR: External init callback must run in same thread as registerClient (1782580992 != -1922169024)
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient49270a3f-e913-4ea6-b9e0-cc90a853a869 status = initialized
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
20099 8.817167282104492
```

### With an invalid delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=abc python3 linear_model_example.py

INFO:2024-02-21 19:35:02 2369647:2369647 init.cpp:131] Registering daemon config loader, cpuOnly =  1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0e12a349-af7b-4322-901d-1ff22f91fd4c status = initialized
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
```

### Unit test updated as well.

## Impact
This should not impact any general user. The initialization only occurs if `KINETO_USE_DAEMON` is set in the environment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120276
Approved by: https://github.com/anupambhatnagar, https://github.com/aaronenyeshi
2024-02-22 18:17:33 +00:00
be0ee93467 [pytree] support X | Y union type in tree_map_only (#120389)
Follow-up PR for #119974 with some small tweaks.

1. Support `X | Y` union type for Python 3.10+ (see the sketch after this list)
2. Enable predicate function in `tree_map_only` in CXX pytree.
3. Remove unnecessary function definition.

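A small sketch of the new union-type filter, assuming `tree_map_only` keeps its `(type_or_types, func, tree)` argument order:

```
import torch
from torch.utils._pytree import tree_map_only

tree = {"a": 1, "b": (2.5, torch.tensor([3]))}
# On Python 3.10+ an `X | Y` union can now be used as the type filter:
# ints and floats are doubled, the tensor leaf is left untouched.
doubled = tree_map_only(int | float, lambda x: x * 2, tree)
print(doubled)  # {'a': 2, 'b': (5.0, tensor([3]))}
```
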
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120389
Approved by: https://github.com/zou3519
2024-02-22 18:17:13 +00:00
65627cfd6a [dtensor] implement scaled dot product attention (flash-attention) (#120298)
as titled, this PR implements the sdpa flash attention op in DTensor

Adding flash attention first but efficient attention and other attention
ops should be similar

fixes https://github.com/pytorch/pytorch/issues/120333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120298
Approved by: https://github.com/XilunWu
ghstack dependencies: #120297
2024-02-22 17:53:47 +00:00
f2452e98a6 Revert "Native Half on ARM (#119483)"
This reverts commit 8f3fd79b23d483e846537b62f49111696d117870.

Reverted https://github.com/pytorch/pytorch/pull/119483 on behalf of https://github.com/malfet due to Broke nightly arm builds (and will be breaking runtime), as F16 arithmetic is ARMv8.2 only, see https://github.com/pytorch/pytorch/actions/runs/8000968963/job/21851281141 ([comment](https://github.com/pytorch/pytorch/pull/119483#issuecomment-1959944948))
2024-02-22 17:41:55 +00:00
c7328602ed [ROCm] enable tests test_sampled_addmm_autograd_cuda_*, test_sample… (#117501)
These tests PASS on ROCM 5.6+ now:

- test_sampled_addmm_autograd_cuda_complex128
- test_sampled_addmm_autograd_cuda_complex64
- test_sampled_addmm_autograd_cuda_float32
- test_sampled_addmm_autograd_cuda_float64
- test_sampled_addmm_cuda_complex128
- test_sampled_addmm_cuda_complex64
- test_sampled_addmm_cuda_float32
- test_sampled_addmm_cuda_float64
- test_autograd_dense_output_addmm_cuda_float64
- test_autograd_dense_output_addmv_cuda_float64
- test_autograd_dense_output_mv_cuda_float64

@pruthvistony @jithunnair-amd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117501
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-02-22 17:24:25 +00:00
1c1028ac49 [DCP] Adds utility for converting torch save to dcp (#119815)
as title

Differential Revision: [D53718040](https://our.internmc.facebook.com/intern/diff/D53718040/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119815
Approved by: https://github.com/fegin
ghstack dependencies: #119813, #119814
2024-02-22 17:22:11 +00:00
aae7ccd2d5 [FSDP2] disable compile in broken unit tests (#120358)
The following unit tests are broken in the original commit; revert to keep trunk healthy. We will add them back once the root cause is figured out.
```
python test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_param_registration
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120358
Approved by: https://github.com/awgu, https://github.com/Skylion007
2024-02-22 17:17:23 +00:00
1ab441a7dd [DCP] Adds utility for converting dcp to torch save format (#119814)
as title

Differential Revision: [D53718042](https://our.internmc.facebook.com/intern/diff/D53718042/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119814
Approved by: https://github.com/fegin
ghstack dependencies: #119813
2024-02-22 16:55:58 +00:00
e0a7b024b0 [ROCm] Skip test_parity* unit tests in test_foreach only if ROCm version < 6.0 (#117301)
Skip test_parity* unit tests in test_foreach.py on ROCm only if ROCm version < 6.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117301
Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang
2024-02-22 16:21:09 +00:00
de60050801 [inductor] Colorization improvements for bandwidth profiler (#120343)
A couple things:
* Don't colorize output to the log file
* Don't repeatedly warn if colorama isn't installed

Differential Revision: [D54027075](https://our.internmc.facebook.com/intern/diff/D54027075/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120343
Approved by: https://github.com/Chillee, https://github.com/shunting314
2024-02-22 15:25:46 +00:00
03f7235caa [Dynamo] Fix dynamo trace rules (#120371)
```test_trace_rules.py``` is still failing due to this.

Fixes https://github.com/pytorch/pytorch/issues/114831
(Having this here will run the disabled test on the PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120371
Approved by: https://github.com/drisspg, https://github.com/huydhn
2024-02-22 14:32:00 +00:00
0e4bd25a33 [inductor] When generating debug logs don't fail if nvcc not found (#120346)
If nvcc isn't found subprocess throws a CalledProcessError

Differential Revision: [D54028438](https://our.internmc.facebook.com/intern/diff/D54028438/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120346
Approved by: https://github.com/Skylion007, https://github.com/shunting314
2024-02-22 14:25:34 +00:00
c2b2e57032 Intel GPU Runtime Upstreaming for Guard (#118523)
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the 5th runtime component we would like to upstream is `Guard`. We will cover device guard and stream guard in this PR.

# Design
The device guard is used mainly by the op dispatcher in PyTorch. PyTorch already has a device guard abstraction, `c10::impl::DeviceGuardImplInterface`. In our design, we introduce an `XPUGuardImpl` class that inherits from `c10::impl::DeviceGuardImplInterface` and register it with PyTorch once the device switch management mechanism in `XPUGuardImpl` is implemented. Besides that, we introduce `XPUGuard`, `OptionalXPUGuard`, `XPUStreamGuard`, and `OptionalXPUStreamGuard`, all following the design of their CUDA counterparts. The corresponding C++ files are placed in the c10/xpu/ folder.

# Additional Context
It is unnecessary to add `Guard` code to the PyTorch frontend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118523
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #120315
2024-02-22 14:07:21 +00:00
dcfe463600 fix xpu build failure (#120315)
# Motivation
Fix the build failure introduced by [[DeviceIndex][6/N] Use DeviceIndex in more places](https://github.com/pytorch/pytorch/pull/120133): the parameter `total` is undefined at line 100, see https://github.com/pytorch/pytorch/pull/120133/files#diff-00eb8a6f5dfbc341ee9ab9aff0e3dbece8ad73483d4f41a005b1f453cb78221cR91-L102
[PR120133](https://github.com/pytorch/pytorch/pull/120133) forgot to add the label `ciflow/xpu`, so the XPU CI flow was not triggered.

# Solution
Referring to [Why is std::cout not printing the correct value for my int8_t number?](https://stackoverflow.com/questions/7587782), statically casting the int8_t to int16_t together with the condition `device >= 0 && device < total` is enough.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120315
Approved by: https://github.com/Skylion007, https://github.com/cyyever, https://github.com/malfet, https://github.com/EikanWang, https://github.com/gujinghui
2024-02-22 13:43:56 +00:00
faad8ecb26 Use opmath for sinc on CPU (#120311)
This aligns the implementation with CUDA and `torch.compile`

Fixes https://github.com/pytorch/pytorch/issues/118176 https://github.com/pytorch/pytorch/issues/49133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120311
Approved by: https://github.com/jgong5, https://github.com/Chillee
2024-02-22 12:37:50 +00:00
5c5b71b6ee Capture non tensor arguments in record_function (#120017)
Summary: RECORD_FUNCTION only captures an argument when it is a Tensor. However, it is very common for users to pass arguments with primitive data types (int, float, index, bool). This diff adds support for non-tensor arguments in RECORD_FUNCTION.

Test Plan:
unit test
    buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2 test_execution_trace_alone test_execution_trace_with_kineto test_execution_trace_start_stop test_execution_trace_repeat_in_loop test_execution_trace_no_capture

Differential Revision: D53674768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120017
Approved by: https://github.com/soulitzer
2024-02-22 09:40:08 +00:00
7e6bce9684 [amd] fix unused variable device_flags (#120369)
Summary:
We get a build error due to D53986297 (https://github.com/pytorch/pytorch/pull/119996):

```
caffe2/c10/cuda/__fb_c10_hipify_gen__/out/c10/hip/HIPStream.cpp:40:23: error: unused variable 'device_flags' [-Werror,-Wunused-variable]
static c10::once_flag device_flags[C10_COMPILE_TIME_MAX_GPUS];
```

Reviewed By: jianyuh, xw285cornell

Differential Revision: D54027737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120369
Approved by: https://github.com/xw285cornell, https://github.com/jianyuh
2024-02-22 09:36:59 +00:00
5210a22b39 Add basic shampoo test (#120293)
Fixes [T175418669](https://www.internalfb.com/intern/tasks/?t=175418669)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120293
Approved by: https://github.com/bdhirsh
2024-02-22 08:39:55 +00:00
354a436d96 Remove device assert in Gradscaler (#119362)
Fixes #119358

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Co-authored-by: ydwu4 <ydwu2014@gmail.com>
Co-authored-by: PyTorch UpdateBot <pytorchupdatebot@users.noreply.github.com>
Co-authored-by: Bin Bao <binbao@meta.com>
Co-authored-by: Shuqiang Zhang <sqzhang@meta.com>
Co-authored-by: Adnan Akhundov <aakhundov@meta.com>
Co-authored-by: Ting Lu <tingl@nvidia.com>
Co-authored-by: Yang Chen <yangche@fb.com>
Co-authored-by: cyy <cyyever@outlook.com>
Co-authored-by: Animesh Jain <anijain@umich.edu>
Co-authored-by: Jason Ansel <jansel@meta.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: wz337 <wz337@cornell.edu>
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Co-authored-by: Anthony Alayo <anthony.alayo@applovin.com>
Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Co-authored-by: Yifu Wang <yifu@fb.com>
Co-authored-by: Yukio Siraichi <yukio.siraichi@gmail.com>
Co-authored-by: atalman <atalman@fb.com>
Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: haozhe.zhu <haozhe.zhu@intel.com>
Co-authored-by: lezcano <lezcano-93@hotmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119362
Approved by: https://github.com/ezyang
2024-02-22 08:02:18 +00:00
fff9d98e58 Revert "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)"
This reverts commit e0268821dd2ea0e8a51b81c0ef3b18e77f68a33d.

Reverted https://github.com/pytorch/pytorch/pull/119639 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the Window failures are legit as they are failing now in trunk, i.e. 450339ab2d ([comment](https://github.com/pytorch/pytorch/pull/119639#issuecomment-1958428416))
2024-02-22 00:12:54 +00:00
8fa6340701 Revert "Ignore .numpy() under FakeTensorMode() (#120261)"
This reverts commit 952b37145b7bb526ea5907ac574e324d274b02ee.

Reverted https://github.com/pytorch/pytorch/pull/120261 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems breaking trunk on Python 3.12 952b37145b ([comment](https://github.com/pytorch/pytorch/pull/120261#issuecomment-1958267417))
2024-02-21 23:09:27 +00:00
cyy
1aad5c98b4 [structural binding][5/N] Replace std::tie with structural binding (#120142)
This PR follows https://github.com/pytorch/pytorch/pull/119774, it is a continued work to clean up std::tie.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120142
Approved by: https://github.com/albanD
2024-02-21 22:32:55 +00:00
d514df63ea Reenable triton tests and clean extra clones after the pin update (#120324)
Test Plan: just tests

Differential Revision: D54008642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120324
Approved by: https://github.com/aakhundov, https://github.com/sijiac
2024-02-21 22:25:33 +00:00
952b37145b Ignore .numpy() under FakeTensorMode() (#120261)
Fixes #120259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120261
Approved by: https://github.com/jansel
2024-02-21 22:06:29 +00:00
450339ab2d Test for fatal signal in test_pynode_destruction_deadlock (#120279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120279
Approved by: https://github.com/albanD
2024-02-21 21:53:51 +00:00
306642b66d [export] fix test_passes on ci (#120322)
We put the test case generation in unittest.setUp to avoid running export on machines that run Python 3.12, where dynamo is not supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120322
Approved by: https://github.com/angelayi, https://github.com/huydhn, https://github.com/malfet
2024-02-21 21:23:40 +00:00
e0268821dd Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)
Fixes #115331.

This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary:

- `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`.
- Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1`.
- Updated the `ArgumentInfo` struct, as it hardcodes the device index as an 8-bit field [^1]. This might be a breaking change; not sure if users rely on this.
- Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS`

[^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639
Approved by: https://github.com/cyyever, https://github.com/albanD
2024-02-21 21:10:49 +00:00
27c5bbe5cb Add is_nested_int() (#119975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119975
Approved by: https://github.com/jbschlosser
ghstack dependencies: #119661, #119974
2024-02-21 21:10:02 +00:00
2e77629b9f [pytrees] Allow tree_map_only to support predicate function as filter (#119974)
In many places in the code we use `tree_map_only((SymInt, SymBool, SymFloat), foo)` but with nested ints, it is possible to have SymInts that are non-symbolic, so we may want to do something like `tree_map_only(is_symbolic, foo)` instead.

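A minimal sketch of the predicate form this enables, using a plain-int predicate instead of `is_symbolic` so the snippet stays self-contained:

```
from torch.utils._pytree import tree_map_only

def is_negative(x):
    return isinstance(x, int) and x < 0

tree = {"a": -1, "b": (2, -3)}
# With predicate support, the filter can be a callable rather than a type
# (or tuple of types), so the decision is not limited to isinstance checks.
print(tree_map_only(is_negative, abs, tree))  # {'a': 1, 'b': (2, 3)}
```
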
Alternative: wrap nested int SymNodes with something other than SymInt.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119974
Approved by: https://github.com/zou3519
ghstack dependencies: #119661
2024-02-21 21:10:02 +00:00
722e87899a [Memory Snapshot] Clean up elem text (#120245)
Summary:
These UI changes were added:
- Prefix address with Addr: and size with Size:
- Add comma between addr and size
- Remove duplicate (${elem.size} bytes) print out

Test Plan:
Before:
![image](https://github.com/pytorch/pytorch/assets/17602366/2d9867d6-9cdb-405b-aa92-f0daf44f2ba7)
After:
![image](https://github.com/pytorch/pytorch/assets/17602366/c7bd97d3-fdc6-4832-ae35-97a02ea73907)

Differential Revision: D53953187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120245
Approved by: https://github.com/zdevito
2024-02-21 20:59:04 +00:00
a5893926f2 [dtensor] simplify outputs wrapping handling (#120297)
This PR simplifies the output wrapping handling in op dispatch, making it
easier to understand.

It also enables a new case: if the output DTensorSpec for the result is
None and the result is a scalar tensor, we just return the scalar
tensor instead of wrapping it in a DTensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120297
Approved by: https://github.com/wz337
2024-02-21 20:28:20 +00:00
e06978be4b [CI] Add initial inductor cpu smoketest for performance (#116456)
Co-authored-by: chuanqiw <chuanqi.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116456
Approved by: https://github.com/jgong5, https://github.com/atalman
2024-02-21 20:04:50 +00:00
9630bcbd49 [execution trace/chakra] remove backend_id from pg_info (#120038)
Summary:
PR #104373 (https://github.com/pytorch/pytorch/pull/104373) logged the backend using an unsafe dict lookup that might fail.
We decided to deprecate backend_id and use the pg id/name directly.

Differential Revision: D53676181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120038
Approved by: https://github.com/aaronenyeshi
2024-02-21 19:37:18 +00:00
e7eab2f07e Fix to keep stride in return_and_correct_aliasing() (#117860)
Fixes #117794

Fix tripped the assert here: 86dedebeaf/torch/utils/_python_dispatch.py (L216)

From investigation: I found that functionalization of an in-place op (`mul_` in this test case) results in the strides of `TwoTensor`'s `a` / `b` components being mutated to be contiguous. This is not reflected in the outer tensor, causing the assert to be tripped.

After discussion with Brian, I address this in this PR by disallowing input mutations on non-contiguous tensor subclass inputs for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117860
Approved by: https://github.com/bdhirsh
2024-02-21 19:15:27 +00:00
fa77829126 Remove bc linter label triggers after test-infra #4956 (#120148)
After https://github.com/pytorch/test-infra/pull/4956, mergebot will not block merge for a bc linter failure that has been suppressed.  The failure will be ignored instead.

This should help mitigate https://github.com/pytorch/test-infra/issues/4938 because the workflow will not be triggered multiple times when labels are attached.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120148
Approved by: https://github.com/clee2000
2024-02-21 18:36:38 +00:00
e87deb8004 fix: conversion of max memory allocated and reserved to GB (#120172)
Fixes #120171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120172
Approved by: https://github.com/soulitzer, https://github.com/aaronenyeshi
2024-02-21 18:04:47 +00:00
d336be2942 Update torch.mean() description about dtype restriction. (#120208)
Fixes #120173

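A short illustration of the restriction the updated description documents: integer inputs need an explicit floating dtype (or a cast) for the reduction to work.

```
import torch

x = torch.arange(6)  # int64 tensor
# torch.mean only supports floating point and complex dtypes, so an
# integral input needs an explicit dtype or an up-front cast.
print(torch.mean(x, dtype=torch.float32))  # tensor(2.5000)
print(x.float().mean())                    # tensor(2.5000)
```
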
Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120208
Approved by: https://github.com/soulitzer
2024-02-21 18:04:11 +00:00
9c64068ef8 [dynamo][guards-cpp-refactor] TypeGuardAccessor (#120089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120089
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068
2024-02-21 17:56:48 +00:00
ec6783990a [dynamo][guards-cpp-refactor] GlobalsGuardAccessor (#120068)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120068
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067
2024-02-21 17:56:48 +00:00
66c52d678f [dynamo][guards-cpp-refactor] GetItemGuardAccessor (#120067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120067
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065
2024-02-21 17:56:36 +00:00
7a0c2a9d0a [dynamo][guards-cpp-refactor] NO_TENSOR_ALIASING guard (#120065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120065
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064
2024-02-21 17:56:18 +00:00
8d5ae8c0b3 [dynamo][guards-cpp-refactor] TENSOR_ALIASING guard (#120064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120064
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062
2024-02-21 17:56:05 +00:00
034955b2fc [dynamo][guards-cpp-refactor] DATA_PTR_MATCH guard (#120062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120062
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061
2024-02-21 17:55:46 +00:00
cc6cf89c30 [dynamo][guards-cpp-refactor] GLOBAL_STATE guard (#120061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120061
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060
2024-02-21 17:55:32 +00:00
5066bec743 [dynamo][guards-cpp-refactor] DEFAULT_DEVICE guard (#120060)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120060
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833
2024-02-21 17:55:17 +00:00
8f3fd79b23 Native Half on ARM (#119483)
Summary: Native Half on ARM

Test Plan: sandcastle

Differential Revision: D53585776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119483
Approved by: https://github.com/ezyang, https://github.com/jgong5
2024-02-21 17:46:16 +00:00
29b2131c62 [Inductor] Fix bug around out of order constexprs in inductor (#120287)
Inductor's signature/config generation code assumes that all constexprs come as the last arguments of the function. This is not always true for user-defined kernels.

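A hypothetical user-defined kernel of the problematic shape, where a `tl.constexpr` argument is not the last parameter; this is the case the signature/config generation now has to handle:

```
import triton
import triton.language as tl

@triton.jit
def add_one_kernel(x_ptr, BLOCK: tl.constexpr, out_ptr, n_elements):
    # The constexpr BLOCK sits between ordinary runtime arguments instead
    # of coming last, which the previous Inductor code assumed.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + 1.0, mask=mask)
```
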
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120287
Approved by: https://github.com/jansel
2024-02-21 17:39:41 +00:00
cfddfce0d3 Alternate sharding (#119078)
Changes sharding to attempt to put all serial tests on as few shards as possible. Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards.

Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests

Move serial tests to run first

If I want to move to purely numbers-based sharding, this ensures that parallel tests run alongside other parallel tests as much as possible instead of interleaving serial and parallel tests (which decreases the effectiveness of parallelization), while also ensuring that test reordering is still mostly effective.

See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
2024-02-21 16:40:27 +00:00
a24cba35b0 [c10d][flight recorder] dump additinal NCCL debug info (#120063)
Summary:
This PR is mainly about the flight recorder side of the changes: it takes a
map of maps as input and dumps it as a picklable object. It also adds
functions that should be compiled only when NCCL_COMM_DUMP is defined.
Test Plan:
Integration tests with NCCL will be done later; here we only do the
c10d side of the dump test, aka NCCLTraceTest.

Testing the dump function is a bit tricky, as we don't have
existing C++ unit tests for it. So we still use the Python NCCLTraceTest with
the Python binding of _dump_nccl_trace(): we manually feed
dump_nccl_trace with a map of test info, assert on the pickled result, and
print the converted Python dict:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (main)]$  python
test/distributed/test_c10d_nccl.py NCCLTraceTest
NCCL version 2.19.3+cuda12.0
[rank0]:[E ProcessGroupNCCL.cpp:1200] [PG 0 Rank 0] ProcessGroupNCCL
preparing to dump debug info.
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.
----------------------------------------------------------------------
Ran 8 tests in 95.761s
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120063
Approved by: https://github.com/wconstab
2024-02-21 16:35:23 +00:00
06bc203c7b Update dynamo_test_failures list (#120271)
This PR removes and adds some failures and successes that were hidden in the past week (ish).

https://github.com/pytorch/pytorch/pull/119408 (47182a8f4b5e36e280ca3595ba134f53499d2dc9) accidentally removed environment variables on rerun (see PR body of https://github.com/pytorch/pytorch/pull/120251 for slightly more details).

Testing with dynamo is enabled via an env var, so if a test failed with dynamo, it would rerun without the dynamo env var set, making it pass on retry. Normally, the flaky-test bot would catch this and file an issue for the test, but the CI env var controls whether or not XML test reports get made, and that also got removed on rerun, so the XMLs weren't made either.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120271
Approved by: https://github.com/DanilBaibak, https://github.com/zou3519
2024-02-21 16:34:34 +00:00
9199468401 Properly trace into mark_static (#120232)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120232
Approved by: https://github.com/yanboliang
2024-02-21 13:51:31 +00:00
d38a3627a5 Support privateUser1 key in RNN op. (#118182) (#118351)
Support privateUser1 key in RNN op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118351
Approved by: https://github.com/bdhirsh
2024-02-21 13:51:27 +00:00
eae025b1d7 Fix bug with block pointer multi dim args (#120263)
Summary:
Now we can parse statements like
```
%22 = tt.make_tensor_ptr %20, [%21, %c128_i64], [%c2048_i64, %c1_i64], [%1, %c0_i32]
```

Test Plan:
Added new test

```
buck2 test mode/opt //hammer/ops/tests/inductor:ragged_hstu_test
```
now passes again with optimizations

Differential Revision: D53975130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120263
Approved by: https://github.com/aakhundov, https://github.com/sijiac
2024-02-21 09:06:20 +00:00
cyy
3cd6a21e8f [DeviceIndex][6/N] Use DeviceIndex in more places (#120133)
This PR follows the series of patches beginning with #119142 and fixes various XPU and python related methods to use DeviceIndex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120133
Approved by: https://github.com/Skylion007
2024-02-21 06:24:23 +00:00
cyy
d5d13ab15e Remove C10_FALLTHROUGH (#120157)
Since [[fallthrough]] is supported in our C++17 compilers and no other repo is using it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120157
Approved by: https://github.com/Skylion007
2024-02-21 06:18:58 +00:00
d6801578c3 Update tracing rules for new cudnn functions (#120268)
# Summary
This updates the trace_rules with the new cudnn torch functions for sdpa

To repro:
`pytest test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120268
Approved by: https://github.com/shuqiangzhang, https://github.com/huydhn, https://github.com/yanboliang
2024-02-21 05:22:44 +00:00
65519d183b Remove old optimizer tests (#120257)
Removes old tests now that all configs are covered in test_compiled_optimizers.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120257
Approved by: https://github.com/eellison
2024-02-21 05:11:23 +00:00
b4cef25a1e add register_device_op_overrides (#119268)
Fixes #119267

Currently https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/common.py#L106 only supports built-in device functions; this PR adds a register function to obtain the overrides class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119268
Approved by: https://github.com/jansel
2024-02-21 04:53:07 +00:00
3993771617 Expose recordSize in ChunkRecordIterator (#120239)
Summary: Add a public method to read recordSize in ChunkRecordIterator

Test Plan: ci

Differential Revision: D53931944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120239
Approved by: https://github.com/zoranzhao
2024-02-21 04:33:03 +00:00
26610175d2 pass device_str for async_compile.triton function (#120202)
Fixes #120203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120202
Approved by: https://github.com/jansel
2024-02-21 03:48:57 +00:00
800e9acd43 [inductor] fix bandwidth extimation for StarDep (#120266)
A lot of HF models fail when inductor_config.benchmark_kernel is enabled. The reason is that the bandwidth estimation code assumes every dependency has an index, but StarDep does not. An exception is raised when StarDep.index is accessed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120266
Approved by: https://github.com/eellison, https://github.com/jansel
2024-02-21 03:33:45 +00:00
20f7e5a719 Remove dependency of triton during inductor codegen (#120193)
Fixes #120192

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120193
Approved by: https://github.com/jansel
2024-02-21 01:09:48 +00:00
dd6b5e236e Prepare test_inductor_collectives.py for native funcol migration (#120025)
There are some tests in this file that are impl specific, e.g. verifying generated code via `FileCheck`. These tests are covered for native funcol in test_c10d_functional_native.py, therefore marking them with `@run_with_legacy_funcol`.

Other tests are marked with `@run_with_both_funcol_impls`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120025
Approved by: https://github.com/wanchaol
ghstack dependencies: #119982
2024-02-21 00:46:25 +00:00
af765dbdfd [ez] Explicit env for run_test (#120251)
env=None (which is the default) inherits the env from the calling process.  Explicitly set the env to the calling process env so that things can be added to it later

Tested in: e7b4d8ec88
Checked that test-reports (which depend on the CI env var) get made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120251
Approved by: https://github.com/huydhn
2024-02-21 00:40:19 +00:00
a1fc29cd78 Revert "[pytree] add function tree_iter (#120155)"
This reverts commit 372d078f361e726bb4ac0884ac334b04c58179ef.

Reverted https://github.com/pytorch/pytorch/pull/120155 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/120155#issuecomment-1955479765))
2024-02-21 00:21:28 +00:00
701f651f9c Change the parameter type from int to float in torch.nn.Softplus (#120183)
Fixes #120175

1. The C++ API uses double:
f2cf0768d1/torch/csrc/api/include/torch/nn/options/activation.h (L501).

2. The type is also double in the test case:
f2cf0768d1/test/cpp/api/functional.cpp (L1788)

3. A float parameter works perfectly fine in Python:
```
import torch
import torch.nn as nn

m = nn.Softplus(beta=0.1, threshold=1.2)
input = torch.randn(2)
output = m(input)

print(output)
# tensor([7.3749, 7.6852])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120183
Approved by: https://github.com/mikaylagawarecki
2024-02-21 00:14:38 +00:00
35891e5007 Explicitly set nn.Module.set_extra_state return type to None (#120161)
Implicitly, the return type of `set_extra_state` is `NoReturn` since it always raises an error, and pyright will complain about mismatched return types if you override it with an implementation that doesn't also always raise an error. If we explicitly hint the return type as `None` (how we expect people to override it), we can avoid this error message.

```
Method "set_extra_state" overrides class "Module" in an incompatible manner
    Return type mismatch: base method returns type "NoReturn", override returns type "None"
```
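
With the `None` annotation, an override written the expected way type-checks cleanly; a minimal sketch:

```
import torch.nn as nn

class MyModule(nn.Module):
    def get_extra_state(self):
        return {"version": 1}

    # Returning None (rather than always raising) now matches the base
    # class annotation, so strict type checkers no longer flag the override.
    def set_extra_state(self, state) -> None:
        self.version = state["version"]
```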
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120161
Approved by: https://github.com/mikaylagawarecki
2024-02-20 23:57:36 +00:00
e54c4e8659 [aot_autograd] handle subclass input mutations correctly in collect_metadata_analysis.py (#120136)
This PR fixes the issue in https://github.com/pytorch/pytorch/issues/120188.

In collect_metadata_analysis.py, handling of input/output mutations was different from handling in other locations. In other locations, MUTATED_OUT_GRAPH was used to indicate that mutation would require returning an output; in collect_metadata_analysis.py, any type of mutation was being handled as if it would require returning an output.

This PR changes collect_metadata_analysis to match other callsites and refactors computation of mutation types so that it is a property of the dataclass instead of something that needs to be computed manually when constructing an InputAliasInfo.

Differential Revision: [D53950998](https://our.internmc.facebook.com/intern/diff/D53950998)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120136
Approved by: https://github.com/bdhirsh
ghstack dependencies: #120141
2024-02-20 23:30:57 +00:00
b36404159d [aot_autograd] support inplace mutations for subclasses (#120141)
This PR removes the conditional logic depending on requires_subclass_dispatch for mutation handling.

Inputs are labeled with one of three labels: NOT_MUTATED, MUTATED_IN_GRAPH, or MUTATED_OUT_GRAPH. MUTATED_IN_GRAPH indicates mutation that is allowed in the aot autograd graph; MUTATED_OUT_GRAPH indicates mutation that is not allowed in the graph, so the result is computed, returned, and then assigned back to the input after the graph.

Previously, there was logic to handle subclasses differently, so that MUTATED_IN_GRAPH + subclasses would behave like MUTATED_OUT_GRAPH.

This PR simplifies aot_autograd's handling of mutations so that MUTATED_IN_GRAPH will always be handled in graph, even when subclasses are present. Note that there are still some cases where subclass support won't be handled correctly.

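A rough sketch of the kind of program this covers, shown with a plain tensor for brevity; the point of the PR is that the same in-graph handling now also applies when the mutated input is a tensor subclass:

```
import torch

@torch.compile
def f(x):
    x.mul_(2)   # input mutation that can stay in the graph (MUTATED_IN_GRAPH)
    return x + 1

x = torch.ones(4)
out = f(x)
print(out, x)   # x has been doubled in place
```
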
Differential Revision: [D53950999](https://our.internmc.facebook.com/intern/diff/D53950999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120141
Approved by: https://github.com/bdhirsh
2024-02-20 23:30:57 +00:00
96092e1f55 Extend aot_graph_input_parser to sym shapes (#120246)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120246
Approved by: https://github.com/shunting314
2024-02-20 23:24:45 +00:00
7acdd08fcc [FSDP2] Used stream APIs for CUDA event handling (#120231)
If we already have Python `Stream` objects, then calling `stream1.wait_stream(stream2)` is syntactic sugar for creating an `event: Event`, recording it in `stream2`, and calling `stream1.wait_event(event)`.

~~Getting a Python `Stream` object incurs some CPU overhead, so we prefer to not change other callsites where we do not already have the `Stream` objects.~~
Update: Calling `event.record()` with no stream specified calls `torch.cuda.current_stream()`, so the overhead should be identical.

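The equivalence described above, written out as a sketch (assuming CUDA is available):

```
import torch

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

# Sugar form: s1 waits for all work currently enqueued on s2.
s1.wait_stream(s2)

# Desugared form: record an event on s2 and make s1 wait on it.
event = torch.cuda.Event()
event.record(s2)
s1.wait_event(event)
```
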
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120231
Approved by: https://github.com/yifuwang
ghstack dependencies: #118298, #119985
2024-02-20 21:35:46 +00:00
dfb83df889 Revert "Add cpp stack traces to our own reruns (#119408)"
This reverts commit 47182a8f4b5e36e280ca3595ba134f53499d2dc9.

Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/clee2000 due to iirc the default setting of env to None causes it to inherit the env of the calling process, I'll make a PR that makes it so that the old env vars don't disappear, and then re merge this on top of it.  Reverting this because I think some important env vars are disappearing (specifically CI) ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1955128676))
2024-02-20 21:28:13 +00:00
2d6c0cc81b Run test_functional_api.py with both legacy and native funcol impls (#119982)
Additional changes: tests in test_functional_api.py use a multi-threaded pg which is implemented in Python. For the native ops to call into the Python pg implementation, glue code in PyProcessGroup is required for each collective. This PR also adds a few pieces of previously missing glue code, which are necessary for running test_functional_api.py with native funcol.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119982
Approved by: https://github.com/wanchaol
2024-02-20 21:15:37 +00:00
d42ede8ae4 [torch.compile] Log compilation start time for timeline view (#120220)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120220
Approved by: https://github.com/angelayi
2024-02-20 21:07:40 +00:00
be8ba5ef2d Revert "use two pass reduction for deterministic reduction order (#11… (#120243)
This reverts commit cc7ef43423fe36cf1778a9c9643454d62050a5b5.

Manual revert because of the conflict in: test/inductor/test_cpu_repro.py , conflict with this PR: https://github.com/pytorch/pytorch/pull/118365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120243
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-02-20 20:50:29 +00:00
4f0f25b7ce [Inductor][bugFix] fix a bug in merge_splits (#119956)
Summary: RecGPT got a KeyError when running split_cat; it was caused by hitting a corner case.

Test Plan: P1184947021

Differential Revision: D53791839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119956
Approved by: https://github.com/jackiexu1992
2024-02-20 20:38:34 +00:00
957f37686a Refactor instance_descriptor for new triton version (#119636)
Check https://github.com/pytorch/pytorch/pull/119457#issuecomment-1936764161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119636
Approved by: https://github.com/shunting314
2024-02-20 20:26:35 +00:00
8464654ae4 Add missing words to torch.utils.checkpoint doc (#120196)
This PR adds a couple of missing words to the checkpointing documentation; it doesn't have a specific issue number related to it.

Changes are:
- "backward." -> "backward propagation."
- "to be advanced than" -> "to be more advanced than"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120196
Approved by: https://github.com/soulitzer
2024-02-20 20:18:42 +00:00
b33e8d3f6b [Inductor][fx pass] Add split cat pattern to remove cat nodes (#115004)
Summary: As titled.

Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/8e4179db-363a-41b5-8bd7-cc445a512f6f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/15762598708548039
Network: Up: 91KiB  Down: 32KiB  (reSessionID-b0985d82-1919-49c5-b307-ee0ab49b4738)
Jobs completed: 28. Time elapsed: 1:27.1s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 11. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce (IG_CTR)
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
P895047189

Differential Revision: D51777617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115004
Approved by: https://github.com/jackiexu1992
2024-02-20 19:35:20 +00:00
cccacf6c8e add a test that non_overlapping checks dont generate too many guards (#120106)
Pre-emptive test in OSS to ensure that models relying on the "non-overlapping guards" checks do not suffer drastically w.r.t. guard slowness. Current plan is to follow up on this with a "real" fix, to generate a linear number of these guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120106
Approved by: https://github.com/mlazos
2024-02-20 18:38:59 +00:00
6d82a7e9b0 Add pixel_shuffle to core aten decomps (#120092)
Summary:
https://github.com/pytorch/pytorch/pull/118239 added a decomposition
for pixel_shuffle, so pixel_shuffle no longer needs to be a Core ATen Op. We
have also fixed the internal use case so that it no longer special cases on
pixel_shuffle, allowing us to revert the changes in
https://github.com/pytorch/pytorch/pull/118921.

Test Plan: CI

Differential Revision: D53860966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120092
Approved by: https://github.com/ydwu4
2024-02-20 18:37:32 +00:00
53bfae2c06 [MPS] Add torch.fft. support (#119670)
Increase tolerance for `fft` ops; this warrants further investigation, as the error grows with larger matrix dimensions (see https://github.com/pytorch/pytorch/issues/120237 ).

When compiling on macOS 13, implement `+[FakeMPSGraphFFTDescriptor descriptor]` as a redispatch to the real thing.

Fixes https://github.com/pytorch/pytorch/issues/78044
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119670
Approved by: https://github.com/kulinseth, https://github.com/albanD
2024-02-20 18:23:06 +00:00
5f3f8fd3c7 [Inductor] Setting kernel launch and exit callbacks for inductor generated triton kernels (#119450)
`CompiledKernel.launch_enter_hook` and `CompiledKernel.launch_exit_hook` are hooks that allow external tools to monitor the execution of Triton kernels and read each kernel's metadata. Initially, these hooks have a value of `None`.

Triton's kernel launcher passes hooks and kernel metadata by default, while Inductor's launcher doesn't. This PR could unify the parameters passed to both launchers so that tools can get information from both handwritten Triton kernels and Inductor-generated Triton kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119450
Approved by: https://github.com/soulitzer
2024-02-20 16:58:20 +00:00
d3839b624b [ROCm] HIP Lazy Streams (#119996)
For ROCm/HIP, each stream is lazily initialized rather than creating all streams when the first stream is requested. HIP streams are not as lightweight as CUDA streams; the pooling strategy can affect performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119996
Approved by: https://github.com/ezyang
2024-02-20 16:24:04 +00:00
26fbbc3e84 DTensor + dynamo: fix is_shard/replicate always inlining to False (#118668)
Fixes an internal enablement bug. When dynamo traces `is_sharded`/`is_replicate`, it would unconditionally assume the result was False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118668
Approved by: https://github.com/wconstab, https://github.com/wanchaol
ghstack dependencies: #117667, #117666, #118209, #118191, #118667
2024-02-20 15:23:48 +00:00
609cde94f9 DTensor: use memory_format in the hash for all aten ops that use that arg (e.g. aten.clone) (#118667)
This fixes an internal DTensor enablement bug (I don't have an OSS issue for it)

I finally root-caused this as follows:

(1) we were fakeifying a DTensor graph input that was an autograd non-leaf (it had a grad_fn)

(2) that caused it to go through this `clone()` call during fakeification: https://github.com/pytorch/pytorch/blob/main/torch/_subclasses/meta_utils.py#L549

(3) `clone(torch.preserve_format)` is supposed to return another DTensor with the same strides as the input, but I noticed we were returning a DTensor with contiguous strides incorrectly.

(4) It turns out that DTensor was hashing on the sharding strategy for `aten.clone`, regardless of the `memory_format` kwarg that was passed in.

I could have manually updated the `clone` sharding strategy registration to take `memory_format` into account. But instead, I figured that every aten op with a sharding strategy needs to handle the memory_format kwarg specially - so I tried to generically force DTensor to consider all ATen ops that take a `memory_format` kwarg during hashing.

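Point (3) on plain tensors, which is the behavior DTensor needs to reproduce and why `memory_format` has to participate in the hashing:

```
import torch

x = torch.randn(4, 3).t()   # non-contiguous view with strides (1, 3)
y = x.clone(memory_format=torch.preserve_format)
z = x.clone(memory_format=torch.contiguous_format)
print(x.stride(), y.stride(), z.stride())  # (1, 3) (1, 3) (4, 1)
```
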
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118667
Approved by: https://github.com/wanchaol
ghstack dependencies: #117667, #117666, #118209, #118191
2024-02-20 15:23:48 +00:00
6819452a08 fix multiple-fake-modes bug with compile + subclasses (#118191)
This should fix the "multiple fake modes" errors we've been seeing with both float8 tensor and DTensor.

Haven't added a test yet - will add one before landing.

I also have a separate PR that would have made the error significantly nicer (the bad error resulted from us returning a FakeTensor at runtime): https://github.com/pytorch/pytorch/pull/118644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118191
Approved by: https://github.com/drisspg
ghstack dependencies: #117667, #117666, #118209
2024-02-20 15:23:41 +00:00
b4b1480b06 remove redundant to_dtype in Fused Schedular Nodes (#118365)
Fix https://github.com/pytorch/pytorch/issues/115260.
This issue is triggered by `FusedSchedulerNodes` cases.
We always store the `lowp buffer` to `store_cache`, then load the `lowp buffer` from `store_cache` and `convert it to float` before the `compute ops`.
Now we generate a `{key: to(float32)_expr, value: the float32 cse var before to_dtype and store}` entry in `cse.cache`.
Then the `to_dtype(float32)` after `load` hits this cache and does not generate a new var with cast code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118365
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-02-20 13:35:03 +00:00
c28a43988e Fix typo under aten/src/ATen/native directory (#119686)
This PR fixes typo in comments and msgs under `aten/src/ATen/native` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119686
Approved by: https://github.com/lezcano, https://github.com/malfet
2024-02-20 06:31:10 +00:00
389b56b4c4 [dynamo][guards-cpp-refactor] GetAttrGuardAccessor (#119833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119833
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827
2024-02-20 05:33:08 +00:00
96f45d15d8 [dynamo][guards-c++-refactor] EQUALS_MATCH guard (#119827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119827
Approved by: https://github.com/jansel
ghstack dependencies: #119822
2024-02-20 05:33:08 +00:00
0802951081 [dynamo][guards-c++-refactor] Introduce LeafGuard, GuardManager and GuardAccessor classes (#119822)
The full blown implementation is in this stack - https://github.com/pytorch/pytorch/pull/110590 which is passing all the test cases on CI. That stack is hard to review. So, breaking apart.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119822
Approved by: https://github.com/jansel
2024-02-20 05:33:08 +00:00
0512ba43ab [executorch hash update] update the pinned executorch hash (#120214)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120214
Approved by: https://github.com/pytorchbot
2024-02-20 04:13:02 +00:00
a7e2b609d3 Skip less replacements (#119570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119570
Approved by: https://github.com/ezyang
2024-02-20 04:10:33 +00:00
cc7ef43423 use two pass reduction for deterministic reduction order (#115620)
## Motivation
Address the [non-deterministic reduction order](https://github.com/pytorch/pytorch/issues/93542#issuecomment-1411294181) issue for `omp parallel reduction`.

## Latest update on 1.15:
55d81901bc.
Do not reduce into the array inside loops. Instead, reduce into a local scalar and write it to the array after the local reduction is done. This allows the compiler to keep the reduction variable in a register instead of reading/writing memory. If the `working set` of the `loop body` is quite large, the gap between register and memory reads/writes is significant.
```
vaddss (%xmm0, %xmm11, %xmm11) -> accumulate in register %xmm0
vaddssl ((%rdx, %rdi, 4), %xmm0, %xmm0) -> accumulate in memory address (%rdx, %rdi, 4)
```
Examples code:
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    #pragma omp for
    for(...){
        ....
        tmp0_acc_arr[tid] = tmp0_acc_arr[tid] + tmp_x;  // array accesses always go through memory
    }
}
```
will be changed to
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    **auto tmp0_acc_local = 0;**
    #pragma omp for
    for(...){
        ....
        **tmp0_acc_local**  = tmp0_acc_local + tmp_x;
    }
    **tmp0_acc_arr[tid] = tmp0_acc_local;**
}
```

## Descriptions
Following ATen, use a `two pass reduction` with `omp parallel` for a deterministic reduction order.
9c3ae37fc4/aten/src/ATen/Parallel-inl.h (L39)
9c3ae37fc4/aten/src/ATen/native/TensorIteratorReduce.cpp (L24)
```
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            // init reduction buffer per thread
            float tmp_acc0_arr[64];
            at::vec::Vectorized<float> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
                tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(3964928L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0));
                    auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0));
                    auto tmp2 = tmp0 - tmp1;
                    auto tmp3 = tmp2 * tmp2;
                    // reduce to per thread buffers
                    tmp_acc0_vec_arr[tid] = tmp_acc0_vec_arr[tid] + tmp3;
                }
            }
            // second pass reduce
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 + tmp_acc0_arr[tid];
                tmp_acc0_vec = tmp_acc0_vec + tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec);
            out_ptr0[static_cast<long>(0L)] = static_cast<float>(tmp_acc0);
```

## Test results
I tested this PR with the dynamo benchmarks on a 32-core ICX system.
Result (avg speedup):
| suite | before this PR | after this PR |
| ---- | ---- | ---- |
| torchbench | 1.303 | 1.301 |
| huggingface | 1.346 | 1.343 |
| timm_models | 1.971 | 1.970 |

```
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

multi_threads_test() {
    CORES=$(lscpu | grep Core | awk '{print $4}')
    export OMP_NUM_THREADS=$CORES
    end_core=$(expr $CORES - 1)
    numactl -C 0-${end_core} --membind=0 python benchmarks/dynamo/${SUITE}.py --${SCENARIO} --${DT} -dcpu -n50 --no-skip --dashboard --only "${MODEL}" ${Channels_extra} ${BS_extra} ${Shape_extra} ${Mode_extra} ${Wrapper_extra} ${Flag_extra} --timeout 9000 --backend=inductor --output=${LOG_BASE}/${SUITE}.csv
}

SCENARIO=performance
DT=float32
export TORCHINDUCTOR_FREEZING=1
Flag_extra="--freezing"
Mode_extra="--inference"

for suite in timm_models huggingface torchbench
do
  export SUITE=$suite
  echo $SUITE
  export LOG_BASE=`date +%m%d%H%M%S`
  mkdir $LOG_BASE
  multi_threads_test
done
```
System info
```
ubuntu@ip-172-31-18-205:~/hz/pytorch$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            6
    BogoMIPS:            5800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic mo
                         vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xs
                         aveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   1.5 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    40 MiB (32 instances)
  L3:                    54 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115620
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-02-20 00:46:59 +00:00
ae7830051d [BE] Delete GCC-7 ICE workarounds (#120122)
As one needs gcc-9 to compile PyTorch, those workarounds are no longer relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120122
Approved by: https://github.com/huydhn, https://github.com/kit1980, https://github.com/Skylion007
2024-02-20 00:31:20 +00:00
0bdeaad936 Revert "add register_device_op_overrides (#119268)"
This reverts commit 2864a7e161cc107f7e4c00cccdf860a6089c73c3.

Reverted https://github.com/pytorch/pytorch/pull/119268 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/119268#issuecomment-1953231324))
2024-02-19 22:31:32 +00:00
3ad067fe2b [CPP] Update GCC minversion check to 9 or newer (#120126)
It's already a requirement for building PyTorch, but it should also be a requirement for linking extensions against it, as using an older compiler can lead to runtime crashes: the `std::optional` template layout is incompatible between gcc-9 and older compilers.

Also, update the minimum supported clang version to 9.x (used to build Android), as clang-5 is clearly not C++17 compliant.

Fixes https://github.com/pytorch/pytorch/issues/120020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120126
Approved by: https://github.com/Skylion007
2024-02-19 22:05:00 +00:00
48bdd0fb47 [ROCm] TunableOp bugfix filename handling (#120144)
Fixes nightly wheel seg fault during pytorch shutdown.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120144
Approved by: https://github.com/xw285cornell
2024-02-19 21:31:29 +00:00
f1fbba8f35 Revert "Fix lint after #119268 (#120207)"
This reverts commit d9d0f1dccc59ce6f0cb150ac236654c24a0d1118.

Reverted https://github.com/pytorch/pytorch/pull/120207 on behalf of https://github.com/atalman due to Broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/120207#issuecomment-1953170249))
2024-02-19 21:21:12 +00:00
a73a98c9ae Revert "Updating sleef submodule to eb3d97785 to fix export errors (#119953)"
This reverts commit fa9cbdce993601276765ad7701871f7e04a400c6.

Reverted https://github.com/pytorch/pytorch/pull/119953 on behalf of https://github.com/atalman due to Broke trunk linux-focal-cpu-py3.10-gcc9-bazel-test and linux-focal-cuda12.1-py3.10-gcc9-bazel-test. These are not flaky failures. ([comment](https://github.com/pytorch/pytorch/pull/119953#issuecomment-1953118780))
2024-02-19 20:26:33 +00:00
d9d0f1dccc Fix lint after #119268 (#120207)
Fixes lint after: https://github.com/pytorch/pytorch/issues/119268

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120207
Approved by: https://github.com/davidberard98
2024-02-19 20:01:45 +00:00
92bf2a4550 [torchbench] Update skipped models. (#120117)
This PR updates the list of benchmarks that should (not) be skipped. Here's a summary of
the changes:

- `detectron2_maskrcnn`: #120115
- `fambench_xlmr`: moved to canary models
- `hf_Bert` and `hf_Bert_large`: pass
- `maml`: pass
- `clip`: renamed to `hf_clip`
- `gat`, `gcn`, and `sage`: moved to canary models

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120117
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-02-19 18:08:32 +00:00
637cf4a3f2 Test parametrization utils for native funcol migration (#119950)
```
Between the time we switch to the native funcol by default and the time when
we are confident that we can remove the legacy implementation, we want to
ensure that the legacy funcol remains covered by unit tests. This is to
prepare for any potential (but unlikely) reverts. The following utilities
help achieve this goal.

run_with_{native,legacy}_funcol - mark a test to run with only
{native,legacy} funcol. These decorators are for impl specific tests (e.g.
verifying generated code with FileCheck).

run_with_both_funcol_impls - parametrize a test to run with both legacy and
native funcol.

run_with_both_funcol_impls_with_arg - same as run_with_both_funcol_impls, but
passes `enable_native_funcol` to the test so impl specific checks can be
carried out.
```

This PR also marks some tests we want to cover in this fashion. More tests will be marked in subsequent PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119950
Approved by: https://github.com/wanchaol
ghstack dependencies: #119881
2024-02-19 02:46:03 +00:00
40786ca509 Handle unwaited work objects on process termination (#119881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119881
Approved by: https://github.com/wconstab
2024-02-19 02:46:02 +00:00
84de851539 [Inductor] Enable the decomposition of quant/dequant per channel (#119177)
**Summary**
Part 2 of fixing https://github.com/pytorch/pytorch/issues/119141, which needs vectorized code generation for per-channel quant with the int8 data type.
Enable decomposition of per-channel quant/dequant so that vectorized code can be generated for it.
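
For reference, a minimal eager-mode sketch of the per-channel fake-quant pattern this targets (an illustration using the public `torch.fake_quantize_per_channel_affine` op, not the decomposition itself; the shapes and scales below are made up):

```python
import torch

# Per-channel fake quant along dim 1 with int8 bounds.
x = torch.randn(2, 4, 8)
scale = torch.rand(4) * 0.1 + 0.01
zero_point = torch.zeros(4, dtype=torch.int32)
y = torch.fake_quantize_per_channel_affine(
    x, scale, zero_point, axis=1, quant_min=-128, quant_max=127
)
print(y.shape)  # torch.Size([2, 4, 8])
```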

**TestPlan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_per_channel_fake_quant_uint8
python -u -m pytest -s -v test_cpu_repro.py -k test_per_channel_fake_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_per_channel_fake_quant_uint8_bf16_input
python -u -m pytest -s -v test_cpu_repro.py -k test_per_channel_fake_quant_int8_bf16_input
```

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119177
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-02-19 01:30:44 +00:00
fa9cbdce99 Updating sleef submodule to eb3d97785 to fix export errors (#119953)
Fixes #119952 with submodule updates

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119953
Approved by: https://github.com/ezyang
2024-02-19 00:56:24 +00:00
f2cf0768d1 [dynamo][distributed] handle _rank_not_in_group, _get_or_create_default_group (#119628)
Copy of #117692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119628
Approved by: https://github.com/yanboliang
2024-02-18 22:34:35 +00:00
372d078f36 [pytree] add function tree_iter (#120155)
Fixes #119768

- #119768

This PR adds a new function `tree_iter` that lazily iterates over the tree leaves. It differs from the `tree_leaves` function in that the latter traverses the whole tree first to build a list of leaves.

```python
for leaf in tree_iter(tree):
    ...
```

is much more efficient than:

```python
for leaf in tree_leaves(tree):
    ...
```

where `tree_leaves(tree)` is `list(tree_iter(tree))`.
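
For intuition, a minimal sketch of what a lazy tree iterator can look like for plain lists/tuples/dicts (the actual `tree_iter` handles all registered pytree node types):

```python
from typing import Any, Iterator

def tree_iter_sketch(tree: Any) -> Iterator[Any]:
    # Recurse into common container types and yield leaves lazily.
    if isinstance(tree, (list, tuple)):
        for child in tree:
            yield from tree_iter_sketch(child)
    elif isinstance(tree, dict):
        for child in tree.values():
            yield from tree_iter_sketch(child)
    else:
        yield tree

# Stops after the first leaf without materializing the rest of the tree.
first = next(tree_iter_sketch({"a": [1, 2], "b": (3, 4)}))
print(first)  # 1
```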

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120155
Approved by: https://github.com/vmoens
2024-02-18 09:16:50 +00:00
61a3a7628c [nit][DTensor][Test] Update test name to reflect the actual test (#118960)
test_name: test_partial_mul_failure -> test_partial_mul

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118960
Approved by: https://github.com/XilunWu
2024-02-18 08:23:06 +00:00
2864a7e161 add register_device_op_overrides (#119268)
Fixes #119267

Currently, https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/common.py#L106 only supports built-in device functions, so I'm going to add a register function to get the overrides class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119268
Approved by: https://github.com/jansel
2024-02-18 06:11:54 +00:00
70bc3b3be4 [executorch hash update] update the pinned executorch hash (#120165)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120165
Approved by: https://github.com/pytorchbot
2024-02-18 03:44:50 +00:00
d74bdd5042 [inductor] Always allow 64 bit in next_power_of_2 (#120164)
see #120153 #120152
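
For context, a minimal sketch of a `next_power_of_2` helper that also handles values above 2**32, which is the behavior this change allows (an assumption for illustration; the actual Inductor helper may differ):

```python
def next_power_of_2(n: int) -> int:
    # Smallest power of two >= n, valid for arbitrarily large Python ints.
    if n <= 1:
        return 1
    return 1 << (n - 1).bit_length()

assert next_power_of_2(6) == 8
assert next_power_of_2(2**40 + 1) == 2**41
```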

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120164
Approved by: https://github.com/yanboliang
2024-02-18 03:22:46 +00:00
de15781af0 [cuDNN] Bump cuDNN frontend submodule to 1.1.1 (#120137)
Hopefully addresses the failure seen when trying to bump to 1.1.0 (#119642) CC @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120137
Approved by: https://github.com/Skylion007
2024-02-18 02:57:02 +00:00
b642a18e80 [dynamo] Use EQUALS_MATCH guard for mod.training (#120147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120147
Approved by: https://github.com/jansel
ghstack dependencies: #120132, #120140, #120145
2024-02-18 00:31:36 +00:00
0b11b0edd6 [dynamo][refactor] Use existing helper functions for CLOSURE_MATCH (#120145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120145
Approved by: https://github.com/jansel, https://github.com/Fidget-Spinner
ghstack dependencies: #120132, #120140
2024-02-18 00:31:36 +00:00
0c972c7c4e enhance next_power_of_2 function (#120153)
Fixes #120152

cc  @ezyang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @jansel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120153
Approved by: https://github.com/jansel
2024-02-17 20:18:46 +00:00
2fea475215 [dynamo] Refactor reconstruct() not to return anything (#120150)
This simplifies things slightly and avoids some bugs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120150
Approved by: https://github.com/yanboliang
2024-02-17 17:13:41 +00:00
757fc663a8 [dynamo][refactor] Use TYPE_MATCH instead of manually constructing guard (#120140)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120140
Approved by: https://github.com/jansel, https://github.com/yanboliang
ghstack dependencies: #120132
2024-02-17 16:03:36 +00:00
48d96c08f2 [dynamo][guards] Use EQUALS_MATCH for NAME_MATCH (#120132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120132
Approved by: https://github.com/jansel, https://github.com/yanboliang
2024-02-17 16:03:36 +00:00
cyy
a9953a5ef3 Remove unused c10/util/C++17.h inclusion and outdated checks (#120149)
This is a continued work to clean up pre-C++17 code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120149
Approved by: https://github.com/ezyang
2024-02-17 14:28:17 +00:00
fac598c4ae [inductor] allow padding mm/bmm/addmm in the presence of dynamic dims (#120073)
Previously, pad_mm skipped cases where any input tensor has a symbolic
dimension or stride. This is too restrictive in practice.
This PR enables the pass to pad non-symbolic dimensions in
the presence of dynamic dims. For example, with this PR, we could
pad the K dimension (i.e. 1921) for torch.mm(A[s0, 1921], B[2048, 1921]).
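
As an illustration of the idea (not Inductor's pad_mm implementation), padding only the static K dimension with zeros leaves the matmul result unchanged while the dynamic M dimension stays untouched:

```python
import torch
import torch.nn.functional as F

def pad_k(a: torch.Tensor, b: torch.Tensor, multiple: int = 8):
    # a: [M, K], b: [N, K]; pad the shared static K dim up to a multiple.
    k = a.shape[1]
    pad = (multiple - k % multiple) % multiple
    if pad:
        a = F.pad(a, (0, pad))  # zeros appended to K contribute nothing to the product
        b = F.pad(b, (0, pad))
    return a, b

A = torch.randn(5, 1921)      # 5 stands in for the dynamic dim s0
B = torch.randn(2048, 1921)
A_p, B_p = pad_k(A, B)
assert torch.allclose(A_p @ B_p.t(), A @ B.t(), atol=1e-5)
```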

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120073
Approved by: https://github.com/jansel
2024-02-17 12:22:20 +00:00
2f8a80ecb2 Fix skip for test_set_nccl_pg_timeout (#120130)
The test is failing on our internal CI with the error below:
```RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

The purpose of this test is NCCL-specific, so it doesn't make sense to run it in a single-GPU setting either.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120130
Approved by: https://github.com/wconstab, https://github.com/eqy
2024-02-17 07:36:14 +00:00
badf84bd6b [inductor] Add torch.cond support to JIT Inductor (#119759)
Summary: `torch.cond` is already supported in Dynamo and Export: the `true_fn` and `false_fn` subgraphs are traced as child fx graphs of the main graph and passed to the `torch.cond` higher-order operator in the fx graph. However, this breaks in Inductor, as the latter has no way of dealing with child fx subgraphs or of properly lowering and codegen-ing them.

In this PR, we add `torch.cond` support in Inductor. This is achieved by adding subgraph lowering and codegen-ing infrastructure as well as a new `Conditional` IR node type that weaves the parent graph together with the true and false child subgraphs.

Here we only implement `torch.cond` support in JIT Inductor (Python wrapper codegen). The implementation in AOT Inductor (C++ wrapper codegen), including ABI-compatibility mode, will follow.
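
A minimal usage sketch of the `torch.cond` higher-order operator under `torch.compile` (assuming the `torch.cond` Python entry point; the exact import surface may vary by version):

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile(backend="inductor")
def f(x):
    # The predicate is a data-dependent scalar tensor; both branches are
    # traced as subgraphs and, with this change, lowered by Inductor.
    return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))

print(f(torch.randn(4)))
```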

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 24 tests in 86.790s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119759
Approved by: https://github.com/jansel, https://github.com/eellison
2024-02-17 07:25:27 +00:00
30000aa3fd [c10d] remove one line of verbose log (#120138)
Summary:
I don't find existing DBG mode support in c10d. This line is flooding the log, so remove it to unblock users.
Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120138
Approved by: https://github.com/wconstab
2024-02-17 06:39:57 +00:00
fa0e39560c [AOTI] Fix a typo (#120094)
Differential Revision: [D53861810](https://our.internmc.facebook.com/intern/diff/D53861810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120094
Approved by: https://github.com/khabinov, https://github.com/sijiac
2024-02-17 05:28:58 +00:00
0a7471e0df [executorch hash update] update the pinned executorch hash (#120134)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120134
Approved by: https://github.com/pytorchbot
2024-02-17 05:00:35 +00:00
ac2ba7889d [export] turn on replace_set_grad_with_hop_pass in pre_dispatch (#119915)
This PR turns on replace_set_grad_with_hop_pass for pre_dispatch export. To do that, we need to propagate the metadata from the original submodule to the new higher-order op and fix the node names as required by the _sig_to_specs pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119915
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736, #119810, #119913, #119914
2024-02-17 02:18:35 +00:00
737630268c [export] manuually create test cases for split and inline (#119914)
This PR makes the tests for inline and sequential_split stop relying on set_grad_enabled calls being in the graph, because they'll be gone once we turn on replace_set_grad_with_hop_pass in the following diff. Instead, we manually insert them into the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119914
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736, #119810, #119913
2024-02-17 02:18:35 +00:00
8d81e61fb6 [export] make node_inline_ also inline the get_item calls (#119913)
As titled. Before this PR, after we split and then inline_, there would be getitem calls in the graph even though the original graph module doesn't have them. This PR removes the additional getitem calls by inlining.

Test Plan:
Added new test cases for graphs that return multiple outputs and takes multiple inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119913
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736, #119810
2024-02-17 02:18:27 +00:00
812f05d731 [export] add replace_set_grad_with_hop_pass (#119810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119810
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736
2024-02-17 02:18:19 +00:00
4769e6916a [export] add node_inline_ to prepare replacing set_grad_enabled with hop (#119736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119736
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732
2024-02-17 02:18:11 +00:00
068659ddc2 [export] add sequential_split to prepare replacing set_grad_enabled with hop (#119732)
This PR is 1/N in a series that transforms global-state-mutating ops, such as torch._C._set_grad_enabled calls, in the pre-dispatch graph into a higher-order op so that the graph becomes more functional. We make use of split_module to help us do the transformation.

This PR preserves the node.name of the original module by adding a new kwarg `keep_original_node_name` to split_module.

For a graph looks like this:
```python
def forward(self, arg_0):
    arg0_1, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
    add = torch.ops.aten.add.Tensor(arg0_1, 1);  arg0_1 = None
    sin = torch.ops.aten.sin.default(add);  add = None
    sum_1 = torch.ops.aten.sum.default(sin);  sin = None
    _set_grad_enabled = torch._C._set_grad_enabled(False)
    add_1 = torch.ops.aten.add.Tensor(sum_1, 1);  sum_1 = None
    _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
    sub = torch.ops.aten.sub.Tensor(add_1, 1)
    return pytree.tree_unflatten((add_1, sub), self._out_spec)
```
Before the change, split_module returns the following graph and subgraphs (notice the change from `add` -> `add_tensor`, `sin` -> `sin_default`):
```python
def forward(self, arg_0):
    arg0_1, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
    submod_0 = self.submod_0(arg0_1);  arg0_1 = None
    submod_1 = self.submod_1(submod_0);  submod_0 = None
    submod_2 = self.submod_2(submod_1)
    return pytree.tree_unflatten((submod_1, submod_2), self._out_spec)

# submod_0
def forward(self, arg0_1):
    add_tensor = torch.ops.aten.add.Tensor(arg0_1, 1);  arg0_1 = None
    sin_default = torch.ops.aten.sin.default(add_tensor);  add_tensor = None
    sum_default = torch.ops.aten.sum.default(sin_default);  sin_default = None
    return sum_default

# submod_1
def forward(self, sum_1):
    _set_grad_enabled = torch._C._set_grad_enabled(False)
    add_tensor = torch.ops.aten.add.Tensor(sum_1, 1);  sum_1 = None
    return add_tensor

# submod_2
def forward(self, add_1):
    _set_grad_enabled = torch._C._set_grad_enabled(True)
    sub_tensor = torch.ops.aten.sub.Tensor(add_1, 1);  add_1 = None
    return sub_tensor
    """)

```

After the change, the test produces the following graphs; all the node names in the original graph module are preserved in the submodules.
```python

def forward(self, arg_0):
    sub, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
    submod_0 = self.submod_0(sub);  sub = None
    submod_1 = self.submod_1(submod_0);  submod_0 = None
    submod_2 = self.submod_2(submod_1)
    return pytree.tree_unflatten((submod_1, submod_2), self._out_spec)

# submod_0
def forward(self, arg0_1):
    add = torch.ops.aten.add.Tensor(arg0_1, 1);  arg0_1 = None
    sin = torch.ops.aten.sin.default(add);  add = None
    sum_1 = torch.ops.aten.sum.default(sin);  sin = None
    return sum_1

# submod_1
def forward(self, sum_1):
    _set_grad_enabled = torch._C._set_grad_enabled(False)
    add_1 = torch.ops.aten.add.Tensor(sum_1, 1);  sum_1 = None
    return add_1

# submod_2
def forward(self, add_1):
    _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
    sub = torch.ops.aten.sub.Tensor(add_1, 1);  add_1 = None
    return sub

```

Note that currently, we call split_module on the graph after pre-dispatch AOT. The difference is even larger if we `split_module` the graph module produced by dynamo, where all the original variable names in the user program are preserved after dynamo but lost after `split_module` without this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119732
Approved by: https://github.com/tugsbayasgalan
2024-02-17 02:18:04 +00:00
becfda005e tiny improvement to the cprofile wrapper (#120100)
1. Right now we double-increment the profile counter; the PR avoids that so we don't end up with profile_0, profile_2, profile_4, ...
2. Log the latency of running the passed-in function with profiling on, so we can easily skip those _compile calls which return quickly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120100
Approved by: https://github.com/eellison
2024-02-17 02:10:25 +00:00
36e118b810 [inductor] logging meta data for inductor generated triton kernel (#120048)
I want to log metadata for inductor-generated triton kernels for a couple of purposes:
1. With this metadata, it should be convenient to find unaligned reduction kernels and try the idea here https://github.com/pytorch/pytorch/issues/119929 . I think it's nice to try it on kernels that are used in real models.
2. Based on the collected kernel metadata, I can build a simple offline tool that benchmarks each kernel with ncu and augments each kernel's metadata with: latency, theoretical membw (estimated memory access / latency), and actually achieved membw. Hopefully this can point us to some good optimization opportunities.

Command:
```
TORCHINDUCTOR_CACHE_DIR=`realpath ~/inductor-caches/kernel-metadata-log` TORCHINDUCTOR_ENABLED_METRIC_TABLES=kernel_metadata TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 time python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --training
```

The best practice here is to point the inductor cache to a folder outside of /tmp so that one can always run a kernel again from the path stored in its metadata (folders under /tmp may get removed by the system).

Here is first 1000 rows of collected metadata for huggingface: https://gist.github.com/shunting314/cf4ebdaaaa7e852efcaa93524c868e5f

And here is the total 10K kernels collected for huggingface. The gist can not be rendered as a csv since it's too large: https://gist.github.com/shunting314/7f841528e2debdc2ae05dece4ac591be .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120048
Approved by: https://github.com/jansel
2024-02-17 02:09:27 +00:00
24968ff042 Add quantized gelu (#119935)
Summary: Added quantized gelu for the Vulkan backend.

Test Plan:
**Tested it on "On Demand RL FBSource"**

LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_quantized_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="VulkanAPITest.gelu_q*"

----------------------------------------------------------------------------------

Note: Google Test filter = VulkanAPITest.gelu_q*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.gelu_qint8
[       OK ] VulkanAPITest.gelu_qint8 (318 ms)
[ RUN      ] VulkanAPITest.gelu_qint8_self
[       OK ] VulkanAPITest.gelu_qint8_self (214 ms)
[ RUN      ] VulkanAPITest.gelu_quint8
[       OK ] VulkanAPITest.gelu_quint8 (152 ms)
[ RUN      ] VulkanAPITest.gelu_quint8_self
[       OK ] VulkanAPITest.gelu_quint8_self (142 ms)
[----------] 4 tests from VulkanAPITest (828 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (828 ms total)
[  PASSED  ] 4 tests.

Differential Revision: D52985437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119935
Approved by: https://github.com/jorgep31415
2024-02-17 01:17:25 +00:00
7973ac586d [Memory Snapshot] Add CUDAAllocatorConfig details into snapshot metadata (#119404)
Summary:
Include the CUDAAllocatorConfig at the time of the snapshot in the snapshot file. This includes adding the following variables:

```
  double garbage_collection_threshold;
  size_t max_split_size;
  size_t pinned_num_register_threads;
  bool expandable_segments;
  bool release_lock_on_cudamalloc;
  bool pinned_use_cuda_host_register;
  std::string last_allocator_settings;
  std::vector<size_t> roundup_power2_divisions;
```

Test Plan:
`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True ` produces
```
{'PYTORCH_CUDA_ALLOC_CONF': 'expandable_segments:True',
 'max_split_size': -1,
 'garbage_collection_threshold': 0.0,
 'expandable_segments': True,
 'pinned_num_register_threads': 1,
 'release_lock_on_cudamalloc': False,
 'pinned_use_cuda_host_register': False,
 'roundup_power2_divisions': {'1': 0,
  '2': 0,
  '4': 0,
  '8': 0,
  '16': 0,
  '32': 0,
  '64': 0,
  '128': 0,
  '256': 0,
  '512': 0,
  '1024': 0,
  '2048': 0,
  '4096': 0,
  '8192': 0,
  '16384': 0,
  '32768': 0}}
```
`PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]"` produces
```
{'PYTORCH_CUDA_ALLOC_CONF': 'max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]',
 'max_split_size': 2097152000,
 'garbage_collection_threshold': 0.0,
 'expandable_segments': False,
 'pinned_num_register_threads': 1,
 'release_lock_on_cudamalloc': False,
 'pinned_use_cuda_host_register': False,
 'roundup_power2_divisions': {'1': 1, '2': 1, '4': 1, '8': 1, '16': 1, '32': 1, '64': 1, '128': 1, '256': 1, '512': 2, '1024': 8, '2048': 8, '4096': 8, '8192': 8, '16384': 8, '32768': 8}
}
```

Differential Revision: D53536199

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119404
Approved by: https://github.com/zdevito
2024-02-17 01:16:37 +00:00
9aa8bbf7f2 [BE] Delete C10_IS_TRIVIALLY_COPYABLE (#120120)
It's not used anywhere in PyTorch after the custom implementation of `c10::optional` was removed, and it's not used elsewhere either, see https://github.com/search?type=code&q=C10_IS_TRIVIALLY_COPYABLE+org%3Apytorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120120
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/huydhn
2024-02-17 01:04:30 +00:00
79569d117d Add hpu device support in storage/resize (#119761)
Add hpu device support:
 - in the storage method resize_
 - in is_supported_device for FSDP
 - general hpu device support in storage

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119761
Approved by: https://github.com/mikaylagawarecki
2024-02-17 01:04:27 +00:00
6b63d3bac9 [ONNX][dynamo_export] Adjust to new symbolic shape name format in value_info (#119855)
Bump onnxscript in CI and adjust the test case expectation of the experimental exported shape naming format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119855
Approved by: https://github.com/thiagocrepaldi
2024-02-17 00:51:19 +00:00
cyy
e61c8ef3aa Simplify c10::is_pod implementation and remove unneeded inclusion of C++17.h (#118212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118212
Approved by: https://github.com/albanD
2024-02-17 00:14:09 +00:00
cyy
6952d6ddad [structural binding][4/N] Replace std::tie with structural binding (#120039)
This PR follows https://github.com/pytorch/pytorch/pull/119774, it is a continued work to clean up std::tie

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120039
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-02-17 00:05:58 +00:00
761fa5d6ec Add FakeTensor support to torch._utils._rebuild_tensor (#108186)
There are two scenarios:

* Scenario 1: The checkpoint was saved with pytorch < 1.6
* Scenario 2: The checkpoint was saved with pytorch >= 1.6

Repro Scenario 1:

```python
from torch._subclasses import fake_tensor
import transformers

fake_mode = fake_tensor.FakeTensorMode()
with fake_mode:
    fake_model = transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2")
```

Error:

```bash
Some weights of the model checkpoint at sshleifer/tiny-gpt2 were not used when initializing GPT2Model: ['lm_head.weight']
- This IS expected if you are initializing GPT2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:463 in           │
│ load_state_dict                                                                                  │
│                                                                                                  │
│    460 │   │   │   )                                                                             │
│    461 │   │   return safe_load_file(checkpoint_file)                                            │
│    462 │   try:                                                                                  │
│ ❱  463 │   │   return torch.load(checkpoint_file, map_location="cpu")                            │
│    464 │   except Exception as e:                                                                │
│    465 │   │   try:                                                                              │
│    466 │   │   │   with open(checkpoint_file) as f:                                              │
│                                                                                                  │
│ /opt/pytorch/torch/serialization.py:1030 in load                                                 │
│                                                                                                  │
│   1027 │   │   │   │   return _legacy_load(opened_file, map_location, _weights_only_unpickler,   │
│   1028 │   │   │   except RuntimeError as e:                                                     │
│   1029 │   │   │   │   raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None           │
│ ❱ 1030 │   │   return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args  │
│   1031                                                                                           │
│   1032                                                                                           │
│   1033 # Register pickling support for layout instances such as                                  │
│                                                                                                  │
│ /opt/pytorch/torch/serialization.py:1258 in _legacy_load                                         │
│                                                                                                  │
│   1255 │   _sys_info = pickle_module.load(f, **pickle_load_args)                                 │
│   1256 │   unpickler = UnpicklerWrapper(f, **pickle_load_args)                                   │
│   1257 │   unpickler.persistent_load = persistent_load                                           │
│ ❱ 1258 │   result = unpickler.load()                                                             │
│   1259 │                                                                                         │
│   1260 │   deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)                 │
│   1261                                                                                           │
│                                                                                                  │
│ /opt/pytorch/torch/_utils.py:201 in _rebuild_tensor_v2                                           │
│                                                                                                  │
│   198 def _rebuild_tensor_v2(                                                                    │
│   199 │   storage, storage_offset, size, stride, requires_grad, backward_hooks, metadata=None    │
│   200 ):                                                                                         │
│ ❱ 201 │   tensor = _rebuild_tensor(storage, storage_offset, size, stride)                        │
│   202 │   tensor.requires_grad = requires_grad                                                   │
│   203 │   if metadata:                                                                           │
│   204 │   │   set_tensor_metadata(tensor, metadata)                                              │
│                                                                                                  │
│ /opt/pytorch/torch/_utils.py:180 in _rebuild_tensor                                              │
│                                                                                                  │
│   177 def _rebuild_tensor(storage, storage_offset, size, stride):                                │
│   178 │   # first construct a tensor with the correct dtype/device                               │
│   179 │   t = torch.tensor([], dtype=storage.dtype, device=storage._untyped_storage.device)      │
│ ❱ 180 │   return t.set_(storage._untyped_storage, storage_offset, size, stride)                  │
│   181                                                                                            │
│   182                                                                                            │
│   183 def get_tensor_metadata(tensor):                                                           │
│                                                                                                  │
│ /opt/pytorch/torch/utils/_stats.py:20 in wrapper                                                 │
│                                                                                                  │
│   17 │   │   if fn.__qualname__ not in simple_call_counter:                                      │
│   18 │   │   │   simple_call_counter[fn.__qualname__] = 0                                        │
│   19 │   │   simple_call_counter[fn.__qualname__] = simple_call_counter[fn.__qualname__] + 1     │
│ ❱ 20 │   │   return fn(*args, **kwargs)                                                          │
│   21 │   return wrapper                                                                          │
│   22                                                                                             │
│                                                                                                  │
│ /opt/pytorch/torch/_subclasses/fake_tensor.py:1160 in __torch_dispatch__                         │
│                                                                                                  │
│   1157 │   def __torch_dispatch__(self, func, types, args=(), kwargs=None):                      │
│   1158 │   │   assert self not in _get_current_dispatch_mode_stack(), func                       │
│   1159 │   │   try:                                                                              │
│ ❱ 1160 │   │   │   return self.dispatch(func, types, args, kwargs)                               │
│   1161 │   │   except TypeError:                                                                 │
│   1162 │   │   │   log.exception("fake tensor raised TypeError")                                 │
│   1163 │   │   │   raise                                                                         │
│                                                                                                  │
│ /opt/pytorch/torch/_subclasses/fake_tensor.py:1318 in dispatch                                   │
│                                                                                                  │
│   1315 │   │                                                                                     │
│   1316 │   │   # we are falling through to running non constant tensors, any input constant tha  │
│   1317 │   │   # is written to must be invalidated                                               │
│ ❱ 1318 │   │   self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)   │
│   1319 │   │                                                                                     │
│   1320 │   │   # Try for fastpath                                                                │
│   1321 │   │   if has_symbolic_sizes:                                                            │
│                                                                                                  │
│ /opt/pytorch/torch/_subclasses/fake_tensor.py:1557 in invalidate_written_to_constants            │
│                                                                                                  │
│   1554 │   │   any_constant = any(e.constant is not None for e in flat_arg_fake_tensors)         │
│   1555 │   │   if any_constant and get_schema_info(func).is_mutable():                           │
│   1556 │   │   │   schema_info = get_schema_info(func)                                           │
│ ❱ 1557 │   │   │   _, new_kwargs = normalize_function(                                           │
│   1558 │   │   │   │   func, args=args, kwargs=kwargs, normalize_to_only_use_kwargs=True         │
│   1559 │   │   │   )                                                                             │
│   1560 │   │   │   for k, v in new_kwargs.items():                                               │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:297 in normalize_function                              │
│                                                                                                  │
│   294 │   │   new_args_and_kwargs = _args_kwargs_to_normalized_args_kwargs(sig, args, kwargs,    │
│   295 │   else:                                                                                  │
│   296 │   │   assert callable(target)                                                            │
│ ❱ 297 │   │   torch_op_schemas = get_signature_for_torch_op(target)                              │
│   298 │   │   matched_schemas = []                                                               │
│   299 │   │   if torch_op_schemas:                                                               │
│   300 │   │   │   # Iterate through all of the schema until we find one that matches             │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:167 in get_signature_for_torch_op                      │
│                                                                                                  │
│   164 │   │   │   return (None, None) if return_schemas else None                                │
│   165 │   │   schemas = torch._C._jit_get_schemas_for_operator(aten_fn)                          │
│   166 │                                                                                          │
│ ❱ 167 │   signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]          │
│   168 │   return (signatures, schemas) if return_schemas else signatures                         │
│   169                                                                                            │
│   170 @compatibility(is_backward_compatible=False)                                               │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:167 in <listcomp>                                      │
│                                                                                                  │
│   164 │   │   │   return (None, None) if return_schemas else None                                │
│   165 │   │   schemas = torch._C._jit_get_schemas_for_operator(aten_fn)                          │
│   166 │                                                                                          │
│ ❱ 167 │   signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]          │
│   168 │   return (signatures, schemas) if return_schemas else signatures                         │
│   169                                                                                            │
│   170 @compatibility(is_backward_compatible=False)                                               │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:70 in _torchscript_schema_to_signature                 │
│                                                                                                  │
│    67 │   from inspect import Parameter                                                          │
│    68 │   parameters : List[Parameter] = []                                                      │
│    69 │   for arg in ts_schema.arguments:                                                        │
│ ❱  70 │   │   arg_type = _torchscript_type_to_python_type(arg.type)                              │
│    71 │   │   default = arg.default_value if arg.has_default_value() else Parameter.empty        │
│    72 │   │   # TODO: Figure out if this is safe. It seems like when generating the type signa   │
│    73 │   │   # PythonArgParser, we emit signatures with `input` instead of `self` as the firs   │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:64 in _torchscript_type_to_python_type                 │
│                                                                                                  │
│    61 │   eval'ing the annotation_str. _type_eval_globals sets up expressions                    │
│    62 │   like "List" and "Future" to map to actual types (typing.List and jit.Future)           │
│    63 │   """                                                                                    │
│ ❱  64 │   return eval(ts_type.annotation_str, _type_eval_globals)                                │
│    65                                                                                            │
│    66 def _torchscript_schema_to_signature(ts_schema : torch._C.FunctionSchema) -> inspect.Sig   │
│    67 │   from inspect import Parameter                                                          │
│ <string>:1 in <module>                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
NameError: name 'Storage' is not defined

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:467 in           │
│ load_state_dict                                                                                  │
│                                                                                                  │
│    464 │   except Exception as e:                                                                │
│    465 │   │   try:                                                                              │
│    466 │   │   │   with open(checkpoint_file) as f:                                              │
│ ❱  467 │   │   │   │   if f.read(7) == "version":                                                │
│    468 │   │   │   │   │   raise OSError(                                                        │
│    469 │   │   │   │   │   │   "You seem to have cloned a repository without having git-lfs ins  │
│    470 │   │   │   │   │   │   "git-lfs and run `git lfs install` followed by `git lfs pull` in  │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/codecs.py:322 in decode                                       │
│                                                                                                  │
│    319 │   def decode(self, input, final=False):                                                 │
│    320 │   │   # decode input (taking the buffer into account)                                   │
│    321 │   │   data = self.buffer + input                                                        │
│ ❱  322 │   │   (result, consumed) = self._buffer_decode(data, self.errors, final)                │
│    323 │   │   # keep undecoded input until the next call                                        │
│    324 │   │   self.buffer = data[consumed:]                                                     │
│    325 │   │   return result                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/pytorch/bug_repro.py:16 in <module>                                                         │
│                                                                                                  │
│   13 fake_model = transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2")                  │
│   14 assert fake_model is not None                                                               │
│   15 with fake_mode:                                                                             │
│ ❱ 16 │   fake_model = transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2")  # raises    │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py:484 in │
│ from_pretrained                                                                                  │
│                                                                                                  │
│   481 │   │   │   )                                                                              │
│   482 │   │   elif type(config) in cls._model_mapping.keys():                                    │
│   483 │   │   │   model_class = _get_model_class(config, cls._model_mapping)                     │
│ ❱ 484 │   │   │   return model_class.from_pretrained(                                            │
│   485 │   │   │   │   pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs,   │
│   486 │   │   │   )                                                                              │
│   487 │   │   raise ValueError(                                                                  │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:2604 in          │
│ from_pretrained                                                                                  │
│                                                                                                  │
│   2601 │   │   if from_pt:                                                                       │
│   2602 │   │   │   if not is_sharded and state_dict is None:                                     │
│   2603 │   │   │   │   # Time to load the checkpoint                                             │
│ ❱ 2604 │   │   │   │   state_dict = load_state_dict(resolved_archive_file)                       │
│   2605 │   │   │                                                                                 │
│   2606 │   │   │   # set dtype to instantiate the model under:                                   │
│   2607 │   │   │   # 1. If torch_dtype is not None, we use that dtype                            │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:479 in           │
│ load_state_dict                                                                                  │
│                                                                                                  │
│    476 │   │   │   │   │   │   "model. Make sure you have saved the model properly."             │
│    477 │   │   │   │   │   ) from e                                                              │
│    478 │   │   except (UnicodeDecodeError, ValueError):                                          │
│ ❱  479 │   │   │   raise OSError(                                                                │
│    480 │   │   │   │   f"Unable to load weights from pytorch checkpoint file for '{checkpoint_f  │
│    481 │   │   │   │   f"at '{checkpoint_file}'. "                                               │
│    482 │   │   │   │   "If you tried to load a PyTorch model from a TF 2.0 checkpoint, please s  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OSError: Unable to load weights from pytorch checkpoint file for '/root/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be/pytorch_model.bin' at
'/root/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set
from_tf=True.
```

Repro scenario 2:

```python
import tempfile
import torch
from torch._subclasses import fake_tensor

class TheModelClass(torch.nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.fc1 = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.fc1(x)

with tempfile.NamedTemporaryFile() as state_dict_file:
    # Create state_dict to be loaded later
    model = TheModelClass()
    torch.save(model.state_dict(), state_dict_file.name)

    fake_mode = fake_tensor.FakeTensorMode()
    with fake_mode:
        # This is where the bug is triggered
        state_dict = torch.load(state_dict_file.name)
```

Error:

```bash
Traceback (most recent call last):
  File "issue_gh_torch_105077.py", line 22, in <module>
    state_dict = torch.load(state_dict_file.name)
  File "/opt/pytorch/torch/serialization.py", line 1014, in load
    return _load(opened_zipfile,
  File "/opt/pytorch/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/opt/pytorch/torch/_utils.py", line 205, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/_utils.py", line 184, in _rebuild_tensor
    return t.set_(storage._untyped_storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1288, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1468, in dispatch
    self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1733, in invalidate_written_to_constants
    _, new_kwargs = normalize_function(
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 297, in normalize_function
    torch_op_schemas = get_signature_for_torch_op(target)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in get_signature_for_torch_op
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in <listcomp>
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 70, in _torchscript_schema_to_signature
    arg_type = _torchscript_type_to_python_type(arg.type)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 64, in _torchscript_type_to_python_type
    return eval(ts_type.annotation_str, _type_eval_globals)
  File "<string>", line 1, in <module>
NameError: name 'Storage' is not defined
```

This PR adds the ability to create fake tensors during torch.load (when fake mode is active) by changing the storage's device to 'meta'.
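
A minimal sketch of the intended post-fix behavior, based on the scenario-2 repro above (an assumption from the description, not a verbatim test from this PR):

```python
import tempfile
import torch
from torch._subclasses import fake_tensor

model = torch.nn.Linear(5, 10)
with tempfile.NamedTemporaryFile() as f:
    torch.save(model.state_dict(), f.name)
    with fake_tensor.FakeTensorMode():
        # With this PR the load no longer raises; the rebuilt tensors are
        # fake tensors backed by meta storage instead of real data.
        state_dict = torch.load(f.name)
        print(type(state_dict["weight"]).__name__, state_dict["weight"].shape)
```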

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108186
Approved by: https://github.com/ezyang, https://github.com/atalman
2024-02-16 23:42:50 +00:00
7ad4ab4765 Remove unused import (#120004)
Summary: Title

Test Plan: CI

Differential Revision: D53820298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120004
Approved by: https://github.com/zhxchen17, https://github.com/Skylion007
2024-02-16 22:00:44 +00:00
7b1f5c874f [PT2][Optimus][Observability] Log the optimus graph transformation to the scuba (#119745)
Summary: The current everstore upload logging may cause excessive compilation time when the model has lots of graph breaks (post: https://fb.workplace.com/groups/257735836456307/permalink/633533465543207/), so here we log the transformation only when the graph has changed.

Test Plan:
timeout flows:
f528209775
f530084719

Differential Revision: D53692344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119745
Approved by: https://github.com/jackiexu1992
2024-02-16 21:32:04 +00:00
006eead7d2 [dynamo][functional_collectives] Add all_to_all_single, all_gather_list, reduce_scatter_list to dynamo remapping (#119683)
Differential Revision: [D53758434](https://our.internmc.facebook.com/intern/diff/D53758434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119683
Approved by: https://github.com/ezyang
2024-02-16 21:28:39 +00:00
4f4629d522 [Dynamo] Fix ListIteratorVariable repr to avoid log flooding (#120053)
This issue was found in a Meta-internal use case.
Before:
```
V0215 18:33:41.761000 140489262883968 torch/_dynamo/symbolic_convert.py:682 [0/0] TRACE starts_line /data/users/ybliang/debug/debug4.py:11 in <listcomp> (f) (inline depth: 1)
V0215 18:33:41.761000 140489262883968 torch/_dynamo/symbolic_convert.py:682 [0/0]         a = [sum(x) for x in result]
V0215 18:33:41.761000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE BUILD_LIST 0 []
V0215 18:33:41.761000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST .0 [ListVariable()]
V0215 18:33:41.762000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable([LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=0)]
V0215 18:33:41.762000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable([LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1), LazyVariableTracker()]
V0215 18:33:41.762000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1)]
V0215 18:33:41.763000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1), BuiltinVariable(sum)]
V0215 18:33:41.763000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1), BuiltinVariable(sum), ListVariable()]
V0215 18:33:41.764000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1), ConstantVariable(int: 50)]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE JUMP_ABSOLUTE 4 [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1)]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1)]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2), LazyVariableTracker()]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2)]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2), BuiltinVariable(sum)]
V0215 18:33:41.766000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2), BuiltinVariable(sum), ListVariable()]
V0215 18:33:41.766000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2), ConstantVariable(int: 68)]
V0215 18:33:41.767000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE JUMP_ABSOLUTE 4 [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2)]
```
After:
```
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:682 [0/0] TRACE starts_line /data/users/ybliang/debug/debug4.py:11 in <listcomp> (f) (inline depth: 1)
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:682 [0/0]         a = [sum(x) for x in result]
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE BUILD_LIST 0 []
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST .0 [ListVariable()]
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable(length=10, index=0)]
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable(length=10, index=1), LazyVariableTracker()]
V0215 18:27:57.902000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable(length=10, index=1)]
V0215 18:27:57.902000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable(length=10, index=1), BuiltinVariable(sum)]
V0215 18:27:57.903000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable(length=10, index=1), BuiltinVariable(sum), ListVariable()]
V0215 18:27:57.903000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable(length=10, index=1), ConstantVariable(int: 55)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE JUMP_ABSOLUTE 4 [ListVariable(), ListIteratorVariable(length=10, index=1)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable(length=10, index=1)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable(length=10, index=2), LazyVariableTracker()]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable(length=10, index=2)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable(length=10, index=2), BuiltinVariable(sum)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable(length=10, index=2), BuiltinVariable(sum), ListVariable()]
V0215 18:27:57.905000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable(length=10, index=2), ConstantVariable(int: 64)]
V0215 18:27:57.905000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE JUMP_ABSOLUTE 4 [ListVariable(), ListIteratorVariable(length=10, index=2)]
V0215 18:27:57.905000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable(length=10, index=2)]
V0215 18:27:57.905000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable(length=10, index=3), LazyVariableTracker()]
V0215 18:27:57.906000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable(length=10, index=3)]
V0215 18:27:57.906000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable(length=10, index=3), BuiltinVariable(sum)]
V0215 18:27:57.906000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable(length=10, index=3), BuiltinVariable(sum), ListVariable()]
V0215 18:27:57.907000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable(length=10, index=3), ConstantVariable(int: 56)]
```
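A minimal sketch (an assumed shape, not the actual Dynamo source) of the summarized `__repr__` that produces the compact form above: only the length and current index are printed, instead of every contained `VariableTracker`.

```python
class ListIteratorVariable:  # illustrative stand-in for the Dynamo class
    def __init__(self, items, index=0):
        self.items = items
        self.index = index

    def __repr__(self):
        # Summarize instead of recursing into every item, keeping TRACE lines short.
        return f"{type(self).__name__}(length={len(self.items)}, index={self.index})"

print(ListIteratorVariable([object()] * 10, index=1))
# ListIteratorVariable(length=10, index=1)
```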

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120053
Approved by: https://github.com/williamwen42
2024-02-16 21:19:37 +00:00
26343451be DTensor: make tensor_flatten more compatible for dynamo getattr (#118209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118209
Approved by: https://github.com/ezyang, https://github.com/wanchaol
ghstack dependencies: #117667, #117666
2024-02-16 21:16:07 +00:00
ee7bcf23db dynamo: support attribute access on tensor subclasses without sources (#117666)
Fixes https://github.com/pytorch/pytorch/issues/117596

This was needed for Float8Tensor. Before this PR, dynamo would sometimes handle attribute access on tensor subclasses correctly, but it would choke on tensor subclasses with no source (it would fall back to using a `GetAttrVariable` to represent the attribute access, which is a problem if the attribute is a tensor that we later want to call tensor methods on).

I supported two cases:

(1) the attribute is a tensor, which is part of the `attrs` returned by the subclass's `__tensor_flatten__`. This creates a `TensorVariable`
(2) the attribute is a constant, which is part of the constant metadata returned by `__tensor_flatten__`. As per the contract of tensor_flatten, this should be a `ConstantVariable`. It could be possible that we allow non-constant metadata in the future, but we don't support that today.
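A minimal sketch of the `__tensor_flatten__` contract the two cases above rely on (a hypothetical subclass written for illustration, not code from this PR):

```python
import torch

class TwoTensor(torch.Tensor):  # hypothetical subclass for illustration only
    @staticmethod
    def __new__(cls, a, b, scale):
        return torch.Tensor._make_wrapper_subclass(cls, a.shape, dtype=a.dtype, device=a.device)

    def __init__(self, a, b, scale):
        self.a = a          # tensor attribute -> case (1), traced as a TensorVariable
        self.b = b          # tensor attribute -> case (1)
        self.scale = scale  # constant metadata -> case (2), traced as a ConstantVariable

    def __tensor_flatten__(self):
        # First element names the inner tensors; second element is constant metadata.
        return ["a", "b"], {"scale": self.scale}

    @staticmethod
    def __tensor_unflatten__(inner_tensors, meta, outer_size, outer_stride):
        return TwoTensor(inner_tensors["a"], inner_tensors["b"], meta["scale"])

x = TwoTensor(torch.randn(2), torch.randn(2), scale=0.5)
print(x.a.shape, x.scale)  # attribute access of both kinds
```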

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117666
Approved by: https://github.com/zou3519
ghstack dependencies: #117667
2024-02-16 21:16:07 +00:00
67f6aca0d0 dynamo: respect autograd.Function + multiple save_for_backward calls (#117667)
Fixes https://github.com/pytorch/pytorch/issues/117652. Corner case that I hit debugging some Float8 issues.
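A hedged sketch of the corner case (an assumed repro shape, not the test from the issue): an `autograd.Function` that calls `save_for_backward` more than once in `forward`. In eager, the last call wins; Dynamo should now trace this the same way.

```python
import torch

class MulTwice(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x)     # first call
        ctx.save_for_backward(x, y)  # second call replaces the first set of saved tensors
        return x * y

    @staticmethod
    def backward(ctx, grad_out):
        x, y = ctx.saved_tensors
        return grad_out * y, grad_out * x

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
MulTwice.apply(a, b).sum().backward()
```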

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117667
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-02-16 21:16:07 +00:00
4ac857f94e Support broadcast in native funcol (#119229)
### Summary

@LucasLLC recently implemented `broadcast` in funcol. This is not yet available in the native funcol ops. This PR adds support for broadcast for native funcol.

- Added `_c10d_functional::broadcast` and `_c10d_functional::broadcast_`
- Integrated with the python funcol broadcast and `AsyncCollectiveTensor` (see the usage sketch after this list)
- Implemented Inductor lowering. Verified correctness and buffer reuse behavior
- Validated dynamo traceability
- Validated AOTInductor compile-ability
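A hedged usage sketch of the python-side entry point the new native ops back (argument names here are assumptions, not taken from this PR):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def bcast_from_rank0(t: torch.Tensor) -> torch.Tensor:
    # Returns an AsyncCollectiveTensor; it synchronizes when the result is first used.
    return funcol.broadcast(t, src=0, group=dist.group.WORLD)
```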

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119229
Approved by: https://github.com/wanchaol
ghstack dependencies: #119104
2024-02-16 21:01:34 +00:00
24d5caba6e [EZ] Fix argument parsing in build_with_debinfo (#120088)
`nargs="?"` accepts 0 or 1 argument, but `nargs="*"` accepts 0 or any number of arguments, which is the intended behavior of the tool
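A minimal illustration of the argparse difference (not the tool's actual parser):

```python
import argparse

p = argparse.ArgumentParser()
p.add_argument("files", nargs="*")  # with nargs="?" only a single file would be accepted
print(p.parse_args(["a.cpp", "b.cpp"]).files)  # ['a.cpp', 'b.cpp']
```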

Test plan: Run `python tools/build_with_debinfo.py aten/src/ATen/native/cpu/BlasKernel.cpp aten/src/ATen/native/BlasKernel.cpp` and observe that it generates torch_cpu with those two files containing debug information

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120088
Approved by: https://github.com/Skylion007
2024-02-16 20:06:52 +00:00
2d4aa91a10 Fix searchsorted function signature in docs (#120086)
Side should be optional string, to match definition in native_functions: fbe8e0f92d/aten/src/ATen/native/native_functions.yaml (L11246)

Fixes https://github.com/pytorch/pytorch/issues/119999

Test plan: https://docs-preview.pytorch.org/pytorch/pytorch/120086/generated/torch.searchsorted.html#torch-searchsorted

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120086
Approved by: https://github.com/lezcano
2024-02-16 20:00:04 +00:00
288d1f3698 [Optim][Rprop] Replace new().resize_as_() by torch.full_like() (#119978)
As titled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119978
Approved by: https://github.com/janeyx99
2024-02-16 19:54:04 +00:00
6ea4480818 [quant][pt2e] Add model_is_exported util function (#119726)
Summary: This commit adds the `model_is_exported` util function
for users to be able to easily tell what APIs to call to move
their models between train and eval modes. This has the
additional advantage of hiding the implementation of how we
detect a model is exported, in case the metadata format changes
in the future.
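A hedged usage sketch of the intended workflow (the import path and the eval helper shown here are assumptions for illustration):

```python
import torch
from torch.ao.quantization.pt2e.export_utils import model_is_exported  # assumed location

def set_eval(model: torch.nn.Module) -> None:
    if model_is_exported(model):
        # Exported models need the dedicated helper instead of plain .eval()
        torch.ao.quantization.move_exported_model_to_eval(model)
    else:
        model.eval()
```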

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_model_is_exported

Differential Revision: [D53812972](https://our.internmc.facebook.com/intern/diff/D53812972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119726
Approved by: https://github.com/tugsbayasgalan, https://github.com/albanD
2024-02-16 19:29:36 +00:00
312ce35c1f Rename singleton int to nested int (#119661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119661
Approved by: https://github.com/ezyang
2024-02-16 19:21:17 +00:00
b97fa6ac30 Make roll a decomposition and remove its lowering (#119857)
We use the fact that we now propagate indexing properly to avoid having
to maintain two different implementations of the op. Doing this we also remove
a spurious guard on this op.

We move the ref into a decomp as we now use advanced indexing.
The only difference in the implementation is that we now use
advanced indexing rather than `torch.cat`.

We also remove it from core. Let's see how this goes.
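For intuition, a hedged 1-D sketch of rolling via advanced indexing (not the exact decomposition added here):

```python
import torch

def roll_1d(x: torch.Tensor, shift: int) -> torch.Tensor:
    n = x.size(0)
    idx = (torch.arange(n, device=x.device) - shift) % n  # shifted gather indices
    return x[idx]

assert torch.equal(roll_1d(torch.arange(5), 2), torch.roll(torch.arange(5), 2))
```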

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119857
Approved by: https://github.com/peterbell10, https://github.com/larryliu0820
ghstack dependencies: #119863, #119864
2024-02-16 19:14:39 +00:00
8b02d64197 Correct index propagation for % (#119864)
The previous index propagation transformed `%` into `fmod`, which was incorrect. We now perform the index propagation only in the most common case, where it is correct to do so.
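A small worked example of why `%` cannot be rewritten as `fmod` in general (they disagree for negative operands):

```python
import torch

a, b = torch.tensor([-7]), torch.tensor([3])
print(torch.remainder(a, b))  # tensor([2])  -- Python-style %, sign follows the divisor
print(torch.fmod(a, b))       # tensor([-1]) -- sign follows the dividend
```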

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119864
Approved by: https://github.com/peterbell10
ghstack dependencies: #119863
2024-02-16 19:14:39 +00:00
00524970e8 Simplify indexing when doing ModularIndexing + index propagation. (#119863)
We now avoid creating an unnecessary ternary operator in a reasonably common case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119863
Approved by: https://github.com/peterbell10
2024-02-16 19:14:39 +00:00
86dedebeaf Revert "Add pixel_shuffle to core aten decomps (#119899)"
This reverts commit 9201d7335a25d9a91e10c1914c399419af0bd7c3.

Reverted https://github.com/pytorch/pytorch/pull/119899 on behalf of https://github.com/huydhn due to Sorry for reverting your change but keep the diff D53766709 around while investigating the failed tests is not a good practice and could lead to out of sync issue, so it is better to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/119899#issuecomment-1948970686))
2024-02-16 17:44:59 +00:00
b10ae9e54c [pytree] Properly register immutable collections (#120036)
Summary:
Getting error like:
```
No registered serialization name for <class 'torch.fx.immutable_collections.immutable_dict'> found. Please update your _register_pytree_node call with a `serialized_type_name` kwarg.
```
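A hedged sketch of the kind of registration the error message asks for (the helper signature is assumed from the message above, not copied from this PR):

```python
import torch.utils._pytree as pytree
from torch.fx.immutable_collections import immutable_dict

pytree._register_pytree_node(
    immutable_dict,
    lambda d: (list(d.values()), list(d.keys())),            # flatten: (children, context)
    lambda values, keys: immutable_dict(zip(keys, values)),  # unflatten(children, context)
    serialized_type_name="torch.fx.immutable_collections.immutable_dict",
)
```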

Reviewed By: suo

Differential Revision: D53833323

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120036
Approved by: https://github.com/SherlockNoMad
2024-02-16 17:39:12 +00:00
124c251510 Guarantee init cuda before attaching hooks (#120052)
Summary: If cuda is not initialized before calling attachAllocatorTraceTracker, then the CudaCachingAllocator device_allocator is empty which means that the registration hooks are not setup. This means that a new segment_alloc will not be registered causing an expensive dynamic registration each time the segment is used. The fix is to guarantee that cuda is initialized before attaching the hooks. If cuda is already initialized, then this lazyInitCUDA is a no-op.

Test Plan:
Testing this on fsdp+tp example model where cuda is not initialized before init_process_group.

Job without the fix keeps dynamically registering:
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-j544j0vn7zqh4c?job_attempt=0&version=0&env=PRODUCTION
The following keeps looping:
[0]:2024-02-14T10:48:18.873079 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: registered buffer 0x7f6ebe000000 len 608124000, state 1
[0]:2024-02-14T10:48:18.873087 twshared0039:4836:6232 [0] NCCL INFO *dynamicRegist = true
[0]:2024-02-14T10:48:18.903234 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregister buffer 0x7f6ebe000000 len 608124000, state 1
[0]:2024-02-14T10:48:18.903240 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregiter buffer 0x7f6ebe000000 len 608124000

Job with the fix does not have this issue:
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-hzm5dwqncr7l7?version=0&env=PRODUCTION

Reviewed By: minsii, kwen2501, xw285cornell

Differential Revision: D53770989

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120052
Approved by: https://github.com/kwen2501
2024-02-16 17:36:53 +00:00
fbe8e0f92d Fix missing right square bracket to match glog format (#119966)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119966
Approved by: https://github.com/oulgen
ghstack dependencies: #119869
2024-02-16 15:14:00 +00:00
9726d7ca8e Add lowering for logcumsumexp (#118753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118753
Approved by: https://github.com/peterbell10
ghstack dependencies: #119809
2024-02-16 14:04:38 +00:00
3f4dd9bfa4 Back out "[pytree] Require serialized_type_name" (#120041)
Summary:
D53785493 breaks apf.rec.ir.tests.ir_export_deserialize_test.IRExportDeserializeTest: test_export_deserialize_ebc failed:

https://www.internalfb.com/sandcastle/workflow/3436246515685789584

Test Plan: buck2 test mode/opt apf/rec/ir/tests:ir_export_deserialize_test

Differential Revision: D53834881

Co-authored-by: Wilson Hong <wilsonhong@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120041
Approved by: https://github.com/ydwu4
2024-02-16 10:02:25 +00:00
4625ecb858 Add decomp for linalg.cross (#119809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119809
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-02-16 09:58:38 +00:00
3693d8f467 Do not convert UnsupportedFakeTensorException into RuntimeError in runNode for proper graph breaking. (#120026)
Fixes https://github.com/pytorch/pytorch/issues/119779 by graph breaking properly; a complete solution would be to handle quantized tensors directly.

When generating a fake tensor throws UnsupportedFakeTensorException, the exception is handled and converted into Unimplemented inside wrap_fake_exception, which is then translated into a graph break.

However, run_node used to convert UnsupportedFakeTensorException into a RuntimeError, producing runtime errors instead of graph breaks whenever generating a fake tensor for a quantized tensor fails.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120026
Approved by: https://github.com/jansel
2024-02-16 09:21:58 +00:00
54025c01a7 [DCP][state_dict] Let distributed_state_dict filter out the compiler prefix (#119830)
Let distributed_state_dict filter out the compiler prefix

Differential Revision: [D53681864](https://our.internmc.facebook.com/intern/diff/D53681864/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119830
Approved by: https://github.com/wz337
2024-02-16 08:59:58 +00:00
bc7f3efb09 [aot_inductor] move CppWrapperCodeGen into a separate file (#119871)
This reverts commit d8e319a961bb872027f0abdc413d6beb7502ac9b.

Differential Revision: [D53817853](https://our.internmc.facebook.com/intern/diff/D53817853)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119871
Approved by: https://github.com/albanD, https://github.com/khabinov
ghstack dependencies: #119870
2024-02-16 08:14:20 +00:00
78c9b2948a [aot_inductor] move CudaWrapperCodeGen into a separate file (#119870)
This reverts commit 3ab08946d5052eaeda11d683d6a58e801a032755.

Differential Revision: [D53817852](https://our.internmc.facebook.com/intern/diff/D53817852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119870
Approved by: https://github.com/khabinov
2024-02-16 08:10:51 +00:00
8f9f12c068 Intel GPU Runtime Upstreaming for Device Allocator (#118091)
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of the device `Allocator` dedicated to XPU to PyTorch. Following our design, we prepare to generalize `Allocator` in parallel.

# Design
In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below.
<p align="center">
<img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218">
</p>

# Additional Context
We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU. The second PR covers the host `Allocator`.
Besides these PRs, we plan to make the device `Allocator` device-agnostic in another PR.
In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features back in the subsequent PR, which intends to generalize `Allocator`.

The differences from CUDA: only the key functionality is provided; AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segments, etc. are not yet included.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #117611, #117619, #117734
2024-02-16 06:46:00 +00:00
b8be8b639f Add Runtime Constant-Folding function of AOTInductor for AOTInductorModels used internally. (#119823)
Summary:
1. Make sure folded constants generated internally don't get exposed.
2. Add runConstantFolding and related API calls

Test Plan:
```buck2 run mode/opt-split-dwarf -c fbcode.nvcc_arch=v100,a100 caffe2/caffe2/fb/predictor/tests_gpu:pytorch_predictor_container_gpu_test -- --gtest_filter=*PyTorchPredictorContainerTest.LoadAOTInductorModel*
```
The test triggers the added predictor tests `test_aot_inductor_merge_net_file_*.predictor_20240206`,
which would trigger runConstantFolding from predictor's module loading.

Reviewed By: SherlockNoMad

Differential Revision: D53718139

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119823
Approved by: https://github.com/chenyang78
2024-02-16 06:45:48 +00:00
4dc75f9084 Intel GPU Runtime Upstreaming for Event (#117734)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event`, which tracks the status of an operation being executed. In some circumstances, `Event` lets us control operation execution at a finer grain.

# Design
`XPUEvent` is a movable but not copyable wrapper around a SYCL event. It is created lazily on an XPU device when recording an `XPUStream`. `XPUEvent` can wait for another `XPUEvent` or for all kernels submitted to an `XPUStream` to complete. Aligned with the other backends, the C++ files related to `Event` are placed in the `aten/src/ATen/xpu` folder. On the frontend, the `XPUEvent` runtime API is bound to Python as `torch.xpu.Event`; the corresponding C++ code lives in `torch/csrc/xpu/Event.cpp` and the Python code in `torch/xpu/streams.py`.
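A hedged usage sketch assuming the frontend mirrors `torch.cuda.Event`/`torch.cuda.Stream` (the stream context manager and method names below are assumptions based on this description, not code from this PR):

```python
import torch

s = torch.xpu.Stream()
e = torch.xpu.Event()
with torch.xpu.stream(s):
    y = torch.ones(4, device="xpu") * 2  # enqueue some work on the stream
    e.record(s)                          # underlying SYCL event is created lazily here
e.synchronize()                          # block until the recorded work completes
```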

# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`; we will add support for it soon. Meanwhile, `XPUEvent` doesn't support IPC across different processes. For the other parts, we have an almost 1:1 mapping with CUDA.

It lacks the following APIs:
- `torch.cuda.Event.ipc_handle`
- `CUDAEvent`'s constructor with `IpcEventHandle`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117611, #117619
2024-02-16 06:28:26 +00:00
02fb043522 Change native funcol inductor tests to use fake pg (#119104)
Summary:
Previously these tests required more than 2 GPUs to run. Change them to use a fake pg so they can run more often.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119104
Approved by: https://github.com/wconstab
ghstack dependencies: #119103
2024-02-16 05:18:45 +00:00
62e5840b36 [Dynamo] Do not create TorchInGraphFunctionVariable for tags (#120005)
Fixes https://github.com/pytorch/pytorch/issues/119793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120005
Approved by: https://github.com/yanboliang
2024-02-16 03:37:32 +00:00
ddde1e4dee [executorch hash update] update the pinned executorch hash (#119943)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119943
Approved by: https://github.com/pytorchbot
2024-02-16 03:36:56 +00:00
4eefe7285a Use ARMV8 fconv insns to speed up scalar fp16<->fp32 (#120012)
Thanks to a discussion with @mikekgfb I've realized that FP16_ARITH is the feature available by default on Apple Silicon, so let's use it to speed up the portable but slow bit-mashing algorithm implemented as `c10::detail::fp16_ieee_from_fp32_value`, by using the following implicit conversion routine:
```cpp
float sve_fp16_to_fp32_value(uint16_t h) {
  union {
     uint16_t h;
     float16_t f16;
  } x = {h};
  return x.f16;
}
```
that according to the https://godbolt.org/z/8s14GvEjo is turned into [`fcvt s0,h0`](https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/FCVT--Floating-point-Convert-precision--scalar--?lang=en)

As a result, the very slow and naive [`torch.mm`](edd9ddf73f/aten/src/ATen/native/cpu/BlasKernel.cpp (L108)) runs 3x faster: from 85 msec to 27 msec (measured by running e41341df2d/benchmarks/benchmark_torch_mm.py )

This is a reland of https://github.com/pytorch/pytorch/pull/119895 that got reverted because it was not buildable using Jetson toolkit

"Fixed" the problem by guarding the fast conversions with `!defined(__CUDACC__)`  (for internal folks, tested it by running `buck build @arvr/mode/embedded/jetson/linux/opt-stripped //xplat/caffe2:caffe2_ops_cuda_ovrsource` )
But also, extended the conversion to all AArch64 platforms, not just the ones that support FP16 arithmetic extensions (i.e. ARMv8.2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120012
Approved by: https://github.com/huydhn
2024-02-16 03:04:06 +00:00
3e5e8590f4 Account for inference mode in FakeTensor cache (#119963)
Summary: An fbcode test exposed a shortcoming where we served a FakeTensor from the cache with the wrong inference_mode. Take the current mode into account in the cache key so we only serve entries created under the mode we are currently in.

Test Plan: New unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119963
Approved by: https://github.com/eellison
2024-02-16 02:53:33 +00:00
8bfc87ce74 fixed flop counter formula for conv transposed backwards pass (#119874)
Fixes #119806
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119874
Approved by: https://github.com/zou3519
ghstack dependencies: #119521
2024-02-16 02:43:49 +00:00
17c345ebd9 [FSDP] compile compute and CI with @test_compiled_fsdp (#119933)
Goal: all unit tests run in eager, and we want to test torch.compile by default.

This PR adds ``@test_compiled_fsdp(compile_compute_on_module=None/TransformerBlock)`` to unit tests. For now it compiles compute only, as follows.

```
module.compile() # include user registered hooks if any
fully_shard(module)
```

torch.compile does not yet work with the following components:
* compiling AC
* compiling reshard_after_forward=2
* delayed_all_gather, delayed_reduce_scatter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119933
Approved by: https://github.com/awgu, https://github.com/jansel
2024-02-16 01:48:51 +00:00
c802c50196 Setup Nvidia Runtime before Indexer (#119923)
Sets up Nvidia Runtime and runs indexer inside a docker container.

Verified this works by running the indexer jobs (all the setup is correct, it OOMs for an unrelated reason, for which a fix is on the way).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119923
Approved by: https://github.com/huydhn
2024-02-16 00:33:18 +00:00
4319735ace Add meta registration for _foreach_norm (2nd try) (#119927)
The first try reused TensorListMetadata, which caused illegal memory access issues when there were too many tensors in the list. We just launch multiple kernels with a simpler version of the struct (to minimize kernels launched).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119927
Approved by: https://github.com/albanD
2024-02-16 00:23:23 +00:00
707cde9b31 [DTensor][Test] Improve math_ops test (#118956)
Previously, the DTensor `fully_shard_tensor` was created but not used in the `shard_math_ops` test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118956
Approved by: https://github.com/wanchaol
2024-02-15 23:59:25 +00:00
cyy
94f19fe545 [3/N] Replace std::tie with structural binding (#119962)
This PR follows https://github.com/pytorch/pytorch/pull/119774; it continues the work of cleaning up std::tie.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119962
Approved by: https://github.com/albanD
2024-02-15 23:48:28 +00:00
2a63dd8889 [Dynamo] Support lazy module with namedtuple/dict input (#119972)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119972
Approved by: https://github.com/jansel
2024-02-15 23:18:18 +00:00
f9f602fcb8 Clean up decorators (#119925)
as title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119925
Approved by: https://github.com/eellison
2024-02-15 22:51:53 +00:00
444c628e06 Include the scalar tensor auto-transfer in the doc (#119967)
Fixes #119609

@albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119967
Approved by: https://github.com/albanD
2024-02-15 22:37:39 +00:00
47300221c2 Revert "[export] Change runtime asserts to using assert_scalar (#119608)"
This reverts commit f4d641ba2fb11fca2ba47f0c425d8a4a1adbffb6.

Reverted https://github.com/pytorch/pytorch/pull/119608 on behalf of https://github.com/huydhn due to This break ONNX trunk job 65fd8b6730 ([comment](https://github.com/pytorch/pytorch/pull/119608#issuecomment-1947436402))
2024-02-15 22:25:24 +00:00
da1df5d7b8 [ROCm] Update triton wheels to ROCm 6.0 (#119765)
Upgrades nightly Triton wheels to ROCm 6.0 and adds bitcodes for gfx941 and gfx942.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119765
Approved by: https://github.com/jeffdaily, https://github.com/huydhn
2024-02-15 21:57:51 +00:00
3f4f91f2eb [inductor][eazy] fix profiler (#119959)
print_performance previously returned the total execution time for `times` runs, but now it returns the average execution time of a single run. Change the profiler to be consistent with that. Not sure if there is a good way to add a test, though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119959
Approved by: https://github.com/eellison
2024-02-15 21:47:09 +00:00
65fd8b6730 Revert "[export] Disable exported_program.__call__ (#119466)"
This reverts commit c26884f06345bf61e0843d13db84e76236ff6142.

Reverted https://github.com/pytorch/pytorch/pull/119466 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/119466#issuecomment-1947384298))
2024-02-15 21:42:32 +00:00
744898b311 Add doc page for environment variables that affect PyTorch Runtime (#119087)
# Summary

The goal of this PR is to add a doc page listing a number of environment variables that affect the PyTorch runtime. It will likely not be exhaustive, but hopefully it will be extended and updated to stay relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119087
Approved by: https://github.com/janeyx99, https://github.com/eqy
2024-02-15 21:41:38 +00:00
d707e3c9c6 Fix handling none source in build_torch_function_fn (#119724)
Fix https://github.com/pytorch/pytorch/issues/119580

When a UserDefinedObjectVariable is created, it does not always have a source, e.g. when it is an intermediate value.
This diff fixes the handling of a None source in two locations during inlining of a user torch function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119724
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
2024-02-15 21:21:47 +00:00
9548860b37 Fix typo in istft docstring (#119776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119776
Approved by: https://github.com/colesbury
2024-02-15 21:20:00 +00:00
a2f07bb317 Fix typo under docs directory (#119657)
This PR fixes typo under `docs` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119657
Approved by: https://github.com/colesbury
2024-02-15 21:14:34 +00:00
2d7a395c0f Fix typo in functional.py (#119775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119775
Approved by: https://github.com/colesbury
2024-02-15 21:14:29 +00:00
c3b4d78e17 [Dynamo][Easy] Fix a small bug in test_trace_rules.py (#119973)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119973
Approved by: https://github.com/zou3519
2024-02-15 20:44:32 +00:00
b4c7afe101 [pytree] Require serialized_type_name (#119718)
Differential Revision: [D53785493](https://our.internmc.facebook.com/intern/diff/D53785493)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119718
Approved by: https://github.com/suo
2024-02-15 20:32:44 +00:00
f32560c939 Remove Redundant Bullet Point (#120007)
Fast path explanation for scaled_dot_product_attention in nn.MultiHeadAttention mentioned inputs being batched with batch_first = True twice.  Removed the second mention of this requirement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120007
Approved by: https://github.com/mikaylagawarecki
2024-02-15 19:47:35 +00:00
605de946cf Clarify the patience in ReduceLROnPlateau (#119872)
Fixes #119763
@janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119872
Approved by: https://github.com/janeyx99
2024-02-15 19:43:06 +00:00
26b6de43e5 Revert "Use ARMV8.2 scalar fp16<->fp32 conversion (#119895)" (#120001)
This reverts commit d833e2f2364a01c6fdab689a8bb5bbf55a5b60f7.

This is failing some RL builds internally using clang 13 D53791577

https://github.com/pytorch/pytorch/pull/119895#issuecomment-1946859332.  The bot doesn't like a commit being merged into the stack base and fails to revert the PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120001
Approved by: https://github.com/malfet
2024-02-15 19:41:51 +00:00
9b6fae2d79 Tweak to pr#119719 - eager & fullgraph (#119921)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119921
Approved by: https://github.com/oulgen
2024-02-15 19:31:56 +00:00
01ee85c8ab [PyTorch][Vulkan]remove redundant test of log_softmax (#119964)
Summary: `vulkan_api_test.cpp` already has [a test for `log_softmax`](https://www.internalfb.com/code/fbsource/[c79b73bd7d5f661c81ff3cf999cfa1af664f0c48]/xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp?lines=4521), so we remove the redundant `DISABLED_log_softmax`. According to the comment the test was disabled because "the op is not working correctly. Add it back when it is fixed." Actually it's a simple typo mistake: the [CPU output should use `at::log_softmax` instead of `at::softmax`](https://www.internalfb.com/code/fbsource/[c79b73bd7d5f661c81ff3cf999cfa1af664f0c48]/xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp?lines=4548). Since we already have a test for `log_softmax`, the fix isn't necessary and we remove this disabled test.

Test Plan:
Full vulkan_api_test P1184744699:
```
LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan    //xplat/caffe2:pt_vulkan_api_test_bin
...
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 427 tests from VulkanAPITest (23633 ms total)

[----------] Global test environment tear-down
[==========] 427 tests from 1 test suite ran. (23634 ms total)
[  PASSED  ] 426 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```

Reviewed By: jorgep31415

Differential Revision: D53766200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119964
Approved by: https://github.com/jorgep31415
2024-02-15 19:16:56 +00:00
8835ff1b09 [AMD] Update hipify code to oss (#119958)
Summary: Syncing the hipify code to third party. Trunk was broken by multiple diffs D53716382 D53744795

Test Plan: sandcastle

Differential Revision: D53790854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119958
Approved by: https://github.com/jianyuh, https://github.com/drisspg
2024-02-15 19:14:34 +00:00
143b5f2745 Fix the missing device in _memory_profiler (#119751)
Fixes #119722,
1. Added the missing `device` argument in
```
max_memory_allocated = torch.cuda.max_memory_allocated()
max_memory_reserved = torch.cuda.max_memory_reserved()
```
2. Fixed the `device` parameter to be `device_str`. Based on these [lines](2bda6b4cb8/torch/profiler/profiler.py (L291)), the input device is a string (`device_str`) for
```
self.mem_tl.export_memory_timeline_html
self.mem_tl.export_memory_timeline_raw
self.mem_tl.export_memory_timeline
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119751
Approved by: https://github.com/aaronenyeshi
2024-02-15 19:11:15 +00:00
98fd23cccc [EASY] Move OpsHandler and MockHandler to their own file (#119851)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119851
Approved by: https://github.com/lezcano
ghstack dependencies: #119728
2024-02-15 18:54:41 +00:00
6f324e8776 [ATen] Tag isinf as a pointwise op (#119728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119728
Approved by: https://github.com/lezcano
2024-02-15 18:54:41 +00:00
eqy
e386bfa688 [CUDA][cuSPARSE] Work around IMA in cuSPARSE ALG1 on SM 8.9 devices (#119610)
Originally surfaced from the discuss forum:
https://discuss.pytorch.org/t/issue-with-torch-sparse-mm-while-running-on-gpu/188669

This has been forwarded to cuSPARSE but we have not yet received a commitment on their end to fix this issue directly.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119610
Approved by: https://github.com/jeffdaily, https://github.com/jcaip
2024-02-15 18:28:45 +00:00
2429495820 [FSDP2][ez] Made typing more strict to avoid cast (#119985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119985
Approved by: https://github.com/Skylion007, https://github.com/fegin
ghstack dependencies: #118298
2024-02-15 18:20:35 +00:00
840426e793 [export] Log export time. (#119960)
Summary: as title. we are logging the time to complete one export session.

Test Plan: CI

Differential Revision: D53737766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119960
Approved by: https://github.com/angelayi
2024-02-15 18:04:15 +00:00
9b38ee2343 Revert "Alternate sharding (#119078)"
This reverts commit 861acda20577739d52dd0bcf09e162192f25020f.

Reverted https://github.com/pytorch/pytorch/pull/119078 on behalf of https://github.com/clee2000 due to failing 861acda205 ([comment](https://github.com/pytorch/pytorch/pull/119078#issuecomment-1946583857))
2024-02-15 16:59:50 +00:00
a83a1bc43b Adding c10 device type to newly added DeviceAccelerator (#119961)
Follow up to https://github.com/pytorch/pytorch/pull/104364,

A new file submitted yesterday uses DeviceType without the c10 namespace; this fixes that. I haven't yet figured out a way to set up a test for this, but I will submit a follow-up PR once I figure that out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119961
Approved by: https://github.com/ezyang
2024-02-15 14:56:05 +00:00
e5bfdde7ba Fix the skip condition for test_c10d tests (#119938)
Seeing the following error for c10d tests when running on 1 GPU. Adding a skip when there are insufficient GPUs.

```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
referring to https://github.com/pytorch/pytorch/pull/84980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119938
Approved by: https://github.com/eqy, https://github.com/fegin
2024-02-15 11:03:39 +00:00
c26884f063 [export] Disable exported_program.__call__ (#119466)
Summary: `ExportedProgram` is an artifact produced by torch.export, containing the graph that is exported, along with other attributes about the original program such as the graph signature, state dict, and constants. One slightly confusing thing that users run into is that they treat the `ExportedProgram` as a `torch.nn.Module`, since the object is callable. However, as we do not plan to support all features that `torch.nn.Module`s have, like hooks, we want to create a distinction between it and the `ExportedProgram` by removing the `__call__` method. Instead users can create a proper `torch.nn.Module` through `exported_program.module()` and use that as a callable.
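A short sketch of the new calling pattern described above:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(M(), (torch.randn(2),))
mod = ep.module()          # materialize a proper nn.Module from the ExportedProgram
out = mod(torch.randn(2))  # call the module; ep(...) itself is no longer supported
```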

Test Plan: CI

Differential Revision: D53075378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119466
Approved by: https://github.com/zhxchen17, https://github.com/thiagocrepaldi
2024-02-15 08:49:34 +00:00
f4d641ba2f [export] Change runtime asserts to using assert_scalar (#119608)
By changing runtime symbolic asserts to use assert_scalar, the asserts can call into `expect_true` and modify the shape env, so that we can run through the traced graph module with fake tensors. With assert_async, the asserts only get hit at runtime, which means that if we run the graph module with fake tensors, the asserts will not affect the shape env, so later data-dependent calls on the fake tensors may result in GuardOnDataDependentSymNode errors.

https://github.com/pytorch/pytorch/issues/119587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119608
Approved by: https://github.com/ezyang
2024-02-15 07:13:42 +00:00
c83af673bc Allow CUDA extension builds to skip generating cuda dependencies during compile time (#119936)
nvcc flag `--generate-dependencies-with-compile` doesn't seem to be supported by `sccache` for now. Builds with this flag enabled will not benefit from sccache.

This PR adds an environment variable that allows users to control this flag and skip generating those nvcc dependencies, to speed up their builds with compiler caches. If everything is a "fresh build" in CI, we don't care about unnecessary recompiles during incremental builds.

related: https://github.com/pytorch/pytorch/pull/49344

- [ ] todo: raise an issue to sccache

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119936
Approved by: https://github.com/ezyang
2024-02-15 07:03:59 +00:00
cyy
d4882e438a [DeviceIndex][5/N] Use DeviceIndex in more places (#119866)
This PR follows the series of patches beginning with #119142 and fixes various CUDA related methods to use DeviceIndex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119866
Approved by: https://github.com/Skylion007
2024-02-15 07:01:43 +00:00
cyy
68328ad394 Check existence of caffe2::mkl target (#119945)
Fixes #118862
If libtorch is included multiple times in different sub-folders, linking caffe2::mkl may incur errors like
```
  Cannot specify link libraries for target "caffe2::mkl" which is not built
  by this project.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119945
Approved by: https://github.com/ezyang
2024-02-15 06:28:17 +00:00
0898ead2d5 Timestamp Embedding Indices Generated for TD (#119955)
Timestamps the generated embedding indices. Moves the old indices to an `archived/` folder and then uploads the index to a `latest/` folder. There will be a short period in between these operations where there is no index in `latest/`. To handle this case, any workflow fetching the index (such as the retriever) should use a retry with backoff when copying from S3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119955
Approved by: https://github.com/huydhn
2024-02-15 04:48:40 +00:00
af346df6a0 [PyTorch][Vulkan]fix the issue of log 0 after softmax (#119898)
Summary: In some cases the output of `softmax` are so small that they are below the float16 precision. These values are represented as 0 in float16 and result in `-inf` when log is applied. According to [Wikipedia](https://en.wikipedia.org/wiki/Half-precision_floating-point_format#Exponent_encoding), the minimum strictly positive (subnormal) value is 2^−24 ≈ 5.9605 × 10^−8. Therefore, we add 6 x 10^-8 to the output of softmax to avoid the numerical issue.
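A numeric illustration of the underflow described above (plain PyTorch, independent of the Vulkan backend):

```python
import torch

p = torch.tensor(1e-9, dtype=torch.float16)   # below float16's smallest subnormal (~5.96e-8)
print(p)                                       # tensor(0., dtype=torch.float16)
print(torch.log(p.float()))                    # tensor(-inf): the value already underflowed
print(torch.log(p.float() + 6e-8))             # finite (~ -16.6) once the epsilon is added
```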

Test Plan:
We add two tests:
- `log_softmax_underflow_exception` tests the log_softmax without adding epsilon to the output of softmax, so we expect to get nan or -inf. (**NOTE**: this test has passed on both devserver and on Android device, but failed on the `
fbsource//xplat/caffe2:vulkan_ops_testAndroid` test on CI. In this test, `log` of small numbers [even `log 0` shows output -88 instead of `-inf`](https://interncache-cco.fbcdn.net/v/t49.3276-7/379414752_342395058779076_6447867753374424757_n.txt?ccb=1-7&_nc_sid=ce8ad4&efg=eyJ1cmxnZW4iOiJwaHBfdXJsZ2VuX2NsaWVudC9pbnRlcm4vc2l0ZS94L3Rlc3RpbmZyYSJ9&_nc_ht=interncache-cco&oh=00_AfApTdId1WOHUqdoSTc66s6adnrQt1YS0NDT-LDppIvX0g&oe=65D0CC99). We cannot reproduce this error on device now, so we **DISABLE** this test for now to integrate into CI.)
- `log_softmax_underflow` tests the updated implementation of log_softmax, nan and -inf have been removed

## test on devserver

```
luwei@devbig984.prn1 /data/users/luwei/fbsource (9f6b78894)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan    //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*log_softmax_underflow*"
File changed: fbcode//caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
Buck UI: https://www.internalfb.com/buck2/baaaa683-60da-4dd8-95b9-6848fe1d7d74
Network: Up: 53KiB  Down: 1.4MiB  (reSessionID-9580ce4f-7e1e-4c65-8497-52443329b796)
Jobs completed: 6. Time elapsed: 24.2s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 1, local: 1)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *log_softmax_underflow*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ DISABLED ] VulkanAPITest.DISABLED_log_softmax_underflow_exception
[ RUN      ] VulkanAPITest.log_softmax_underflow
[       OK ] VulkanAPITest.log_softmax_underflow (169 ms)
[----------] 1 test from VulkanAPITest (169 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (169 ms total)
[  PASSED  ] 1 test.

  YOU HAVE 1 DISABLED TEST
```

full test results: P1184164670
```
[----------] 428 tests from VulkanAPITest (21974 ms total)

[----------] Global test environment tear-down
[==========] 428 tests from 1 test suite ran. (21974 ms total)
[  PASSED  ] 427 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 11 DISABLED TESTS
```

## test on device:
- build
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (82c91e8da)]$ buck2 build -c ndk.static_linking=true -c pt.enable_qpl=0  --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_api_test_binAndroid  --show-output
```
- push to device and run
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (82c91e8da)]$ adb shell /data/local/tmp/pt_vulkan_api_test_binAndroid --gtest_filter="*log_softmax_underflow*"
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *log_softmax_underflow*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ DISABLED ] VulkanAPITest.DISABLED_log_softmax_underflow_exception
[ RUN      ] VulkanAPITest.log_softmax_underflow
[       OK ] VulkanAPITest.log_softmax_underflow (292 ms)
[----------] 1 test from VulkanAPITest (293 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (294 ms total)
[  PASSED  ] 1 test.

  YOU HAVE 1 DISABLED TEST

```

Reviewed By: yipjustin

Differential Revision: D53694989

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119898
Approved by: https://github.com/jorgep31415
2024-02-15 03:59:44 +00:00
cd08dc37f8 Support tracing native functional collective via python APIs (#119103)
Summary:
- Inlined `torch.distributed.distributed_c10d._get_group_size_by_name`
- Updated all torch.compile tests in test_c10d_functional_native.py to use funcol python APIs (as opposed to the dispatcher ops)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119103
Approved by: https://github.com/wconstab, https://github.com/fegin, https://github.com/wanchaol
2024-02-15 03:33:49 +00:00
cyy
5f9b432494 [2/N] Replace std::tie with structural binding (#119879)
This PR follows #119774; the Python-generated code was changed to use structural bindings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119879
Approved by: https://github.com/albanD
2024-02-15 02:56:34 +00:00
9ff9798716 Fix a bug in kernel analysis with ttir defined args (#119934)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119934
Approved by: https://github.com/aakhundov
2024-02-15 02:49:11 +00:00
7f5b87c953 [torch.compile] Log more compilation time breakdown (#119865)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119865
Approved by: https://github.com/ezyang
2024-02-15 02:20:07 +00:00
516f38a144 [RelEng] Define BUILD_BUNDLE_PTXAS (#119750)
That would bundle PTXAS into a `bin` folder

When compiling for Triton, define `TRITON_PTXAS_PATH` if `ptxas` is bundled with PyTorch. This is needed to make PyTorch compiled against CUDA-11.8 usable with an 11.8 driver, as Triton is bundled with the latest (CUDA-12.3 at the time of the PyTorch-2.2 release) ptxas.

Needs 5c814e2527 to produce valid binary builds

Test plan:
- Create dummy ptxas in `torch/bin` folder and observe `torch.compile` fail with backtrace in Triton module.
- Run following script (to be added to binary tests ) against CUDA-11.8 wheel:
```python
import torch
import triton

@torch.compile
def foo(x: torch.Tensor) -> torch.Tensor:
  return torch.sin(x) + torch.cos(x)

x=torch.rand(3, 3, device="cuda")
print(foo(x))
# And check that CUDA versions match
cuda_version = torch.version.cuda
ptxas_version = triton.backends.nvidia.compiler.get_ptxas_version().decode("ascii")
assert cuda_version in ptxas_version, f"CUDA version mismatch: torch build with {cuda_version}, but Triton uses ptxs {ptxas_version}"
```

Fixes https://github.com/pytorch/pytorch/issues/119054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119750
Approved by: https://github.com/jansel, https://github.com/atalman
2024-02-15 02:08:57 +00:00
a07fd51b6b [caffe2] Add an avx512 implementation of adagrad_update (#113289)
Summary: As per title

Test Plan: contbuilds

Differential Revision: D50947444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113289
Approved by: https://github.com/ezyang
2024-02-15 01:45:30 +00:00
861acda205 Alternate sharding (#119078)
Changes sharding to attempt to put all serial tests on as few shards as possible. Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards.

Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests

Move serial tests to run first

If I want to move to a purely numbers based sharding, this ensures that parallel tests are run with parallel tests as much as possible instead of interleaving serial + parallel tests, which decreases effectiveness of parallelization, while also ensuring that test reordering is still mostly effective.

See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
2024-02-15 01:32:44 +00:00
b4252d73b1 Make pattern matcher more robust (#119876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119876
Approved by: https://github.com/cccclai
2024-02-15 00:48:16 +00:00
daf1050ae5 [dtensor] refactor sharding cost model to count for latency (#119897)
This PR refactors the sharding cost model to do a more accurate
estimation of redistribute cost, including both collective latency and
communication time.

The previous cost model did not rescale the latency and communication
time, so the latency factor was too small to be counted; in the case of
small tensors, multiple collectives were preferred over a single
collective, which is wrong.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119897
Approved by: https://github.com/tianyu-l
2024-02-15 00:35:56 +00:00
99cb807e25 Skip test_wrap_bad if run under pytest (#115070)
Pytest replaces sys.stdout/stderr by `TextIOWrapper` instances which do not support `fileno()`
Hence skip that test in this case

Fixes #115069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115070
Approved by: https://github.com/clee2000
2024-02-15 00:10:05 +00:00
d833e2f236 Use ARMV8.2 scalar fp16<->fp32 conversion (#119895)
Thanks to a discussion with @mikekgfb I've realized that SVE is the
feature available by default on Apple Silicon, so let's use it to speed up
the portable but slow bit-mashing algorithm implemented as `c10::detail::fp16_ieee_from_fp32_value` by using the following implicit conversion routine:
```cpp
float sve_fp16_to_fp32_value(uint16_t h) {
  union {
     uint16_t h;
     float16_t f16;
  } x = {h};
  return x.f16;
}
```
that according to the https://godbolt.org/z/8s14GvEjo is turned into [`fcvt s0,h0`](https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/FCVT--Floating-point-convert-precision--predicated--)

As a result, the very slow and naive [`torch.mm`](edd9ddf73f/aten/src/ATen/native/cpu/BlasKernel.cpp (L108)) runs 3x faster: from 85 msec to 27 msec (measured by running e41341df2d/benchmarks/benchmark_torch_mm.py )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119895
Approved by: https://github.com/mikekgfb
ghstack dependencies: #119892
2024-02-14 23:42:53 +00:00
096ebcca73 [FSDP2] Added gradient accumulation w/o reduction (#118298)
This PR adds a way to do gradient accumulation without collectives (i.e. reduce-scatter for FSDP and reduce-scatter/all-reduce for HSDP, though HSDP is not yet implemented). Since the `no_sync()` context manager has received some feedback, we simply define a method on the module to set whether the module requires gradient synchronization or not, where this method can recurse or not.
```
# Before with `no_sync()`:
with fsdp_model.no_sync() if not is_last_microbatch else contextlib.nullcontext():
  # Forward/backward

# After with a setter:
fsdp_model.set_requires_gradient_sync(not is_last_microbatch)
# Forward/backward
```
Having the method be able to recurse or not also gives some flexibility. For example, some large modules can still reduce-scatter, while some smaller modules can avoid it to save communication bandwidth:
```
fsdp_modules_to_reduce_scatter: Set[nn.Module] = ...
for module in fsdp_model.modules():
  if isinstance(module, FSDP) and module not in fsdp_modules_to_reduce_scatter:
    module.set_requires_gradient_sync(not is_last_microbatch)
# Forward/backward
```

(Separately, we may expose a helper that returns `[module for module in model.modules() if isinstance(module, FSDP)]`.)

---

To show the spirit of this API choice, I also included `set_requires_all_reduce` that would give us the ability to only reduce-scatter but not all-reduce for HSDP (originally from the MiCS paper). If we want to flexibly support heterogeneous sharding where FSDP is applied to some modules and HSDP to others in the same model, then having a module-level method that has the option to not recurse makes sense to me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118298
Approved by: https://github.com/wconstab, https://github.com/wanchaol
ghstack dependencies: #119550, #118136, #118223, #118755, #119825
2024-02-14 23:09:59 +00:00
8f27fde2f5 [export] Log private api uses. (#119848)
Summary:
as title.
The following APIs are logged:
- capture_preautograd_graph
- torch._export.aot_compile
- external usage of _export_to_torch_ir (AOTInductor, Pippy)
- constraints API
- public use of torch._dynamo.export

Test Plan: CI

Differential Revision: D53735599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119848
Approved by: https://github.com/suo
2024-02-14 22:58:23 +00:00
340b6fa972 Deduplicate docs between global and non-global full backward hooks (#119708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119708
Approved by: https://github.com/albanD
ghstack dependencies: #114970
2024-02-14 22:53:44 +00:00
3713103db4 Revert "[Inductor] Setting kernel launch and exit callbacks for inductor generated triton kernels (#119450)"
This reverts commit 4e93b00b692118b8531f3807ec95eb4c538ea419.

Reverted https://github.com/pytorch/pytorch/pull/119450 on behalf of https://github.com/soulitzer due to Regressed perf on the dashboard ([comment](https://github.com/pytorch/pytorch/pull/119450#issuecomment-1944876761))
2024-02-14 22:44:21 +00:00
756cf2913d Fix NJT stride access in SDPA dispatcher logic (#119846)
`._stride` -> `._strides`

Adds a test to cover this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119846
Approved by: https://github.com/drisspg, https://github.com/ani300, https://github.com/soulitzer
ghstack dependencies: #119910
2024-02-14 22:37:52 +00:00
0560c193a6 Fix meta registration for _flash_attention_forward() [ROCm forward fix] (#119910)
Addresses ROCm failures from #119812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119910
Approved by: https://github.com/drisspg
2024-02-14 22:37:52 +00:00
734ae20f2e [C10] Expand half unittest (#119892)
So far it has only been testing the legacy conversion, rather than the one actually used when `at::Half` is constructed.
Test `fp16` to `fp32` conversion over the whole range of its 65536 values, though skip NaN comparisons, as different algorithms are not guaranteed to yield identical NaN representations (and they are different anyway).
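
A rough Python analogue of such an exhaustive check (the actual test is a C++ unit test; NumPy's conversion serves as the reference here and every detail below is illustrative):
```python
import numpy as np
import torch

# Iterate over every 16-bit pattern, reinterpret it as fp16, and compare two
# independent fp16 -> fp32 conversion paths, skipping NaN payload comparisons.
for bits in range(65536):
    h = np.uint16(bits).view(np.float16)
    ref = float(np.float32(h))                        # NumPy's conversion
    got = torch.tensor(h).to(torch.float32).item()    # PyTorch's conversion
    if np.isnan(ref):
        continue  # NaN bit patterns may legitimately differ
    assert ref == got, f"mismatch for bit pattern {bits:#06x}"
```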

Also do a small code cleanup: remove extraneous semicolons as well as a named namespace inside an unnamed one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119892
Approved by: https://github.com/kit1980
2024-02-14 22:32:43 +00:00
3470ab42bb [DCP] Automatically set no_dist if distributed is unavailable (#119813)
[DCP] Automatically set `no_dist` if distributed is unavailable

Differential Revision: [D53718043](https://our.internmc.facebook.com/intern/diff/D53718043/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119813
Approved by: https://github.com/fegin, https://github.com/wz337
2024-02-14 22:25:07 +00:00
cd380c794f [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)
#113713

Going to clean up some of the checks and will remove draft status after.
Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`.
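
A hedged sketch of how one might exercise the new path (shapes are arbitrary; setting the environment variable before importing torch is the conservative assumption):
```python
import os
os.environ["TORCH_CUDNN_MHA_ENABLED"] = "1"  # opt into the experimental cuDNN SDPA backend

import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.half) for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v)
```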

CC @drisspg @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663
Approved by: https://github.com/drisspg
2024-02-14 22:02:06 +00:00
9ec8dd2467 Reify view_func() closures as ViewFuncs (#118404)
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.

```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
  virtual ~ViewFunc() {}
  /// Returns any SymInts in the saved state.
  virtual std::vector<c10::SymInt> get_symints() const { return {}; }
  /// Returns the number of SymInts in the saved state.
  virtual size_t num_symints() const { return 0; }
  /// Returns any tensors in the saved state.
  virtual std::vector<at::Tensor> get_tensors() const { return {}; }
  /// Returns the number of tensors in the saved state.
  virtual size_t num_tensors() const { return 0; }
  /// Reapplies the view on the given base using the saved state.
  virtual at::Tensor operator()(const at::Tensor&) const = 0;
  /// Returns a clone of this ViewFunc, optionally with the specified saved state.
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;

protected:
  /// Sets the values of any SymInts in the saved state. The input vector size must
  /// match the number of SymInts in the saved state (i.e. the size of the list
  /// returned by get_symints()).
  virtual void set_symints(std::vector<c10::SymInt>) {}
  /// Sets the values of any Tensors in the saved state. The input vector size must
  /// match the number of Tensors in the saved state (i.e. the size of the list
  /// returned by get_tensors()).
  virtual void set_tensors(std::vector<at::Tensor>) {}
};
```

New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`

The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd.

Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
  SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
  {};
  virtual ~SliceTensorViewFunc() override {};
  virtual std::vector<c10::SymInt> get_symints() const override;
  virtual size_t num_symints() const override;
  virtual std::vector<at::Tensor> get_tensors() const override;
  virtual size_t num_tensors() const override;
  virtual at::Tensor operator()(const at::Tensor&) const override;
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;

protected:
  virtual void set_symints(std::vector<c10::SymInt>) override;
  virtual void set_tensors(std::vector<at::Tensor>) override;

private:
  int64_t dim;
  c10::optional<c10::SymInt> start;
  c10::optional<c10::SymInt> end;
  c10::SymInt step;
};
...

// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
  ::std::vector<c10::SymInt> symints;
  symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
  if(start.has_value()) symints.insert(symints.end(), *(start));
  if(end.has_value()) symints.insert(symints.end(), *(end));
  symints.push_back(step);
  return symints;
}

size_t SliceTensorViewFunc::num_symints() const {
  return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}

void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
  TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
  auto i = 0;
  if(start.has_value()) start = symints[i];
  i += (start.has_value() ? 1 : 0);
  if(end.has_value()) end = symints[i];
  i += (end.has_value() ? 1 : 0);
  step = symints[i];
}

std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
  ::std::vector<at::Tensor> tensors;
  return tensors;
}

size_t SliceTensorViewFunc::num_tensors() const {
  return static_cast<size_t>(0);
}

void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
  TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());

}

at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
  return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}

std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
    std::optional<std::vector<c10::SymInt>> symints,
    std::optional<std::vector<at::Tensor>> tensors) const {
  auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
  if (symints.has_value()) {
    output->set_symints(std::move(*(symints)));
  }
  if (tensors.has_value()) {
    output->set_tensors(std::move(*(tensors)));
  }
  return output;
}
```

The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.
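
A hedged Python sketch of these hooks (assuming `_view_func()` takes the new base followed by the two optional callables; the no-op visitors below are purely illustrative):
```python
import torch

base = torch.randn(4, 4, requires_grad=True)
view = base.narrow(0, 1, 2)

def symint_visitor_fn(s):
    return s  # e.g. substitute a symbolic value during fake-ification

def tensor_visitor_fn(t):
    return t  # e.g. substitute a fake tensor during fake-ification

new_base = torch.randn(4, 4, requires_grad=True)
replayed = view._view_func(new_base, symint_visitor_fn, tensor_visitor_fn)
```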

For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
2024-02-14 22:00:43 +00:00
6b04251b87 [inductor][scheduler] Use set for origin (#119861)
xref - https://github.com/pytorch/pytorch/issues/119440

This avoids node > node comparison if the origin order is the same in the origins tuple. However, I am unable to come up with a test case where this could happen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119861
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-02-14 22:00:38 +00:00
29235c7063 Handle aliases correctly in foreach (#119508)
Fixes https://github.com/pytorch/pytorch/issues/119436

<s>In essence we need to ensure aliases are run in separate foreach kernels so that they are ordered correctly. Previously, aliases could end up in the same kernel which creates weird scheduling dependencies.</s>

There was a bug in cycle detection/can_fuse which was creating cycles when more than two aliases were used in foreach nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119508
Approved by: https://github.com/jansel
2024-02-14 21:21:28 +00:00
e0f6fa6a7c Windows Dynamo Error Removal CI Check (#115969)
Rebase of #111313 onto `main`, for CI validation

Co-authored-by: Stella Laurenzo <stellaraccident@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115969
Approved by: https://github.com/PaliC, https://github.com/thiagocrepaldi
2024-02-14 21:14:36 +00:00
9201d7335a Add pixel_shuffle to core aten decomps (#119899)
Summary: https://github.com/pytorch/pytorch/pull/118239 added a decomposition for pixel_shuffle, so pixel_shuffle no longer needs to be a Core ATen Op. We have also fixed the internal use case so that it no longer special cases on pixel_shuffle, allowing us to revert the changes in https://github.com/pytorch/pytorch/pull/118921.

Test Plan: CI

Differential Revision: D53766709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119899
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-02-14 21:01:11 +00:00
244b124bb8 Add linux cpu test for 3.12 (#117853)
This is continuation of work: https://github.com/pytorch/pytorch/pull/113987

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117853
Approved by: https://github.com/albanD
2024-02-14 20:52:23 +00:00
bb67a28738 [DTensor] Enable Adamax foreach optimizer (#119850)
Enable Adamax foreach optimizer and add DTensor unit test for Adamax.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119850
Approved by: https://github.com/wanchaol
2024-02-14 20:43:00 +00:00
2aad3f93f8 Fix guards for field access through properties (#119719)
When building guards that went through a property, we were analyzing the property using getattr_static, but the guard wasn't built using getattr_static, so if the property was "unusual" it generated misbehaved code which referenced a non-existent `__closure__` field.

Fixes #118786

Note that after this change some of the referenced tests are still failing with a different error - but getting further.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119719
Approved by: https://github.com/oulgen
2024-02-14 20:42:55 +00:00
7797a8c2cb [testing][inductor] Allow grad tolerance override (#119844)
Introduce `grad_atol` and `grad_rtol` kwargs; the default behavior is
preserved by falling back to the `atol` and `rtol` values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119844
Approved by: https://github.com/peterbell10
2024-02-14 20:18:48 +00:00
15f1b9f1c4 Improve TORCHDYNAMO_EXTENDED_DEBUG for GuardOnDataDependentSymNode (#119412)
This PR substantially improves the error reporting for GuardOnDataDependentSymNode in the following ways:

* The GuardOnDataDependentSymNode error message is rewritten for clarity, and contains a link to a new doc on how to resolve these issues https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit#heading=h.44gwi83jepaj
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL`, which lets you specify a symbol name to get detailed debug information when it is logged (e.g., the full backtrace and user backtrace of the symbol creation). The exact symbols that you may be interested in are now explicitly spelled out in the error message.
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CPP` which enables reporting C++ backtraces whenever we would report a backtrace.
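
For illustration, a hedged sketch of a repro that raises `GuardOnDataDependentSymNode` with both knobs enabled (the symbol name `u0` is hypothetical and would normally be copied from the error message):
```python
import os
os.environ["TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL"] = "u0"
os.environ["TORCHDYNAMO_EXTENDED_DEBUG_CPP"] = "1"

import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True  # keep .item() as an unbacked SymInt

@torch.compile(fullgraph=True)
def f(x):
    n = x.item()   # creates an unbacked SymInt
    if n > 0:      # data-dependent branch -> GuardOnDataDependentSymNode
        return torch.ones(n)
    return torch.zeros(1)

f(torch.tensor([3]))  # expected to fail with the extended debug info described above
```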

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119412
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #117356
2024-02-14 20:01:07 +00:00
0e6eee3c89 [ROCm] TunableOp (#114894)
Some operations, such as GEMMs, could be implemented using more than one library or more than one technique. For example, a GEMM could be implemented for CUDA or ROCm using either the blas or blasLt libraries. Further, ROCm's rocblas and hipblaslt libraries allow the user to query for all possible algorithms and then choose one. How does one know which implementation is the fastest and should be chosen? That's what TunableOp provides.

See the README.md for additional details.
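
A hedged usage sketch (the environment-variable name comes from the TunableOp README referenced above and should be treated as an assumption here; shapes are arbitrary):
```python
import os
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"  # turn on tuning for supported ops

import torch

a = torch.randn(1024, 1024, device="cuda", dtype=torch.half)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.half)
c = a @ b  # the first call benchmarks the available GEMM implementations and picks the fastest
```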

TunableOp was ported from onnxruntime starting from commit 08dce54266.  The content was significantly modified and reorganized for use within PyTorch.  The files copied and their approximate new names or source content location within aten/src/ATen/cuda/tunable include the following:

- onnxruntime/core/framework/tunable.h -> Tunable.h
- onnxruntime/core/framework/tuning_context.h -> Tunable.h
- onnxruntime/core/framework/tuning_context_impl.h -> Tunable.cpp
- onnxruntime/core/providers/rocm/tunable/gemm_common.h -> GemmCommon.h
- onnxruntime/core/providers/rocm/tunable/gemm_hipblaslt.h -> GemmHipblaslt.h
- onnxruntime/core/providers/rocm/tunable/gemm_rocblas.h -> GemmRocblas.h
- onnxruntime/core/providers/rocm/tunable/gemm_tunable.cuh -> TunableGemm.h
- onnxruntime/core/providers/rocm/tunable/rocm_tuning_context.cc -> Tunable.cpp
- onnxruntime/core/providers/rocm/tunable/util.h -> StreamTimer.h
- onnxruntime/core/providers/rocm/tunable/util.cc -> StreamTimer.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114894
Approved by: https://github.com/xw285cornell, https://github.com/jianyuh
2024-02-14 19:03:49 +00:00
90f785dc34 Change default TORCH_LOGS format to match Meta/glog standard (#119869)
Before:

```
[2024-02-13 19:34:50,591] [0/0] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2024-02-13 19:34:50,591] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['x'], 70049616)                            # assert x.shape[0] > 2  # b.py:5 in f
[2024-02-13 19:34:50,592] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['x'], '_dynamo_dynamic_indices') == False           # assert x.shape[0] > 2  # b.py:5 in f
```

After this change, the logs look like this:

```
V0214 07:00:49.354000 139646045393920 torch/_dynamo/guards.py:1023 [0/0] GUARDS:
V0214 07:00:49.354000 139646045393920 torch/_dynamo/guards.py:1039 [0/0] ___check_type_id(L['x'], 70050096)                            # assert x.shape[0] > 2  # b.py:5 in f
V0214 07:00:49.355000 139646045393920 torch/_dynamo/guards.py:1039 [0/0] hasattr(L['x'], '_dynamo_dynamic_indices') == False           # assert x.shape[0] > 2  # b.py:5 in f
```

The main differences from what we had before:

* We don't print DEBUG/INFO/WARNING, instead, we only print a single character. DEBUG, somewhat oddly, maps to V, because it corresponds to glog VERBOSE
* The year is omitted, and a more compact representation for date/month is adopted. Somewhat perplexingly, six digits are allocated for the nanoseconds, even though Python typically doesn't have that level of resolution
* The thread ID is included (in a containerized environment, this thread id will be typically much lower)
* Instead of using the module name, we give a filepath, as well as the line the log message was emitted from. I think the line number is a nice touch and improvement over our old logs, but one downside is we do lose the artifact name in the log message, in case anyone was grepping for that.
* I chose to move the compile id prefix to the very end so as to keep a uniform layout before it, but I do think there are benefits to having it before the filename
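
For reference, a hedged sketch of the kind of script and invocation that produces guard logs like the excerpts above (the `TORCH_LOGS` value and the script are assumptions, not from this PR):
```python
# Run as: TORCH_LOGS="guards" python b.py
import torch

@torch.compile
def f(x):
    assert x.shape[0] > 2
    return x + 1

f(torch.randn(4))
```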

Meta only: This format was reverse engineered off of 6b8bbe3b53/supervisor/logging.py and https://www.internalfb.com/code/fbsource/[e6728305a48540110f2bdba198aa74eee47290f9]/fbcode/tupperware/front_end/log_reader/filter/StreamingLogLineFilter.cpp?lines=105-114

Now, I think this may be slightly controversial, but I have chosen to apply this format *by default* in OSS. My reasoning is that many PT2 developers work with the logs in OSS, and keeping the format identical to what we run in prod will make it easier for these skills to transfer.

The non-negotiable portion of the new format is "V0213 19:28:32"; the date string is expected to be in exactly this form or Tupperware will fail to parse it as a date.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119869
Approved by: https://github.com/oulgen, https://github.com/mlazos, https://github.com/Skylion007
2024-02-14 18:56:35 +00:00
d999222fba [dtensor] add op support for nll_loss_backward (#119256)
As titled. This is a followup to PR #118917 on nll_loss_forward. It also fixes an issue in it: the forward function produces two return values, the loss `result` and the `total_weight`. The previous PR didn't explicitly deal with the `total_weight` part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119256
Approved by: https://github.com/wanchaol
2024-02-14 18:50:42 +00:00
47182a8f4b Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we both have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) and have our own logic retry it as well.

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-14 18:40:23 +00:00
6cf48187c5 [export] Remove references to capture_pre_autograd_graph inside test_export (#119875)
Summary: Title

Test Plan: CI

Reviewed By: zhxchen17

Differential Revision: D53728889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119875
Approved by: https://github.com/angelayi
2024-02-14 17:59:10 +00:00
ee3a7bdc2d [export] Don't error if nn_module_stack doesn't contain a class (#119753)
Summary: When we deserialize nn_module_stack, sometimes the module no longer exists in the python environment so we cannot deserialize it back into the python type and instead it's kept as a string. This causes downstream failures when retracing due to one of our checks in export. This diff just bypasses the check.

Test Plan: CI

Reviewed By: chakriu

Differential Revision: D53527706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119753
Approved by: https://github.com/zhxchen17
2024-02-14 16:56:11 +00:00
3e21c785a4 [ROCm] Initial ir.Scan/aten.cumsum lowering support on ROCm (#119369)
It was noted in https://github.com/pytorch/pytorch/pull/117992 that ROCm is still falling back to eager for scans with inductor.

Initially, as part of https://github.com/pytorch/pytorch/pull/106581, ROCm was disabled for this feature due to lack of Triton support.

This PR will enable support for lowering scan operations on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119369
Approved by: https://github.com/peterbell10
2024-02-14 16:13:46 +00:00
fb492f7ca1 [inductor] Reorder if check to avoid more expensive check. (#119817)
If `mkldnn` is not enabled or not available there is no point in performing a relatively expensive `all` check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119817
Approved by: https://github.com/Skylion007
2024-02-14 16:04:31 +00:00
184605ae7d [inductor] Replace generators with map. (#119818)
It's more concise and efficient.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119818
Approved by: https://github.com/Skylion007, https://github.com/Neilblaze
2024-02-14 16:02:52 +00:00
edd9ddf73f Propagate allow_non_graph_fake between get_fake_values_from_nodes and get_fake_values (#119731)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119731
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #119314, #119435
2024-02-14 15:26:17 +00:00
cyy
87c6cd2f00 [1/N] Replace std::tie with structural binding (#119774)
This PR replaces some std::tie calls with structured bindings from C++17.  This not only makes the code more compact, but can also yield some performance gain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119774
Approved by: https://github.com/albanD, https://github.com/malfet
2024-02-14 09:25:04 +00:00
a45c627f27 [c10d][flight recorder] store a copy of string in entry (#119837)
Summary:
Previously, we just stored the char pointer in the entry; the string is a
temp object and will be destructed when we want to dump/access it.

A quick fix is to store a copy of the string, but without changing the
upstream char*.

An alternative is to change every profilingTitle into std::string; this,
however, would need a comprehensive overhaul of the code up to the
c10d::work layer above workNCCL, RecordFunction, etc.

We chose the first option for this change.

Resolve #119808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119837
Approved by: https://github.com/zdevito, https://github.com/wconstab
2024-02-14 09:13:56 +00:00
4a50572c92 [inductor] Recursively unwrap_storage_for_input when convert_to_reinterpret_view fails (#119867)
Summary:
When, during an `ExternKernel.realize_input` call, the underlying `ExternKernel.convert_to_reinterpret_view` fails, we currently fall back to `cls.copy_input` here:

31e59766e7/torch/_inductor/ir.py (L3805-L3816)

This creates a `TensorBox(StorageBox(...))` wrapped output, which causes a problem for this assertion:

31e59766e7/torch/_inductor/ir.py (L3479)

Here we add special-case handling for this to unwrap `x` recursively.

Test Plan:
This local repro:

```
@torch.compile()
def f(a, b, mat1, mat2):
    bias = torch.bmm(a + 3.14, b).permute(0, 2, 1).reshape(3992, -1)
    return torch.addmm(bias, mat1, mat2)
f(
    torch.randn(3992, 20, 40).cuda(),
    torch.randn(3992, 40, 192).cuda(),
    torch.empty(3992, 1024).cuda(),
    torch.empty(1024, 3840).cuda(),
)
```

with this line:

690f54b0f5/torch/_inductor/fx_passes/post_grad.py (L650)

changed to `if cond(*args, **kwargs):` fails before and succeeds after this PR.

Differential Revision: D53743146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119867
Approved by: https://github.com/xw285cornell
2024-02-14 07:50:34 +00:00
9f44274373 Add tests to verify disabled optimizers (#118919)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118919
Approved by: https://github.com/janeyx99
2024-02-14 07:45:16 +00:00
ca55468416 Target Determinator Indexer Workflow (#118824)
As described in [this talk](https://www.youtube.com/watch?v=I95KmF6KSIA) and [this repo](https://github.com/osalpekar/llm-target-determinator),  we are experimenting with using CodeLlama-powered information retrieval for target determination.

The idea is that we create embeddings for PyTorch test functions, and store this index in S3. Then when a new PR comes in, we create embedding(s) for that PR, compare them to the index of test embeddings, and run only the most relevant tests.

This PR creates a workflow that does the indexing part (creating embeddings for functions and store in S3). All the logic for running the indexer is in [osalpekar/llm-target-determinator](https://github.com/osalpekar/llm-target-determinator). This workflow just checks out the relevant repos, installs the dependencies, runs the torchrun command to trigger indexing, and uploads the artifacts to S3.
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118824
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2024-02-14 06:21:18 +00:00
caf9d9d7c1 [executorch hash update] update the pinned executorch hash (#119733)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119733
Approved by: https://github.com/pytorchbot
2024-02-14 06:15:25 +00:00
54a30f6d4e [Dynamo] Update trace_rules.py and re-enable skipped tests (#119860)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119860
Approved by: https://github.com/angelayi
2024-02-14 05:22:55 +00:00
8ba2675488 Fix for-loop divisibility parsing (#119859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119859
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834, #119835, #119836, #119838
2024-02-14 05:09:59 +00:00
1f0e4ac146 Add support for while-loops in ttir analysis (#119838)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119838
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834, #119835, #119836
2024-02-14 05:09:59 +00:00
5ffac768f6 Add support for labels to ttir analysis (#119836)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119836
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834, #119835
2024-02-14 05:09:59 +00:00
3f09c5ee66 Add TTIR verification (#119835)
Make sure the TTIR generated is valid before attempting to analyze. Incorrectly written triton code would produce broken TTIR. Minor discussion on https://github.com/openai/triton/issues/3120
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119835
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834
2024-02-14 05:09:59 +00:00
b257ff80da Add test scf.for with multi return (#119834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119834
Approved by: https://github.com/aakhundov
2024-02-14 05:09:59 +00:00
72bbbab70a Add the missing test_dynamo_list_index from #119151 (D53392287) (#119854)
D53392287 botched the export somehow and the exported PR https://github.com/pytorch/pytorch/pull/119151 didn't contain the added test.  The discrepancy is showing up on diff train patch up diff D53694548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119854
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-02-14 04:10:02 +00:00
563f1b9fef [inductor] Use torch.cuda.clock_rate instead of triton.testing.nvsmi (#118662)
`triton.testing.nvsmi` invokes `nvidia-smi` as a subprocess, and Meta
prod usually doesn't make nvidia-smi available.  Might as well just use
something that's native to torch.

Differential Revision: [D53235814](https://our.internmc.facebook.com/intern/diff/D53235814/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118662
Approved by: https://github.com/jansel
2024-02-14 03:23:49 +00:00
80379ef0aa [dynamo-must-fix] Use ID_MATCH for UserDefinedClass (#119853)
Fixes https://github.com/pytorch/pytorch/issues/119715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119853
Approved by: https://github.com/jansel
2024-02-14 03:14:42 +00:00
4240304da4 [TorchElastic] Handle SystemExit with code == 0 (#119697)
Summary:
Fix for a case where the --run-path option fails to exit if the script exits with a non-error status code.
When there is an error exit code, run-path correctly detects the error and fails when calling spawn.join(). However, for the non-error case, the current behavior is to check the return value of the operation; the fix is to return None so that our MP code detects an exit.

Test Plan:
cat /tmp/script.py
~~~
import sys
def main():
    exit_code = 1
    if len(sys.argv) > 1:
        exit_code = int(sys.argv[1])
    sys.exit(exit_code)

if __name__=="__main__":
    main()
~~~

Case of exit code with 0 (prior behavior - never exits):
torchrun --run-path /tmp/script.py 0

~~~
[2024-02-12 09:20:57,523] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:20:58,980] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
(conda:pytorch) ➜  workspace echo $?
0
~~~

Existing behavior for non-zero exit code still works:
torchrun --run-path /tmp/script.py
~~~
(conda:pytorch) ➜  workspace torchrun --run-path /tmp/script.py
[2024-02-12 09:16:20,667] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:22,197] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 64668) of fn: run_script_path (start_method: spawn)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] Traceback (most recent call last):
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/api.py", line 441, in _poll
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]     self._pc.join(-1)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/Users/kurman/workspace/pytorch/torch/multiprocessing/spawn.py", line 177, in join
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]     raise ProcessExitedException(
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
Traceback (most recent call last):
  File "/Users/kurman/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 812, in main
    run(args)
  File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-12_09:16:25
  host      : kurman-mbp.dhcp.thefacebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 64668)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
~~~

Differential Revision: D53653874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119697
Approved by: https://github.com/wconstab
2024-02-14 03:09:09 +00:00
5ce305270b Add a decomposition for isin() (#115390)
Co-authored-by: Peter Bell <peterbell10@live.co.uk>
Co-authored-by: Mario Lezcano Casado <3291265+lezcano@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115390
Approved by: https://github.com/peterbell10
2024-02-14 03:03:42 +00:00
75a6d6aef7 [inductor] Support storage resizing (#119749)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119749
Approved by: https://github.com/yf225
ghstack dependencies: #119647, #119671
2024-02-14 03:03:38 +00:00
31e59766e7 Fix meta registration for _flash_attention_forward() (#119812)
Meta registration wrongly assumes 4D inputs, while the underlying op allows 3D inputs for the `mha_varlen_fwd()` case.
Testing: I added `detach()`es so the NJT test `test_sdpa_compile()` won't fail for a view-related reason. It should pass now with this fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119812
Approved by: https://github.com/drisspg
2024-02-14 02:38:53 +00:00
179ecab7e7 Do full checkout in lint workflow to rebuild new Docker images (#119858)
From https://github.com/pytorch/pytorch/pull/119575, using `fetch-depth: 1` didn't work for `calculate-docker-image` when rebuilding a new one.  Specifically, doing a full checkout is needed for `git rev-parse HEAD~:.ci/docker` to get the Docker tag.

This shows up as a trunk failure after the recent Docker image update 507db17675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119858
Approved by: https://github.com/PaliC, https://github.com/clee2000, https://github.com/malfet
2024-02-14 02:37:54 +00:00
690f54b0f5 [dynamo][nit] Cleanup analyze_kernel_mutations nits. (#119703)
Using `extend` is more efficient and other changes are stylistic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119703
Approved by: https://github.com/Skylion007
2024-02-14 02:04:13 +00:00
f9f0c67445 beef up non-overlapping checks for detecting false aliasing of graph inputs (#119826)
This extra check is needed for some more complicated parameter sizes/strides for an internal model

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119826
Approved by: https://github.com/albanD
2024-02-14 01:46:30 +00:00
c9459e7f55 Update atomicMaxFloat (#119577)
# Summary

Initially reported in https://github.com/pytorch/pytorch/issues/119320

I found that by updating this function the NaN values went away. I then created a godbolt link to try and highlight the difference between the two versions:
https://godbolt.org/z/3sKqEqn4M

However, they appear to always produce the same value as the nvcc version is varied, except that for some versions -inf is chosen and for others the correct subnormal is chosen... I am having a hard time finding an isolated test case for this but will keep working on it.

### Update:
I added printf statements to the version and indeed some values/`*addr` contain -0.0f, which is why this update fixes the reported issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119577
Approved by: https://github.com/yifuwang
2024-02-14 01:17:16 +00:00
suo
8e029dc616 [export] fix tuple return with symints (#119829)
as title.

Differential Revision: [D53726648](https://our.internmc.facebook.com/intern/diff/D53726648/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119829
Approved by: https://github.com/zhxchen17, https://github.com/khabinov
2024-02-14 01:16:38 +00:00
4a5b2cd6cb Revert "Windows Dynamo Error Removal CI Check (#115969)"
This reverts commit 45e7af5818f1d4ab1cf568390b3721b9be4251a9.

Reverted https://github.com/pytorch/pytorch/pull/115969 on behalf of https://github.com/PaliC due to this pr ended up breaking some of our periodic tests ([comment](https://github.com/pytorch/pytorch/pull/115969#issuecomment-1942934386))
2024-02-14 01:11:46 +00:00
16369816a2 [sparse] semi-structured sparse refactor (#117302)
Summary:

This PR is a refactor of semi-structured sparsity support.

**deprecation**:

Before, `torch.sparse.to_sparse_semi_structured` had a kwarg param
`transposed=False`, which has been removed. This kwarg was unused, and
passing it now throws a deprecation warning.

Namely, I've taken the subclassing implementation that xFormers has
created and brought it over to PyTorch, as part of our plan to upstream
runtime 2:4 sparsity.

I've also copied over all the op support that Daniel implemented that
did not depend on the fast sparsification routines, into
`_sparse_semi_structured_ops.py`.

With this subclass, all of our internal tests pass, as well as those in
xFormers.

The main change is that we now define a base subclass,
`SparseSemiStructuredTensor` that is inherited from for each of the
specific backends.

We also can now arbitrarily override the sparse dispatch table with
`_load_dispatch_table()`, the idea being that this is still general enough
that users don't need to modify PyTorch source code to get their model
working.

This also adds in padding support and stores alg_id and fuse_transpose
as flags on the tensor, instead of hardcoding them.
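
A hedged usage sketch of the subclass (dtype, shapes, and the 2:4 masking below are assumptions; exact backend requirements may differ):
```python
import torch
from torch.sparse import to_sparse_semi_structured

# Build a weight satisfying the 2:4 pattern: 2 non-zeros in every group of 4 elements.
w = torch.randn(128, 128, device="cuda", dtype=torch.half)
mask = torch.tensor([1, 1, 0, 0], device="cuda", dtype=torch.half).tile(128, 32)
w_sparse = to_sparse_semi_structured(w * mask)

x = torch.randn(128, 64, device="cuda", dtype=torch.half)
y = torch.mm(w_sparse, x)  # dispatches to the 2:4 sparse kernel via the subclass
```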

There still remain two components in xFormers that will need to be
ported over eventually:
- the autograd functions  (`Sparsify24`, `Sparsify24_like`)
- fast sparsification routines that they rely on

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117302
Approved by: https://github.com/alexsamardzic, https://github.com/HDCharles
2024-02-14 01:10:40 +00:00
2536c5186e [BE] Properly mark destructor overrides (Take 2) (#119656)
Otherwise, at least on MacOS, builds are littered with:
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/DeviceAccelerator.h:6:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MTIAHooksInterface.h:23:11: warning: '~MTIAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~MTIAHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/CUDAHooksInterface.h:65:11: warning: '~CUDAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~CUDAHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MPSHooksInterface.h:21:11: warning: '~MPSHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~MPSHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
```

 Likely introduced by https://github.com/pytorch/pytorch/pull/119329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119656
Approved by: https://github.com/Skylion007
2024-02-14 01:05:58 +00:00
cyy
cb0886ecf2 [DeviceIndex][4/N] Use DeviceIndex in more places (#119741)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119741
Approved by: https://github.com/aaronenyeshi, https://github.com/ezyang
2024-02-14 00:29:10 +00:00
suo
b2e779868f make internal lintrunner mypy clean (#119840)
as title

Differential Revision: [D53732505](https://our.internmc.facebook.com/intern/diff/D53732505/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119840
Approved by: https://github.com/ezyang
2024-02-14 00:25:42 +00:00
507db17675 Update HF pin (#119717)
Sometime between now and the previous pin update, HF introduced a
ModelOutputs type, which was not pytree serializable, causing
aot_compile to fail on new HF models
(https://fb.workplace.com/groups/1075192433118967/permalink/1377977852840422/).
With https://github.com/huggingface/transformers/pull/27871, we
can now pytree serialize HF ModelOutputs types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119717
Approved by: https://github.com/desertfire
2024-02-14 00:17:16 +00:00
b51e0246b7 sccache version update (#119554)
Fixes #37928

`sccache` is updated to a newer version (`v0.7.4`) to fix the `multiple input files` non-cacheable calls for `CUDA` builds.

This should make `Cache hits (CUDA)`  work as expected and improve the speed dramatically.

---

Additional information:

- Modified `install_sccache.bat` check structure due to GitHub Action error `Process completed with exit code 255.`
    - The error occurs when the freshly downloaded `sccache` is called with `--show-stats` or `--start-server` arguments within the script
    - Now, the script checks for the file's existence and kills/deletes the executable before the download

- Removed `sccache-cl` since it is no longer needed with newer versions of `sccache`

---

`win-vs2019-cpu-py3 / build` - `16m 27s`

![image](https://github.com/pytorch/pytorch/assets/148207261/b5628e6c-64bb-4293-9d07-480f56df44f1)

`win-vs2019-cuda11.8-py3 / build` - `17m 4s` **(previously ~45 mins - 1h30mins)**

![image](https://github.com/pytorch/pytorch/assets/148207261/e4ab01cb-0f56-41e8-984f-110e643b9c09)

Now `Cache Hits (CUDA)` hits all `304` objects and the `Non-cacheable reasons` error is fixed.

![image](https://github.com/pytorch/pytorch/assets/148207261/c8c25d2e-3fc1-4edb-8982-99c1f490cb54)

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119554
Approved by: https://github.com/malfet
2024-02-13 23:50:40 +00:00
be35fc9ea7 Size oblivious test for slice optimization (#119625)
Fixes https://github.com/pytorch/pytorch/issues/119623

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119625
Approved by: https://github.com/albanD
2024-02-13 23:47:52 +00:00
d81d5f52d5 [FSDP2][ez] Replaced groupby with all for same-dtype check (#119825)
The `groupby` logic to check if all all-gather inputs have the same dtype is not so readable. Let us use `all` instead.
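
For illustration, a hedged before/after sketch of this pattern (the variable name and tensors are made up):
```python
import torch
from itertools import groupby

all_gather_inputs = [torch.empty(2, dtype=torch.bfloat16) for _ in range(3)]

# Before: "all dtypes equal" expressed via groupby, which is hard to read.
same_dtype = len(list(groupby(t.dtype for t in all_gather_inputs))) <= 1

# After: the same check via all(), comparing against the first dtype.
first_dtype = all_gather_inputs[0].dtype
same_dtype = all(t.dtype == first_dtype for t in all_gather_inputs)
```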

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119825
Approved by: https://github.com/Skylion007
ghstack dependencies: #119550, #118136, #118223, #118755
2024-02-13 23:28:53 +00:00
cf117e37d5 Refactor THPStorage_resize_ (#119671)
Moving code around to allow it to be reused in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119671
Approved by: https://github.com/yf225
ghstack dependencies: #119647
2024-02-13 23:28:47 +00:00
ca777fbbb7 Add Accelerator device and shell hooks (#119329)
This adds a concept of Accelerator that points to one of our devices. See DeviceAccelerator.h in this PR for details https://github.com/pytorch/pytorch/pull/119329/files#diff-83cc748bed5df1a453c272cc5ecc7e572d4eb694c5125384d8fbd17a0b5f50c8
It also adds scaffolding for shared C++ API to allow generic feature implementation. This PR in particular updates the autograd engine to use this generic API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119329
Approved by: https://github.com/ezyang, https://github.com/huydhn
2024-02-13 23:15:24 +00:00
e9b78f2db0 Rewrite group_batch_fusion.find_independent_subset_greedy() to be iterative. (#118324)
Improve the performance of inductor when searching large graphs for potential fusions.
Also adds some direct unit tests of find_independent_subset_greedy() to ensure that the rewrite didn't break behavior.

Fixes #98467

Previously find_independent_subset_greedy() was recursive and the example from the issue would cause it to blow out the stack. This changes it to be iterative and also caches some of the computed dependencies (it can't cache all of them because the caller is allowed to change the graph during the iteration).

Fusion is still slow, but it at least finishes.

After this change the example given in #98467 has the following backend timings (on one particular CPU):
eager timing: 3m:23s
aot_eager timing: 4m:12s
inductor timing: 22m:24s

Possible future work to improve this further:
1. In dynamo limit the amount of inlining allowed before falling back to a graph break. This test ends up tracing through 483k bytecodes generating the graph.
2. In inductor have a limit so we don't exhaustively search the graph for fusion possibilities.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118324
Approved by: https://github.com/oulgen
2024-02-13 22:54:53 +00:00
ba1eb0e27f [ROCm] upgrade CI to 6.0 (#119495)
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119495
Approved by: https://github.com/huydhn
2024-02-13 22:39:03 +00:00
df9b44436a [ROCm] Enable float16/complex32 fft tests on ROCm (#117296)
This PR is to enable float16/complex32 fft tests on ROCm.
Sample results are attached here:
[test_spectral_ops_results.log](https://github.com/pytorch/pytorch/files/13908533/test_spectral_ops_results.log)

test_decomp::TestDecompCUDA::test_comprehensive_fft*
test_decomp::TestDecompCUDA::test_quick_fft*
test_jit_fuser_te::TestNNCOpInfoCUDA::test_nnc_correctness_fft*
test_meta::TestMetaCUDA::test_dispatch_meta_inplace_fft*
test_meta::TestMetaCUDA::test_dispatch_meta_outplace_fft*
test_meta::TestMetaCUDA::test_dispatch_symbolic_meta_inplace_fft*
test_meta::TestMetaCUDA::test_dispatch_symbolic_meta_outplace_fft*
test_meta::TestMetaCUDA::test_meta_inplace_fft*
test_meta::TestMetaCUDA::test_meta_outplace_fft*
test_ops::TestCommonCUDA::test_complex_half_reference_testing_fft*
test_ops::TestCommonCUDA::test_python_ref__refs_fft*
test_ops::TestCommonCUDA::test_python_ref_executor__refs_fft*
test_ops::TestCommonCUDA::test_python_ref_meta__refs*
test_ops::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft*
test_schema_check::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft*
test_spectral_ops::TestFFTCUDA::test_empty_fft__refs_fft*
test_spectral_ops::TestFFTCUDA::test_empty_fft_fft*
test_spectral_ops::TestFFTCUDA::test_fft_half_and_chalf_not_power_of_two_error__refs_fft*
test_spectral_ops::TestFFTCUDA::test_fft_half_and_chalf_not_power_of_two_error_fft*
test_spectral_ops::TestFFTCUDA::test_fft_round_trip_cuda*
test_spectral_ops::TestFFTCUDA::test_fft_type_promotion_cuda*
test_spectral_ops::TestFFTCUDA::test_fftn_round_trip_cuda*
test_spectral_ops::TestFFTCUDA::test_hfftn_cuda_float16
test_spectral_ops::TestFFTCUDA::test_ihfftn_cuda_float16
test_utils::TestDeviceUtilsCUDA::test_device_mode_ops_fft

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117296
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-02-13 22:35:32 +00:00
63d64c8995 [MPS] Enable more bfloat16 ops (#119738)
Introduce a convenience inlinable `mps::supportedFloatingType` function
that returns true if the type is Float, Half, or BFloat16.

Test by running LLM inference using bfloat16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119738
Approved by: https://github.com/Skylion007
2024-02-13 22:11:00 +00:00
eb9a3383c2 [MPS] Add naive std_mean implementation (#119777)
By just calling `std_mps` and `mean` in sequence

Move the `var_mean` decomp to `ReduceOps.mm`, as it should be faster to skip dispatching to Python, which one can validate by running the following script:
```python
from timeit import default_timer

import torch
from torch.utils.benchmark import Measurement, Timer

def bench_var_mean(
    m, n, k,
    dtype = torch.float32,
    device:str = "cpu",
) -> Measurement:
    setup = f"""
     x = torch.rand({m}, {n}, {k}, dtype={dtype}, device="{device}")
    """

    t = Timer(
        stmt="torch.var_mean(x, dim=1)", setup=setup, language="python", timer=default_timer
    )
    return t.blocked_autorange()

for x in [100, 1000]:
    rc = bench_var_mean(1000, x, 100, device="mps")
    print(f"{x:5} : {rc.mean*1e6:.2f} usec")
```
which before the change reports 681 and 1268 usec, and after the change 668 and 684 usec (which probably means that the GPU is not saturated, but the overhead from switching between native and interpreted runtimes is smaller).

Fixes https://github.com/pytorch/pytorch/issues/119663

TODOs:
 - Refactor the codebase and implement proper composite function (that must be faster)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119777
Approved by: https://github.com/albanD
2024-02-13 21:51:29 +00:00
ee5b59dd4b [ROCm] CatArrayBatchedCopy performance improvement (#118685)
Tune the grid and block sizes for ROCm.  Add a contig kernel separate from aligned+contig.

Verified new performance using pytorch/benchmarks/operator_benchmark.

`python -m pt.cat_test --device=cuda --tag-filter all`

On MI200 this improved performance by 4% on average, and on MI300 by 14%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118685
Approved by: https://github.com/malfet
2024-02-13 21:51:20 +00:00
6665b96ebb Rewrite maybe_reduce more carefully for unbacked SymInt (#119562)
Fixes https://github.com/pytorch/pytorch/issues/119476

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119562
Approved by: https://github.com/albanD
ghstack dependencies: #119559
2024-02-13 21:40:06 +00:00
28f299a870 [c10d] Fix compilation of NCCL_EXP path (#119805)
Fixes issue pointed out in https://github.com/pytorch/pytorch/pull/119421#issuecomment-1941694621

When refactoring ProcessGroupNCCL, some code in the NCCL_EXP path wasn't done cleanly.

Cc: @kunalb @H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119805
Approved by: https://github.com/H-Huang
2024-02-13 21:26:59 +00:00
f9200c8608 [BE][Ez]: FURB129: remove unneeded readlines() (#119796)
Applies a refurb rule to remove any `readlines()` call in a for-loop iteration, as it just creates a temporary list in memory.
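
For illustration, the general shape of this cleanup (file name arbitrary):
```python
# Before: readlines() materializes the whole file as a list just to iterate over it.
with open("log.txt") as f:
    for line in f.readlines():
        print(line.rstrip())

# After: iterating the file object directly streams lines lazily.
with open("log.txt") as f:
    for line in f:
        print(line.rstrip())
```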

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119796
Approved by: https://github.com/ezyang
2024-02-13 21:21:22 +00:00
3319dbcd23 Update vmap guard to avoid recompilations (#119061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119061
Approved by: https://github.com/zou3519
2024-02-13 20:50:23 +00:00
abadbbc4b0 [c10d][flight recorder] remove unintended assignment of entry (#119748)
Summary:
```cpp
auto& entry = entries_.at(*id % max_entries_);
entry = entries_.at(*id % max_entries_);
```
The above lines of code have the unintended consequence of invoking copy/assignment
of entry objects, as the ref itself cannot be re-assigned.

Also, what could cause the crash is that the entry ref could become invalid if entries_ is
resized by other threads, and this could result in a 'copy to a garbage
location'. The fix is to use a pointer, which can be re-assigned after
re-acquiring the lock.

Tests: python test/distributed/test_c10d_nccl.py NCCLTraceTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119748
Approved by: https://github.com/wconstab, https://github.com/fegin
2024-02-13 20:18:58 +00:00
34638c82a6 [mergebot] No unique behavior for facebook bot re pending jobs (#119735)
If the FB bot says merge without -f, follow the normal behavior and wait for pending checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119735
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2024-02-13 20:07:24 +00:00
8ec3d8e35f Fixed FxGraphDrawer compat constructor (#119767)
Match FxGraphDrawer compat constructor signature to avoid the following failure when `pydot` is not installed:
```
  File "/pytorch/torch/_functorch/partitioners.py", line 933, in draw_graph
    g = graph_drawer.FxGraphDrawer(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: __init__() got an unexpected keyword argument 'dot_graph_shape'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119767
Approved by: https://github.com/eellison
2024-02-13 19:36:01 +00:00
8ec8d78ef2 [quant][pt2e][be] Rename eval_utils -> export_utils (#119725)
It's not really eval_utils anymore, since we added some training-related
utils. Instead, it should hold util functions that are
related to general export use cases.

Differential Revision: [D53711494](https://our.internmc.facebook.com/intern/diff/D53711494)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119725
Approved by: https://github.com/tugsbayasgalan
2024-02-13 19:10:06 +00:00
0a2e000edf [BE] Enabled mypy in common_fsdp.py (#118755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118755
Approved by: https://github.com/Skylion007, https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119550, #118136, #118223
2024-02-13 19:05:30 +00:00
8c1480f568 [FSDP2] Added mixed precision (#118223)
This PR adds mixed precision configured via `MixedPrecisionPolicy`.
- By default (`cast_forward_inputs=True`), each FSDP module will cast forward floating-point input tensors to `param_dtype` if specified. If the user wants to own the cast, then the user can disable it by passing `False`.
- Symmetrically, by default (`output_dtype=None`) each FSDP module will not cast the forward output. If the user wants to customize the output dtype, then the user can pass a `torch.dtype`.
- `param_dtype` configures the unsharded parameters' dtype for forward/backward computation and hence the all-gather dtype.
- `reduce_dtype` configures the gradient reduction dtype. If `reduce_dtype=None` and `param_dtype is not None`, then `reduce_dtype` inherits from `param_dtype` for simplicity.

We test against a manually implemented reference implementation instead of comparing against existing FSDP since the comparison is more direct to what we want to test.
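
A hedged usage sketch of the policy (the import path and `mp_policy` keyword reflect the current per-parameter FSDP prototype and are assumptions here; the model and dtypes are arbitrary):
```python
import torch
import torch.nn as nn
from torch.distributed._composable.fsdp import MixedPrecisionPolicy, fully_shard

# Assumes torch.distributed is already initialized (e.g. via init_process_group).
model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))
policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
for layer in model:
    fully_shard(layer, mp_policy=policy)
fully_shard(model, mp_policy=policy)
```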

---

**Overhead benchmarks to inform design**
The dilemma is as follows:
- The common path for FSDP is bf16 parameter mixed precision, where we cast sharded parameters from fp32 to bf16 before all-gathering them.
- The baseline implementation is to `torch._foreach_copy_` the sharded parameters to the flat `all_gather_input`, which gets passed to `dist.all_gather_into_tensor`.
    - The baseline incurs 1 extra fp32 read and 1 extra bf16 write per parameter because `_foreach_copy` takes the slow path, calling `copy_` in a loop, and `copy_` calls `dst.copy_(src.to(bf16))` where `dst` is bf16 and `src` is fp32.
    - These `copy_` calls stay in C++ and do not require calling `at::as_strided`.
- The issue with this baseline implementation is that it requires knowing that all parameters in the group will be cast from fp32 to bf16 to do this `_foreach_copy_` from fp32 sources to a bf16 destination.
- We want per-parameter FSDP to support mixed dtype all-gathers, which would involve different parameters providing different dtype all-gather inputs and viewing them as uint8 for a combined flat all-gather input, where this viewing-as-uint8 step is only needed in the mixed dtype case.
- However, this incurs more CPU overhead, so we want to investigate this in more detail.

We consider 150 `nn.Parameter`s with shapes taken from an internal model (where the shapes only affect the copy bandwidth, not the CPU overhead). We focus on world size 128 first. We consider two experiments: (1) run the copy-in with no head start, allowing CPU boundedness to affect GPU time, and (2) run the copy-in with a CPU head start, removing CPU overhead from affecting GPU time.

No head start:
- Baseline `torch._foreach_copy_`: 0.525 ms CPU; 0.528 ms GPU
- `.to(bf16)` before `torch._foreach_copy_`: 0.828 ms CPU; 0.836 ms GPU
- `.to(bf16).view(uint8)` before `torch._foreach_copy_`: 0.933 ms CPU; 0.937 ms GPU

Head start (removing CPU boundedness from GPU times):
- Baseline `torch._foreach_copy_`: 0.393 ms GPU
- `.to(bf16)` before `torch._foreach_copy_`: 0.403 ms GPU
- `.to(bf16).view(uint8)` before `torch._foreach_copy_`: 0.403 ms GPU

Some other interesting notes:
- Constructing a set of all all-gather input dtypes: ~0.015 ms -- this would be the overhead cost of checking whether we need to view as uint8 (i.e. whether we have mixed dtype); alternatively, we could always view as uint8 (but that loses the mixed precision policy info from the profiler trace)
- Changing from `[t.to(bf16).view(uint8) for t in ts]` to two list comprehensions like `[t.to(bf16) for t in ts]; [t.view(uint8) for t in ts]` actually reduces CPU overhead 🤔  (by ~0.04 ms)

We see that the main difference is just CPU overhead. The GPU times are almost the same. (Actually, sweeping over world sizes of 8, 16, 32, and 64, we do see a difference in GPU time inversely proportional to world size, as expected since smaller world sizes copy more data. However, even at world size 8, the difference is only 0.407 ms vs. 0.445 ms GPU time.) Note though that the CPU overhead differences are exacerbated when the PyTorch profiler is turned on, and how much so seems to depend on the CPU capability.

Seeing these numbers, I am inclined to just incur the CPU overhead, especially given that if we want to support the mixed dtype case for fp8 all-gather, we will need to incur this anyway. If the CPU overhead becomes a problem on a real workload, then we will need to figure out options at that point, one possibility being `torch.compile`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118223
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119550, #118136
2024-02-13 19:05:30 +00:00
3956ce01e0 [FSDP2] Added autograd/memory/overlap/frozen/2D/AC tests (#118136)
This PR adds tests for autograd (mainly backward hooks), memory, overlap, and frozen parameters.
- Autograd: unused forward output, unused forward module, non-tensor activations (common in internal models)
- Memory: expected GPU memory usage after init, forward, backward, and optimizer step
- Overlap: communication/computation overlap in forward and backward
- Frozen: expected reduce-scatter size, training parity

This PR adds some initial 2D (FSDP + TP) training and model state dict tests. The only change required for model sharded state dict is to make sure parameters are sharded before save and load.

This PR adds tests that `fully_shard` can use `torch.utils.checkpoint`, `_composable.checkpoint`, and `CheckpointWrapper` on a transformer.

(I squashed all of these into one PR now to save CI cost.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118136
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119550
2024-02-13 19:05:30 +00:00
39c68efd85 [dynamo] Capture untyped_storage().resize_() (#119647)
This makes storage resizing work with `backend=eager`, the next two PRs make it work for inductor.
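For context, the eager-mode operation being captured looks like the following (a minimal standalone sketch, not the PR's test):

```python
import torch

x = torch.randn(4)
x.untyped_storage().resize_(0)        # free the underlying storage in place
print(x.untyped_storage().nbytes())   # 0
# After this change, a function containing such a resize_() call can be traced
# by dynamo with torch.compile(backend="eager") rather than falling back.
```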

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119647
Approved by: https://github.com/yf225
2024-02-13 19:03:28 +00:00
c0e5cca4f8 [DDP] Change the --no-optimize-ddp flag to reflect the latest usage (#119437)
Compiled DDP now has 4 different optimization modes. This PR changes the Dynamo benchmark flag to reflect that change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119437
Approved by: https://github.com/wconstab, https://github.com/xmfan
2024-02-13 16:53:56 +00:00
c2522554dd Prevent DCE'ing unbacked SymInt for view outputs (#119552)
Fixes https://github.com/pytorch/pytorch/issues/119414

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119552
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-02-13 16:32:21 +00:00
52de407b6c Avoid performing replacements when it would unrefine ranges (#117356)
Fixes https://github.com/pytorch/pytorch/issues/117268; check this issue for background.

This PR does the following:

* Do not perform a replacement if the expression we're replacing the symbol with has a less refined value range than the original. There's a little bit of trickiness around the handling for values close to INT64_MAX; when checking if a range refines another, I *only* consider the range representable in 64-bit integers. This is enough to prevent us from doing a substitution like `i0 = 10 - i1`, but it appears to still let us do the other substitutions we like, such as `i0 = i1` or `i0 = 12 * i1`
* The test above is order dependent: if we assert an equality BEFORE we have refined a range, we might be willing to do the replacement because there isn't a meaningful range. This means that it's important to mark things as sizes, before you start doing other error checking. `split_with_sizes` is adjusted accordingly. It would be good to raise an error if you get the ordering wrong, but I leave this to future work.
* It turns out this is not enough to fix AOTAutograd, because we lose the size-ness of unbacked SymInts when AOTAutograd retraces the Dynamo graph. So update deferred runtime assert insertion to also insert size-ness and value ranges annotations. Note that, in principle, it shouldn't be necessary to explicitly do the latter; these should just show up as deferred runtime asserts. That's some extra refactoring for a later day.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117356
Approved by: https://github.com/lezcano
2024-02-13 15:56:59 +00:00
0fd371c868 fix torch.cumsum docs (#117944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117944
Approved by: https://github.com/zou3519
2024-02-13 15:29:06 +00:00
c2a835d710 [inductor] Refactor device guard Python codegen to allow nested indentation (#119673)
Summary: The codegen of `with torch.cuda._DeviceGuard` context manager in the Python wrapper code is implemented via `device_cm_stack: contextlib.ExitStack()`. As the context managers in the stack are `code.indent()`, this means that the whole stack is unindented at once on `device_cm_stack.close()`. This becomes problematic when attempting to codegen indented code (e.g., for control flow in Python and / or nested subgraph codegen-ing).

In this PR, we refactor the device guard codegen-ing in Python by replacing the `device_cm_stack` by explicit indent and unindent calls for entering and exiting the `with torch.cuda._DeviceGuard` context manager. This allows for nested device guard context managers and better aligns with other indented codegen-ing intertwined with it (e.g., for nested subgraph codegen-ing).

This is necessary for the upcoming support for `torch.cond` (and other control flow operators) in Inductor. Before that, the only change in the Python wrapper codegen is that the `return outputs` is now happening outside the `with torch.cuda._DeviceGuard` context manager.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119673
Approved by: https://github.com/peterbell10
2024-02-13 15:05:30 +00:00
f4b5f710e8 Fix typo in private attr of inference_mode (#119167)
This PR amends #102642.

`torch.inference_mode`'s attribute to store the actual context is inconsistent between `__init__` and `__enter__`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119167
Approved by: https://github.com/albanD
2024-02-13 14:59:59 +00:00
3629287151 Implement analysis for for-loops (#119730)
This PR adds support for for-loop parsing and analysis. While doing so, I ran into some constant value and function name problems so I fixed them as well. Technically, it should be possible to break this into multiple PRs but since these are small, I'm bundling them together.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119730
Approved by: https://github.com/aakhundov
2024-02-13 09:02:53 +00:00
2ae655b4f1 caffe2: remove support for specifically running "flaky tests" (#112007)
Summary:
In March 2019 D14468816 introduced some infra to mark tests as flaky
while still running them. In July 2019 D15797371 removed the last use of this
feature. Remove the related code as well.

Test Plan: ci

Reviewed By: mlogachev

Differential Revision: D50601204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112007
Approved by: https://github.com/malfet
2024-02-13 07:56:37 +00:00
60148f1761 [EZ] Set maximum supported version of Python as 3.12 (#119743)
Doesn't really affect anything other than metadata on PyPI website
Otherwise the programming languages tab on https://pypi.org/project/torch/2.2.0/ shows supported versions 3.8 to 3.10:
<img width="239" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/e17f9982-8833-4cd8-b8d8-b2f1cb538548">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119743
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-02-13 06:56:32 +00:00
beb0041384 improve cuda graph symint logging msg (#119739)
Users were confused by `recording cudagraph tree for None`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119739
Approved by: https://github.com/mlazos
2024-02-13 06:26:36 +00:00
bfb9ea1a43 fix compile DTensor.from_local in trace_rule_look up (#119659)
There's a bug in the conversion from TorchVariable to trace rule look-ups:
in some corner cases, `DTensor.from_local` calls do not match the trace
name rule look-up, resulting in a `None` look-up and a fallback to the
`UserFunctionVariable`, which makes the tracing silently wrong by tracing
into the `DTensor.from_local` function. It is not exactly clear yet why the
look-up fails.

This PR fixes the `DTensor.from_local` tracing to make sure that in every case
we hit the `InGraphFunctionVariable`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119659
Approved by: https://github.com/yifuwang
2024-02-13 05:21:19 +00:00
379183a0dd Skip log line if no tensors were dedupped (#119742)
Skips log line if nothing was dedupped. Avoids unhelpful logs like:
```
2024-02-13 01:31:52,113 _dedup_tensors.py:46 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119742
Approved by: https://github.com/Skylion007
2024-02-13 05:18:16 +00:00
a4c476a081 [BE] Use more GTest primitives in XPU unit tests (#119527)
# Motivation
Use `EXPECT_EQ` to refine XPU's UT when relying on gtest.

# Solution
use `EXPECT_EQ` directly instead of `ASSERT_EQ_XPU`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119527
Approved by: https://github.com/malfet
2024-02-13 05:18:03 +00:00
cyy
47a2e6b6b8 Fix C++20 build (#112333)
Currently the C++20 build fails because of incorrect template initialization order. This PR adjusts the order of these classes and a constructor to address the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112333
Approved by: https://github.com/albanD
2024-02-13 05:10:19 +00:00
2bda6b4cb8 [DTensor] Only wait on AsyncCollectiveTensor after DTensor-based state dict loading (#119716)
Summary:
This PR serves as a follow-up fix to address numerical correctness concerns identified in PR #118197, and we should only wait on `AsyncCollectiveTensor`.

Without the change, we occasionally ran into exception: `AttributeError("'Tensor' object has no attribute 'wait'")`

Test Plan:
**CI**:
Wait for the CI test

**Test with prod model**:
- Tested with models and no longer ran into the exception after checkpoint loading.

Differential Revision: D53680406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119716
Approved by: https://github.com/fegin, https://github.com/Skylion007, https://github.com/wz337
2024-02-13 04:30:45 +00:00
2502a01110 Linear-BN Fusion: add precondition check (#119264)
Fixes #118990

The root cause is that the `out_features` of Linear does not match the `num_features` of BatchNorm, resulting in a shape mismatch while computing `fused_w` and `fused_b`. This can happen for linear-bn folding because the linear layer operates over the last dim, `(*, H_in)`, while the bn layer operates over the channel dim, `(N, C_in, H, W)`.

To preserve the shapes of the original linear weight and bias in linear-bn folding, check that the linear `out_features` matches the bn `num_features`. If they don't match, the bn `num_features` needs to be 1 to broadcast.
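A minimal sketch of that precondition (illustrative helper, not the PR's code):

```python
import torch.nn as nn

def can_fold(linear: nn.Linear, bn: nn.modules.batchnorm._BatchNorm) -> bool:
    # Folding is shape-safe when the normalized dim lines up with the linear
    # output dim, or when num_features == 1 so the BN parameters broadcast.
    return bn.num_features in (linear.out_features, 1)

print(can_fold(nn.Linear(8, 16), nn.BatchNorm1d(16)))  # True
print(can_fold(nn.Linear(8, 16), nn.BatchNorm1d(4)))   # False
```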

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119264
Approved by: https://github.com/eellison
2024-02-13 04:16:34 +00:00
15ef52a015 [MPS] Enable conj and conj_physical (#119669)
The former is only available on MacOS 14+, but at least on older MacOS versions it will raise an exception rather than returning a non-conjugated tensor.

Preliminary step for enabling FFT ops (without it `ifft` would never work)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119669
Approved by: https://github.com/albanD
ghstack dependencies: #119681
2024-02-13 02:27:51 +00:00
214f06ae3a Revert "Add Accelerator device and shell hooks (#119329)"
This reverts commit 4b9568a360c4a90220e78e43435be8c56bc33fb2.

Reverted https://github.com/pytorch/pytorch/pull/119329 on behalf of https://github.com/huydhn due to Breaks internal build and requires OSS file update to fix it ([comment](https://github.com/pytorch/pytorch/pull/119329#issuecomment-1940278598))
2024-02-13 02:23:45 +00:00
7d4b666870 Revert "[BE] Properly mark destructor overrides (#119656)"
This reverts commit 069581b3ca354c3b34079d23bc237442d6f28cc3.

Reverted https://github.com/pytorch/pytorch/pull/119656 on behalf of https://github.com/huydhn due to I need to revert this to unblock the revert of https://github.com/pytorch/pytorch/pull/119329#issuecomment-1939637967 and will reland this after resolving the conflicts ([comment](https://github.com/pytorch/pytorch/pull/119656#issuecomment-1940270997))
2024-02-13 02:20:45 +00:00
2921c2b3d9 [mypy] refactor mkldnn_fusion._is_valid_binary to avoid [union-attr] has no attribute (#119085)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119085
Approved by: https://github.com/Skylion007
2024-02-13 02:13:46 +00:00
db228f1efd [Lint] replace [assigment] with [method-assign] for methods (#119706)
Started with the TODO fix from here: https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_utils.py#L746
using `ignore[method-assign]` instead of `ignore[assignment]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119706
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kit1980
2024-02-13 02:06:04 +00:00
9f8c84a399 Revert "Add missing include for internal build (#119721)"
This reverts commit e0cabebad94f1cf35742f8ec14f9938be3a195ab.

Reverted https://github.com/pytorch/pytorch/pull/119721 on behalf of https://github.com/huydhn due to This fixes the build failures but there is still an issue with the missing libcaffe2_torch_fb_sparsenn_sparsenn_operators_gpu.so on D53686094 ([comment](https://github.com/pytorch/pytorch/pull/119721#issuecomment-1940191340))
2024-02-13 01:56:12 +00:00
ea8e4fd5ac Support FunctoolsPartialVariable::get_function, fix NamedTupleVariable::as_proxy and handle call_function in get_fake_values_from_nodes (#119435)
partially address https://github.com/pytorch/pytorch/issues/118785
This diff fixes three things:
1. Add `get_function` to `FunctoolsPartialVariable`. Note that it will be available only if all args are constant; otherwise, it would throw unimplemented in the call to `asPythonConstant`.

2. `NamedTupleVariable` takes its args dispatched rather than as a list, e.g. `NamedTuple(a, b, c)` vs `NamedTuple([a, b, c])`; fix that by specializing `asProxy`.

3. A call to `create_arg` from within `create_proxy` changes a Python NamedTuple to a function-call node without associating an example value! Updated `get_fake_values_from_nodes` to handle such a case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119435
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #119314
2024-02-13 01:44:08 +00:00
74d55b0e63 [dynamo] Support torch.distributed.fsdp._flat_param._same_storage_size (#119627)
Replaces #117690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119627
Approved by: https://github.com/Skylion007
2024-02-13 01:27:37 +00:00
472500e32a Revert "Avoid performing replacements when it would unrefine ranges (#117356)"
This reverts commit 0e6b314fc2e7c965717e939a4e457a9b9d7e133e.

Reverted https://github.com/pytorch/pytorch/pull/117356 on behalf of https://github.com/huydhn due to Sorry for reverting the change but it looks like the forward fix still needs more work https://github.com/pytorch/pytorch/pull/119712, so it would be cleaner to reland them ([comment](https://github.com/pytorch/pytorch/pull/117356#issuecomment-1940032407))
2024-02-13 01:16:58 +00:00
2492f8748e Revert "Improve TORCHDYNAMO_EXTENDED_DEBUG for GuardOnDataDependentSymNode (#119412)"
This reverts commit f208795182a22ebaef84a284750669fa372157cb.

Reverted https://github.com/pytorch/pytorch/pull/119412 on behalf of https://github.com/huydhn due to Sorry for reverting the change but it looks like the forward fix still needs more work https://github.com/pytorch/pytorch/pull/119712, so it would be cleaner to reland them ([comment](https://github.com/pytorch/pytorch/pull/119412#issuecomment-1939937937))
2024-02-13 00:52:19 +00:00
830ed6d9b2 [quant][pt2] Fix _disallow_eval_train error message (#119694)
Fix the message to use the right function name.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119694
Approved by: https://github.com/tugsbayasgalan
2024-02-13 00:17:53 +00:00
55483fc2c9 Min-cut partitioner always saves tensors that are returned as-is in backward (#114970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114970
Approved by: https://github.com/Chillee
2024-02-13 00:04:41 +00:00
bd9db6a9c7 Update to TorchFix 0.4.0 (#119424)
`torch.library.Library` updated to `torch.library._scoped_library` in files with many tests where it seems obvious to do, otherwise `noqa: TOR901` added - see https://github.com/pytorch/pytorch/pull/118318 for more context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119424
Approved by: https://github.com/zou3519
2024-02-12 23:30:12 +00:00
5acd1f0f7d Add cherry-pick workflow (#119352)
After https://github.com/pytorch/test-infra/pull/4758, we can create a new workflow on PyTorch to receive `try-cherry-pick` dispatch event from the bot, and create the cherry pick PR.

* [x] Cherry pick a PR after it has been landed and create a cherry pick PR to the target release branch.
* [ ] The second part after this is to update the release tracker with the info.  This will be done in a subsequent PR.
* [ ] ghstack is not yet supported
* [ ] Cherry-picking a reverted commit is not yet supported (from @kit1980's comment)

### Testing

The script can be used locally:

```
python cherry_pick.py --onto release/2.2 --classification release --github-actor huydhn 118907
The cherry pick PR is at https://github.com/pytorch/pytorch/pull/119351
```

The test cherry pick PR is created at https://github.com/pytorch/pytorch/pull/119351

Unit testing this on CI is tricky, so I test this out on canary instead.

* https://github.com/pytorch/pytorch-canary/pull/193#issuecomment-1933162707 creates the PR at https://github.com/pytorch/pytorch-canary/pull/201
  * One more test on canary with the new token https://github.com/pytorch/pytorch-canary/pull/193#issuecomment-1933229483.  The minimum required permission from what I see is `workflow`
* Cherry picking conflicts could happen and needs to be handled manually https://github.com/pytorch/pytorch-canary/pull/194#issuecomment-1933142975
* ~Require a linked issue when cherry picking regressions, critical fixes, or fixing new features https://github.com/pytorch/pytorch-canary/pull/193#issuecomment-1933174520~ Relax this requirement to a suggestion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119352
Approved by: https://github.com/atalman
2024-02-12 23:12:10 +00:00
suo
f15b517055 [export] suppress type error (#119720)
Differential Revision: [D53681243](https://our.internmc.facebook.com/intern/diff/D53681243/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119720
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-02-12 22:54:36 +00:00
b3df3e4e94 Restore OpInfo/ModuleInfo tests in Inductor-wrapped tests (#119693)
I accidentally disabled this without realizing it. It turns out that
PYTORCH_TEST_WITH_INDUCTOR=1 implies PYTORCH_TEST_WITH_DYNAMO=1, which
activates skipIfTorchDynamo decorators.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119693
Approved by: https://github.com/bdhirsh
2024-02-12 22:44:45 +00:00
e0cabebad9 Add missing include for internal build (#119721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119721
Approved by: https://github.com/huydhn
2024-02-12 22:36:16 +00:00
70c93c6097 [inductor] Update JIT Inductor cpp wrapper entry function signature (#119280)
Summary: Change JIT Inductor cpp wrapper entry function to use similar signature as AOTInductor, i.e. using an array of AtenTensorHandle instead of a vector of at::Tensor as the inputs and return output through a pointer. This makes it easier to consolidate the ABI compatible and non-compatible modes.

Differential Revision: [D53478825](https://our.internmc.facebook.com/intern/diff/D53478825)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119280
Approved by: https://github.com/chenyang78
2024-02-12 22:24:35 +00:00
02b60e76c9 make flash_attn_bw impl correct w.r.t. meta when k and v have different strides (#119500)
`dv = at::empty_like(k)` and `dv = at::empty_like(v)` can be materially different, because `empty_like` tries to preserve the strides of the input when possible. So if `k` is contiguous, but `v`, is transposed, then before this PR, `dv` would be computed to be contiguous.
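For illustration, `empty_like` preserves the input's layout, which is why the two calls differ (a small standalone example, not the kernel code):

```python
import torch

k = torch.randn(2, 16, 513, 64)                   # contiguous "key"
v = torch.randn(2, 513, 16, 64).transpose(1, 2)   # transposed "value"

print(torch.empty_like(k).is_contiguous())  # True
print(torch.empty_like(v).is_contiguous())  # False: strides are preserved
```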

Alternatively, we could change the meta implementation of `aten._scaled_dot_product_flash_attention` to this:
```
    grad_q = torch.empty_like(query.transpose(1, 2)).transpose(1, 2)
    grad_k = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
    grad_v = torch.empty_like(value.transpose(1, 2)).transpose(1, 2)
    return grad_q, grad_k, grad_v
```

But (I think?) the logic in the sdpa backward impl was a typo.

I noticed this because changing the meta formula as above was enough to fix the issue with the `aot_eager` backend in this [link](https://github.com/pytorch/pytorch/issues/116935#issuecomment-1914310523).

A minimal repro that I made looks like this:
```
import torch

# in this repro, "grad_out" and "value" are transposed tensors,
# but "key" and "value" are contiguous
a = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
b = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
c = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
d = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
e = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
f = torch.randn(2, 16, 513, device='cuda')
g = None
h = None
i = 513
j = 513
k = 0.0
l = False
m = torch.tensor(1, dtype=torch.int64)
n = torch.tensor(1, dtype=torch.int64)

out1_ref, out2_ref, out3_ref = torch.ops.aten._scaled_dot_product_flash_attention_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)

from torch._meta_registrations import meta__scaled_dot_product_flash_backward
out1_test, out2_test, out3_test = meta__scaled_dot_product_flash_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)

# prints True True
print(out1_ref.is_contiguous())
print(out1_test.is_contiguous())

# prints True True
print(out2_ref.is_contiguous())
print(out2_test.is_contiguous())

# prints True False
print(out3_ref.is_contiguous())
print(out3_test.is_contiguous())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119500
Approved by: https://github.com/drisspg, https://github.com/ezyang, https://github.com/Skylion007
2024-02-12 22:12:29 +00:00
cyy
10789ccd83 Remove redundant CMake NUMA code (#119650)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119650
Approved by: https://github.com/ezyang
2024-02-12 21:53:44 +00:00
34a61c527b Revert "Enable x86 CPU vectorization on windows (#118980)"
This reverts commit 5f69d95b2b303382fe4cf301e73e36414c879c5c.

Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/huydhn due to This is breaking Window binary build https://github.com/pytorch/pytorch/actions/runs/7874475000/job/21484997298 where it failed to build sleef ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-1939619212))
2024-02-12 21:33:14 +00:00
cyy
10f3abc6b8 [DeviceIndex][3/N] Use DeviceIndex in more places (#119635)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119635
Approved by: https://github.com/ezyang
2024-02-12 21:31:27 +00:00
064b61009b Correctly formatting the example in get_state_dict (#119532)
This PR corrects the example formatting provided in https://pytorch.org/docs/stable/distributed.checkpoint.html. In that issue, @wz337 also commented that the return type was not showing up correctly. I didn't see any formatting issue there, but I could be wrong.

Fixes #118837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119532
Approved by: https://github.com/fegin
2024-02-12 21:28:22 +00:00
ad217d4266 [ez] Add try catch for deleting old branches (#119696)
I think some chars in branch names affect the api calls, so just assume they're protected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119696
Approved by: https://github.com/huydhn
2024-02-12 21:08:59 +00:00
7eecbf8a30 Remove unnecessary skipIfTorchDynamo from test_jit_fuser_te (#118728)
And add some expected failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118728
Approved by: https://github.com/bdhirsh
2024-02-12 20:55:29 +00:00
28c30f29be Update documentation for set_flush_denormal support on ARM (#119354)
**Documentation update for set_flush_denormal():**
-> set_flush_denormal() is now supported on ARM CPUs.
-> **PR:** https://github.com/pytorch/pytorch/pull/115184  (Already merged)

**Reference page:** https://pytorch.org/docs/stable/generated/torch.set_flush_denormal.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119354
Approved by: https://github.com/drisspg
2024-02-12 20:53:22 +00:00
7d780ff86f Revert "Enable fake tensor caching in fbcode by default (#118555)"
This reverts commit 0f2fbbff109cbc184a6a88247813dbcddaea2e5f.

Reverted https://github.com/pytorch/pytorch/pull/118555 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing one model test internally. Please take a look at the diff for more info D53189048 ([comment](https://github.com/pytorch/pytorch/pull/118555#issuecomment-1939550273))
2024-02-12 20:51:23 +00:00
110919c984 Check QNNPACK support for the platform before running test (#119139)
Do not run test ConstantPropagation.CustomClassesCanBePropagated on a platform where QNNPACK is not supported.

For example, this test fails on M1 Mac because QNNPACK is not supported on M1 Mac:
[----------] 1 test from ConstantPropagation
[ RUN      ] ConstantPropagation.CustomClassesCanBePropagated
unknown file: Failure
as described in more detail in issue #88613.

After the PR, test passes successfully as below:
[----------] 1 test from ConstantPropagation
[ RUN      ] ConstantPropagation.CustomClassesCanBePropagated
[       OK ] ConstantPropagation.CustomClassesCanBePropagated (0 ms)
[----------] 1 test from ConstantPropagation (0 ms total)

Fixes #88613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119139
Approved by: https://github.com/jcaip
2024-02-12 20:21:07 +00:00
7adfeba47a Add Python 3.12 as experimental to release 2.2 (#119705)
Add 3.12 as experimental version to Release 2.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119705
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2024-02-12 20:13:54 +00:00
suo
82248f0b1c [export] improve FakeTensor serialization (#119531)
Recently we made it possible to serialize ExportedPrograms with fake parameters/buffers/etc.

The serialization regime was kind of whacky; basically we serialized a stub and reassembled the FakeTensor using metadata that we had stashed elsewhere in the Graph state.

This was bad for a few reasons:
- Storing the metadata separately from the actual serialized object caused situations where you could have one but not the other. An example case is if you had a FakeTensor contained inside a TorchBind object—there was no obvious place to store the metadata for this. This actually happens—TensorQueue in fbgemm does this.
- It created an annoying cycle: we had to deserialize the Graph's tensor metadata in order to deserialize (potentially faked) constants, but we need constants in order to deserialize the Graph.

This fixes all that. The basic idea is to patch the reducer function for FakeTensor at serialization time, and serialize a copy of the FakeTensor metadata. We already are policing BC for the TensorMeta schema struct so it's not a net increase in the BC surface.

As a bonus, I fixed a weird bug with torchbind tracing where we were accidentally reinterpreting a torch.ScriptObject as a torch.ScriptModule (which was the root cause of some weird behavior @bahuang was seeing last week).

Differential Revision: [D53601251](https://our.internmc.facebook.com/intern/diff/D53601251/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119531
Approved by: https://github.com/zhxchen17
2024-02-12 19:28:08 +00:00
482345d747 Refactor out shape test into InputMetadata::maybe_reduce (#119559)
I'm going to gut this function shortly, and having it all on
InputMetadata is convenient for this purpose.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119559
Approved by: https://github.com/soulitzer
2024-02-12 19:27:48 +00:00
c24b74efc7 Revert "Optimize multi_tensor_apply (#119153)"
This reverts commit 24be7daf799ed94e1964e2ce440ccaad15962719.

Reverted https://github.com/pytorch/pytorch/pull/119153 on behalf of https://github.com/yifuwang due to This PR is breaking cuda graph for multi_tensor_apply ([comment](https://github.com/pytorch/pytorch/pull/119153#issuecomment-1939365823))
2024-02-12 19:11:29 +00:00
8d8fb9783c [MPS][EZ] Fix cfloat->chalf conversion on MacOS13 (#119681)
By using `view_as_real` when type casting between two complex types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119681
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-02-12 19:09:10 +00:00
eb0f9efd31 fix is_ and is_not (#118978)
Fix issue https://github.com/pytorch/pytorch/issues/118805

Note: this was a refresh PR of https://github.com/pytorch/pytorch/pull/118806
discussion there is relevant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118978
Approved by: https://github.com/lezcano
2024-02-12 19:04:40 +00:00
0e5b6594b7 [Dynamo] Minor cleanup of redundant function lookup logics (#119666)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119666
Approved by: https://github.com/angelayi
2024-02-12 19:00:39 +00:00
ed20e9118b Fixed hash issue in fx_graph_cse (#119567)
Description:
- Fixed issue with hash collision for `hash((primals_2, 1.0)) == hash((primals_2, 1))`
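A short illustration of why the two keys collide (plain Python behavior, not the CSE pass itself):

```python
x = object()  # stands in for the `primals_2` node

print(hash(1) == hash(1.0))  # True: numbers that compare equal hash equal
print((x, 1) == (x, 1.0))    # True: the (node, constant) tuples compare equal too
# So a token built only from (target, args) cannot tell sub(s, 1) apart from
# sub(s, 1.0), and CSE wrongly merges the two nodes unless the key also
# encodes the operand types.
```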

Repro code:
```python
import torch
from torch._functorch.compile_utils import fx_graph_cse

def func(inpt, osize):
    size = inpt.shape[-1]
    s1 = size - 1
    s2 = size - 1.0
    scale = s2 / (osize - 1.0)
    inpt = torch.clamp(inpt, 0, s1)
    return scale * inpt

gms = []
def toy_backend(gm, _):
    gms.append(gm)
    return gm.forward

torch._dynamo.reset()
fn = torch.compile(backend=toy_backend, dynamic=True)(func)
t = torch.rand(3, 100)
out = fn(t, 50)
gm = gms[0]

print(gm.graph)
new_fx_g = fx_graph_cse(gm.graph)
print(str(new_fx_g))
```
Original graph
```
graph():
    %s0 : torch.SymInt [num_users=0] = placeholder[target=s0]
    %s1 : torch.SymInt [num_users=0] = placeholder[target=s1]
    %l_inpt_ : torch.Tensor [num_users=2] = placeholder[target=L_inpt_]
    %l_osize_ : torch.SymInt [num_users=1] = placeholder[target=L_osize_]
    %size : [num_users=1] = call_method[target=size](args = (%l_inpt_,), kwargs = {})
    %getitem_1 : [num_users=2] = call_function[target=operator.getitem](args = (%size, 1), kwargs = {})
    %sub : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1), kwargs = {})
    %sub_1 : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1.0), kwargs = {})
    %sub_2 : [num_users=1] = call_function[target=operator.sub](args = (%l_osize_, 1.0), kwargs = {})
    %truediv : [num_users=1] = call_function[target=operator.truediv](args = (%sub_1, %sub_2), kwargs = {})
    %inpt : [num_users=1] = call_function[target=torch.clamp](args = (%l_inpt_, 0, %sub), kwargs = {})
    %mul : [num_users=1] = call_function[target=operator.mul](args = (%truediv, %inpt), kwargs = {})
    return (mul,)
```
New wrong graph where `sub_2` is replaced incorrectly with `sub`:
```
graph():
    %s0 : torch.SymInt [num_users=0] = placeholder[target=s0]
    %s1 : torch.SymInt [num_users=0] = placeholder[target=s1]
    %l_inpt_ : torch.Tensor [num_users=2] = placeholder[target=L_inpt_]
    %l_osize_ : torch.SymInt [num_users=1] = placeholder[target=L_osize_]
    %size : [num_users=1] = call_method[target=size](args = (%l_inpt_,), kwargs = {})
    %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%size, 1), kwargs = {})
    %sub : [num_users=2] = call_function[target=operator.sub](args = (%getitem_1, 1), kwargs = {})
    %sub_2 : [num_users=1] = call_function[target=operator.sub](args = (%l_osize_, 1.0), kwargs = {})
    %truediv : [num_users=1] = call_function[target=operator.truediv](args = (%sub, %sub_2), kwargs = {})
    %inpt : [num_users=1] = call_function[target=torch.clamp](args = (%l_inpt_, 0, %sub), kwargs = {})
    %mul : [num_users=1] = call_function[target=operator.mul](args = (%truediv, %inpt), kwargs = {})
    return (mul,)
```
With this PR the new graph is the following:
```
graph():
    %s0 : torch.SymInt [num_users=0] = placeholder[target=s0]
    %s1 : torch.SymInt [num_users=0] = placeholder[target=s1]
    %l_inpt_ : torch.Tensor [num_users=2] = placeholder[target=L_inpt_]
    %l_osize_ : torch.SymInt [num_users=1] = placeholder[target=L_osize_]
    %size : [num_users=1] = call_method[target=size](args = (%l_inpt_,), kwargs = {})
    %getitem_1 : [num_users=2] = call_function[target=operator.getitem](args = (%size, 1), kwargs = {})
    %sub : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1), kwargs = {})
    %sub_1 : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1.0), kwargs = {})
    %sub_2 : [num_users=1] = call_function[target=operator.sub](args = (%l_osize_, 1.0), kwargs = {})
    %truediv : [num_users=1] = call_function[target=operator.truediv](args = (%sub_1, %sub_2), kwargs = {})
    %inpt : [num_users=1] = call_function[target=torch.clamp](args = (%l_inpt_, 0, %sub), kwargs = {})
    %mul : [num_users=1] = call_function[target=operator.mul](args = (%truediv, %inpt), kwargs = {})
    return (mul,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119567
Approved by: https://github.com/eellison
2024-02-12 18:52:11 +00:00
27ffede878 [reland] Fix estimate_nccl_collective_runtime (#118986)
`estimate_nccl_collective_runtime` has been broken and the errors have been silently swallowed by inductor. This PR:
- Fixes the issues described in https://github.com/pytorch/pytorch/issues/118497.
- Adds white-box testing so future issues can be surfaced in tests.
- Add support for native funcol IRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118986
Approved by: https://github.com/yf225
ghstack dependencies: #119102
2024-02-12 18:48:06 +00:00
b2043c0543 [c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)
Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421
Approved by: https://github.com/shuqiangzhang
2024-02-12 18:45:49 +00:00
893dcac068 [c10d] explicitly abort communicators in destroy_process_group call (#119250)
Summary:
This PR tries to resolve issue #119215.

Basically, process group shutdown (and hence ncclCommAbort) is called in
the destroy_process_group APIs for the corresponding PGs, and in the
destructor of ProcessGroup we avoid calling abort/ncclCommAbort.
Instead, the destructor just checks whether the user has already explicitly called
destroy_process_group. If not, it will log a warning and encourage/expect users to do so
to clean up the resources of the PGs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119250
Approved by: https://github.com/minsii, https://github.com/kwen2501
2024-02-12 18:40:28 +00:00
31f00b0160 Clarify that legacy cat behavior only applies for 1-D tensor (#119684)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119684
Approved by: https://github.com/albanD
2024-02-12 18:13:04 +00:00
059bf1baa4 Separate clang lint? (#119575)
25 min -> 17 + 13 min, which is still not as fast as I want it to be but I'll take it
Lintrunner provides some parallelism by default, but it's not perfect

Reducing fetch-depth from all to 1 further reduces time by ~2-3 minutes

From non clang's logs:
```
2024-02-09T22:05:39.5297616Z Requirement already satisfied: PyYAML==6.0 in /opt/conda/lib/python3.11/site-packages (6.0)
2024-02-09T22:12:23.6164708Z Collecting black==23.12.1
```
I don't know why this part takes so long, maybe it's just buffering?  Clang version doesn't show this issue

See 5a750c8035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119575
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-02-12 17:46:31 +00:00
bc521f2ce3 In dynamo tracing for index() use None as the default indicator for end and not -1 (#119151)
Summary: In dynamo tracing, `index()`'s implementation currently has the default begin index as `0` and the default end index as `-1`, which means that by default we're dropping the last element. Rather, we should use `None`, which ensures that the last element is also checked.
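Plain-Python `index()` behavior that motivates the change (illustrative only; the traced implementation lives in dynamo):

```python
xs = [1, 2, 3]

print(xs.index(3, 0, len(xs)))  # 2: searching the full range finds the last element
try:
    xs.index(3, 0, -1)          # an end of -1 excludes the last element
except ValueError as e:
    print(e)                    # "3 is not in list"
```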

Test Plan: CI

Differential Revision: D53392287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119151
Approved by: https://github.com/yanboliang
2024-02-12 17:45:05 +00:00
cf474a09f5 Decompose torch.ops.higher_order.auto_functionalized in Inductor (#118673)
We'd like to get auto_functionalized to work with AOTInductor. To get
there, we decompose `output = auto_functionalized(inplace_op, ...)` into its
corresponding aten ops (clones + inplace_op) before the Inductor lowering phase.
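Conceptually, the decomposition has these semantics (a hand-written sketch with illustrative names, not the Inductor pass itself):

```python
import torch

def my_inplace_op(x: torch.Tensor) -> None:
    x.add_(1)  # stands in for a custom mutating operator

def decomposed(x: torch.Tensor) -> torch.Tensor:
    # What `out = auto_functionalized(my_inplace_op, x=x)` computes:
    x_clone = x.clone()     # clone so the caller's tensor is left untouched
    my_inplace_op(x_clone)  # run the real in-place op on the clone
    return x_clone          # the "functional" output

a = torch.zeros(3)
b = decomposed(a)
print(a)  # tensor([0., 0., 0.]): unchanged
print(b)  # tensor([1., 1., 1.])
```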

This decomposition must happen at the end of the Inductor FX passes
because it introduces in-place operations.

The pattern matcher's "replace this single node with multiple nodes" API
isn't robust enough here. The problem is that `auto_functionalized`
returns a single output (this output is a List), but the decomposition
ends up returning the unpacked List (e.g. it may return two tensors).
Previously, there was an assertion that this was not the case; I fixed
up `replace_with_graph` to handle this.

Future: Not all of the clones are necessary (e.g. if the input's last
usage is this operator, then we don't need to clone it). We can add this
logic later.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118673
Approved by: https://github.com/oulgen
2024-02-12 17:30:01 +00:00
8069b29603 [export] Implement logging for scuba. (#119585)
Summary: As we're growing the user surface of torch.export, we'd like to understand better how people are using our APIs. It's also possible to analyze the usages based on static analysis, but due to the fact that there could be many creative ways to call things in Python, I think just building some logging infra will benefit us in the short term and gain us some insights.

Test Plan:
buck test caffe2/test:test_export
{F1454519846}

Reviewed By: tugsbayasgalan

Differential Revision: D53618220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119585
Approved by: https://github.com/avikchaudhuri
2024-02-12 17:28:14 +00:00
757201c213 Refactor ExportedProgram to expose the functions for pre and postprocessing (#119513)
Reason:
Consumers of ExportProgram might choose to further lower exported_program.graph_module
to something else.
Then, it will need to setup the calling convention to call it.

This refactor concentrates these calling convention to one place and can be reused.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119513
Approved by: https://github.com/zhxchen17
2024-02-12 17:22:27 +00:00
72d9a38118 add get_function to TorchInGraphFunctionVariable (#119314)
partially address https://github.com/pytorch/pytorch/issues/118785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119314
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-02-12 16:35:34 +00:00
1c1dc0e4e0 [sparse] Add in out_dtype support (i8i8->bf16, i32) for cusparselt (#119296)
Summary:

Adds in out_dtype support for (i8i8->bf16) and (i8i8->i32) matmul with
cuSPARSELt.

Test Plan:

```
python test/test_sparse_semi_structured.py -k mixed
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119296
Approved by: https://github.com/cpuhrsch, https://github.com/alexsamardzic
2024-02-12 16:02:36 +00:00
5f69d95b2b Enable x86 CPU vectorization on windows (#118980)
Enable VEC on Windows OS.
1. Fix some type definition gaps between Windows and Linux.
2. Fix some operators not supported on Windows, such as [] and /.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-02-12 16:01:30 +00:00
52a3de6cbf [AOTI][refactor] Move ThreadLocalCachedOutputTensor into a separate header (#119392)
Summary: Move common functionality into a separate header so that later JIT and AOT Inductor can share it.

Test Plan: CI

Differential Revision: D53523452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119392
Approved by: https://github.com/khabinov
2024-02-12 15:56:16 +00:00
24bdd03d23 Revert "Reify view_func() closures as ViewFuncs (#118404)"
This reverts commit d5a6762263a98e5153bc057c8ba4f377542c7e55.

Reverted https://github.com/pytorch/pytorch/pull/118404 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/118404#issuecomment-1938600260))
2024-02-12 12:38:51 +00:00
79df897608 Fix some tests in test_c10d_functional_native.py (#119102)
Summary:
This PR fixes a few tests that were broken because `empty` became `empty_strided_cuda` in the generated code.

Also changed some _c10d_functional calls to funcol calls to add coverage to tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119102
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-02-12 09:28:18 +00:00
0342b227e5 Revert "[c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)"
This reverts commit f3e7d809936d9f1bf63102e8afe241e13ed8766a.

Reverted https://github.com/pytorch/pytorch/pull/119421 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119421#issuecomment-1938169747))
2024-02-12 07:34:20 +00:00
cyy
8a3c241094 Remove unused header inclusion (#119667)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119667
Approved by: https://github.com/Skylion007
2024-02-12 05:36:25 +00:00
dcb08a7044 Add CUDAEvent recording for constant folding to show up. (#119216)
Summary: Add a layer of calls so that CUDAEvent activity shows up for constant folding.

Test Plan: Existing tests

Differential Revision: D53437934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119216
Approved by: https://github.com/khabinov
2024-02-12 03:46:36 +00:00
bc4d0277cd [executorch hash update] update the pinned executorch hash (#119648)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119648
Approved by: https://github.com/pytorchbot
2024-02-12 03:42:07 +00:00
76fac69577 add a couple more cases to pointwise_cat perf tests (#119521)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119521
Approved by: https://github.com/ezyang, https://github.com/eellison
2024-02-12 03:41:08 +00:00
647564dbaa Implement conditional statements in kernel analysis (#119664)
This PR makes it so that `ops` is no longer a dict of RET => OP but rather RET => List[OP], since multiple OPs can now return the same RET. In real execution, only one of these OPs will be executed, so there is no need to worry about renaming. For analysis, we pessimistically assume any one of them could be executed (which is safest for analysis purposes).

Example TTIRs that can now be handled:
```
    scf.if %13 {
      %14 = tt.get_program_id y : i32 loc(#loc13)
      %c0_i32_1 = arith.constant 0 : i32 loc(#loc14)
      %15 = arith.cmpi eq, %14, %c0_i32_1 : i32 loc(#loc14)
      scf.if %15 {
        %16 = arith.addf %8, %11 : tensor<4xf32> loc(#loc16)
        %17 = tt.splat %arg2 : (!tt.ptr<f32, 1>) -> tensor<4x!tt.ptr<f32, 1>> loc(#loc17)
        %18 = tt.addptr %17, %4 : tensor<4x!tt.ptr<f32, 1>>, tensor<4xi32> loc(#loc17)
        tt.store %18, %16, %5 {cache = 1 : i32, evict = 1 : i32} : tensor<4xf32> loc(#loc18)
      } else {
      } loc(#loc15)
    } else {
    } loc(#loc12)
```

and

```
    %14 = scf.if %13 -> (tensor<4xf32>) {
      %17 = arith.addf %8, %11 : tensor<4xf32> loc(#loc13)
      scf.yield %17 : tensor<4xf32> loc(#loc13)
    } else {
      %17 = arith.mulf %8, %11 : tensor<4xf32> loc(#loc14)
      scf.yield %17 : tensor<4xf32> loc(#loc14)
    } loc(#loc12)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119664
Approved by: https://github.com/aakhundov
2024-02-12 01:54:26 +00:00
663dd5d006 [inductor] Update the compile options for CppPythonBindingsCodeCache (#119415)
Differential Revision: [D53554681](https://our.internmc.facebook.com/intern/diff/D53554681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119415
Approved by: https://github.com/jansel, https://github.com/khabinov
2024-02-11 21:25:34 +00:00
069581b3ca [BE] Properly mark destructor overrides (#119656)
Otherwise, at least on MacOS builds are littered with:
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/DeviceAccelerator.h:6:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MTIAHooksInterface.h:23:11: warning: '~MTIAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~MTIAHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/CUDAHooksInterface.h:65:11: warning: '~CUDAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~CUDAHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MPSHooksInterface.h:21:11: warning: '~MPSHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~MPSHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
```

 Likely introduced by https://github.com/pytorch/pytorch/pull/119329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119656
Approved by: https://github.com/Skylion007
2024-02-11 21:07:16 +00:00
a4cc6b85dc [dynamo][eval][perf] Remove unnecessary dict copies. (#119305)
Both of these variables are already created using `dict(...)` so making yet another `dict` copy is pure overhead and boilerplate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119305
Approved by: https://github.com/Skylion007
2024-02-11 20:29:26 +00:00
e5f46a1d35 Check alignment of ReinterpretView args of custom Triton kernels (#119649)
Summary: Currently, when a custom (user-written) Triton kernel has a ReinterpretView argument in IR, we're always skipping the alignment checking for this argument when preparing the `signature_of` for the AOT compilation of the Triton kernel (via setting `TensorArg.check_alignment` to `False`). This is problematic for user-written kernels where, albeit reinterpreted, the argument of the Triton kernel (the data pointer) can still be aligned to 16. When we skip alignment checking, the performance of the AOT-compiled internal Triton kernels can degrade 2x--3x.

In this PR, we replace `TensorArg.check_alignment` by `TensorArg.offset`, in which we specify the offset of the `ReinterpretView.layout` relative to the underlying `ir.Buffer` (corresponding to the data pointer before reinterpretation). As the size and stride of the layout don't change the alignment properties, those can be skipped. Importantly, for `ReinterpretView` arguments of custom Triton kernels, we use `arg.data.get_name()` as the buffer name. That, together with the offset, is used to check the alignment.

Bonus: the namedtuples in `codegen/common.py` are refactored as `dataclass`es, with nicer type hints and default values (for the newly added `TensorArg.offset`).

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_reinterpret_view
...
----------------------------------------------------------------------
Ran 6 tests in 27.952s

OK (skipped=4)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119649
Approved by: https://github.com/oulgen
2024-02-11 20:21:17 +00:00
b8e4423278 [torch][cuda][perf] Avoid unnecessary dicts. (#118011)
It's unnecessary and inefficient to create a `dict` from list indices to list values just to check whether a particular `idx` exists there. That approach has `O(N)` time and space complexity, whereas using the `list` directly is `O(1)` in both time and space.
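A small illustration of the pattern being removed (hypothetical names, not the exact call site):

```python
values = ["a", "b", "c"]
idx = 1

# Before: build a dict just to test index membership, which is O(N) time and space.
as_dict = dict(enumerate(values))
present_via_dict = idx in as_dict

# After: the list already answers the same question in O(1).
present_via_list = 0 <= idx < len(values)

assert present_via_dict == present_via_list
```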

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118011
Approved by: https://github.com/Skylion007
2024-02-11 19:29:24 +00:00
95a8d5b1bc [random] Replace for loop with list comprehension. (#119143)
It's more idiomatic and efficient.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119143
Approved by: https://github.com/Skylion007
2024-02-11 19:29:19 +00:00
4394e0dc2c [inductor] Use list comprehension to initialize unused_views. (#119618)
It's more idiomatic and efficient.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119618
Approved by: https://github.com/Skylion007
2024-02-11 18:57:18 +00:00
24be7daf79 Optimize multi_tensor_apply (#119153)
### Summary

Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops.

Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.

This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.

### Benchmark (WIP)

The only benchmark I've conducted so far on `_foreach_copy_` on a set of sizes that resembles internal workload. I need to benchmarks on more problem sizes. The speedup should vary among problem sizes. **However, I believe this PR should not be slower than the previous impl on any problem sizes.**

The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa).

**Baseline**

A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```

**This PR**

A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119153
Approved by: https://github.com/janeyx99
2024-02-11 18:12:22 +00:00
2c91e13afc Add lowerings to special functions (#119187)
As in the title.

In addition, the PR introduces infrastructure for lowerings of pointwise functions that have both cpp and triton implementations available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119187
Approved by: https://github.com/peterbell10
2024-02-11 16:35:40 +00:00
4ee8aac432 [MPS] Enable bfloat16 support on MacOS 14 (#119641)
Per [MPSDataType](https://developer.apple.com/documentation/metalperformanceshaders/mpsdatatype/mpsdatatypebfloat16?changes=_11&language=objc) documentation bfloat16 are supported in MacOS Sonoma or later

Added missing `MPSDataTypeBFloat16` and `MTLLanguageVersion3_1` enums to `MPSGraphSonomaOps.h`

TODO: Enable more testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119641
Approved by: https://github.com/Skylion007
2024-02-11 16:25:29 +00:00
68e009dd8f [BE][EZ] Use dyspatch_sync_with_rethrow in searchsorted (#119646)
For proper exception handling; otherwise, raising a C++ exception inside a dispatch block will crash the app (discovered while enabling more BFloat16 ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119646
Approved by: https://github.com/Skylion007
2024-02-11 07:19:00 +00:00
6cd82253ae fix torch.set_float32_matmul_precision doc (#119620)
Fixes #119606; clarify the explicitly stored number of bits in the doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119620
Approved by: https://github.com/eqy, https://github.com/malfet
2024-02-11 06:41:37 +00:00
cyy
88183923d2 Remove unneeded linking of torch_shm_manager in CMake (#119540)
This PR aims to clean up torch_shm_manager dependency in CMake.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119540
Approved by: https://github.com/ezyang
2024-02-11 06:33:35 +00:00
0bed0501fa Don't skip register-spilling configs in custom Triton kernel auto-tuning (#119634)
Summary: There has been some empirical evidence that, for (non-trivial) custom (user-written) Triton kernels, a register-spilling config yields the best result in auto-tuning. For this reason, we don't skip register-spilling config from auto-tuning of the custom Triton kernels.

<details>
<summary>An example of auto-tuning result with the register-spilling config outperforming others</summary>

```
BLOCK_M: 16, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.748896, nreg 255, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.723424, nreg 249, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 2.202656, nreg 190, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.748256, nreg 255, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.724896, nreg 249, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 2.201632, nreg 190, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.651664, nreg 255, nspill 56, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.846368, nreg 255, nspill 14, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.841792, nreg 243, nspill 0, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.651584, nreg 255, nspill 56, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.846432, nreg 255, nspill 14, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.841904, nreg 243, nspill 0, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.236448, nreg 255, nspill 254, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.484384, nreg 255, nspill 174, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.131168, nreg 255, nspill 6, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.236544, nreg 255, nspill 254, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.483648, nreg 255, nspill 174, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.131408, nreg 255, nspill 6, #shared-mem 22528
BLOCK_M: 32, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.516112, nreg 255, nspill 28, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.737792, nreg 255, nspill 0, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.411632, nreg 193, nspill 0, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.515904, nreg 255, nspill 28, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.736608, nreg 255, nspill 0, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.409808, nreg 193, nspill 0, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.553536, nreg 255, nspill 130, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.569792, nreg 255, nspill 56, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.892448, nreg 255, nspill 4, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.553584, nreg 255, nspill 130, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.569568, nreg 255, nspill 56, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.892240, nreg 255, nspill 4, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.332928, nreg 255, nspill 366, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.922256, nreg 255, nspill 228, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.758400, nreg 255, nspill 26, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.333440, nreg 255, nspill 366, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.922336, nreg 255, nspill 228, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.758496, nreg 255, nspill 26, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.231648, nreg 255, nspill 292, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.639424, nreg 255, nspill 90, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.917952, nreg 240, nspill 0, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.230624, nreg 255, nspill 292, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.639168, nreg 255, nspill 90, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.917440, nreg 240, nspill 0, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.838080, nreg 255, nspill 354, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.569184, nreg 255, nspill 178, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.614720, nreg 255, nspill 28, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.838048, nreg 255, nspill 354, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.569472, nreg 255, nspill 178, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.615104, nreg 255, nspill 28, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.012128, nreg 255, nspill 522, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.861536, nreg 255, nspill 378, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.771584, nreg 255, nspill 134, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.012512, nreg 255, nspill 522, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.861024, nreg 255, nspill 378, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.771712, nreg 255, nspill 134, #shared-mem 40960
```

</details>

In the above, the winning config is `BLOCK_M: 32, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2`, although it has non-zero `nspill 28`. This is an example where we need to consider all configs, including the register-spilling ones, to obtain the best result from auto-tuning.

In the worst case, this will just make auto-tuning longer, but can't regress the results. And, as the number of custom Triton kernels in the model is normally much smaller than the number of Inductor-generated ones, this should be acceptable.
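
For illustration, a minimal sketch of a user-written Triton kernel with several autotuning configs; the kernel and config values are hypothetical, not the ones from the table above. With this change, every listed config is benchmarked even if it spills registers:

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 256}, num_warps=2, num_stages=2),
        triton.Config({"BLOCK_SIZE": 512}, num_warps=4, num_stages=2),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=8, num_stages=2),
    ],
    key=["n_elements"],
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the inputs.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```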

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119634
Approved by: https://github.com/oulgen
2024-02-11 02:13:25 +00:00
3ab08946d5 Revert "[aot_inductor] move CudaWrapperCodeGen into a separate file (#119448)"
This reverts commit 0597dab523c0a341e136452a8f723f12700164c0.

Reverted https://github.com/pytorch/pytorch/pull/119448 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119448#issuecomment-1937345167))
2024-02-10 23:04:36 +00:00
d8e319a961 Revert "[aot_inductor] move CppWrapperCodeGen into a separate file (#119491)"
This reverts commit 760056bbdc552314e7e81adc45e11766ac0f333c.

Reverted https://github.com/pytorch/pytorch/pull/119491 on behalf of https://github.com/DanilBaibak due to Reverted as a dependency for #119448 ([comment](https://github.com/pytorch/pytorch/pull/119491#issuecomment-1937344548))
2024-02-10 23:02:05 +00:00
6db6a1b526 [aten] Use emplace instead of insert. (#119614)
This avoids pair construction when the inserted key is already present in the dict.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119614
Approved by: https://github.com/Skylion007
2024-02-10 22:35:00 +00:00
2c8722182e [dynamo][guards] Avoid unnecessary stack copies. (#119115)
There is no need to make a `frame_summary_stack` copy when it is not modified. The proposed change uses a copy-on-write, functional approach that is easy to understand and is more efficient when `self.loc_in_frame` is `None`.
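
A minimal sketch of the copy-on-write idea, with illustrative names rather than the actual Dynamo code:

```python
def with_loc(frame_summary_stack, loc_in_frame):
    # Reuse the existing list untouched when there is nothing to append...
    if loc_in_frame is None:
        return frame_summary_stack
    # ...and copy only when the stack actually has to be extended.
    return frame_summary_stack + [loc_in_frame]
```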

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119115
Approved by: https://github.com/Skylion007
2024-02-10 21:56:00 +00:00
cyy
568740f080 [DeviceIndex][2/N] Use DeviceIndex instead of int in allocators (#119545)
Follows #119142
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119545
Approved by: https://github.com/ezyang
2024-02-10 20:27:59 +00:00
57d8f67619 [Dynamo][17/N] Rename SkipFilesVariable to SkipFunctionVariable and move to functions.py (#119619)
This is follow-up-3 from https://github.com/pytorch/pytorch/pull/118971#issue-2114082018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119619
Approved by: https://github.com/jansel
2024-02-10 19:33:37 +00:00
dcce5327bb [core][perf] Use set comprehensions in _RecreateLookupTables. (#119617)
It's more idiomatic and much more efficient.
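
For illustration, the kind of rewrite this refers to, with hypothetical names:

```python
from types import SimpleNamespace

# Hypothetical input: a handful of objects with a .name attribute.
ops = [SimpleNamespace(name=n) for n in ("add", "mul", "add")]

# Before: build the lookup set with an explicit loop.
names = set()
for op in ops:
    names.add(op.name)

# After: an equivalent set comprehension, more idiomatic and faster.
names = {op.name for op in ops}
print(names)  # {'add', 'mul'}
```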

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119617
Approved by: https://github.com/Skylion007
2024-02-10 18:53:25 +00:00
c5116d9e44 Fix optim.lr_scheduler examples in doc to use optimizer vs self.opt (#119563)
Fixes #119561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119563
Approved by: https://github.com/janeyx99
2024-02-10 15:10:43 +00:00
34db6f1b13 Revert "make flash_attn_bw impl correct w.r.t. meta when k and v have different strides (#119500)"
This reverts commit 095f4713077639f0e48fa33d051c0de2eb1f8525.

Reverted https://github.com/pytorch/pytorch/pull/119500 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119500#issuecomment-1937003082))
2024-02-10 13:06:30 +00:00
c0f1183eb4 [inductor] Fix compile error on scan with no mask (#119555)
Fixes #119591

Currently this results in invalid syntax:
```python
tmp4 = tl.where(, tmp1, tmp2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119555
Approved by: https://github.com/lezcano
2024-02-10 12:38:40 +00:00
e71c202520 Use CUDA if cuda's macro is set for AOTI runner's pybind (#119616)
Summary:
Use CUDA if cuda's macro is set for AOTI runner's pybind
This is a duplicate of #119438 for landing issues

Test Plan:
Existing tests (D52303882)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119616
Approved by: https://github.com/khabinov
2024-02-10 11:00:47 +00:00
3581428ea0 Do not mark tt.load's arguments as mutated (#119631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119631
Approved by: https://github.com/aakhundov
ghstack dependencies: #119581, #119615
2024-02-10 08:46:50 +00:00
6c5bf5a5ce Implement kernel analysis for functions with multiple return values (#119615)
This diff adds a few improvements:

* Parsing for multiple return value: `tt.return %1, %arg0`
* Parsing for assignment for multiple values: `%1:2` means %1 has two values
* Parsing for usage of a value with multiple values: `%1#0` means 0th index of %1
* Fixes a bug in memo-cycle detection when multiple tests are executed back to back

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119615
Approved by: https://github.com/aakhundov
ghstack dependencies: #119581
2024-02-10 08:46:50 +00:00
e693089c7a [Dynamo] Refactor tensor methods handling (#119581)
Fixes part of #119128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119581
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-02-10 08:46:50 +00:00
699ae72f51 [DCP][state_dict] Fix the issue that get_state_dict/set_state_dict ignore the buffer (#119573)
get_state_dict and set_state_dict currently do not appropriately handle
buffers. This PR fixes this issue.

Fixes https://github.com/pytorch/pytorch/issues/119535.

Differential Revision: [D53616762](https://our.internmc.facebook.com/intern/diff/D53616762/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119573
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-02-10 06:36:58 +00:00
a82c50793e [executorch hash update] update the pinned executorch hash (#119510)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119510
Approved by: https://github.com/pytorchbot
2024-02-10 03:40:34 +00:00
8fd11cb307 [2/2] Intel GPU Runtime Upstreaming for Stream (#117619)
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.

# Design
Currently, it primarily offers stream-related APIs, including
 - `torch.xpu.StreamContext`
 - `torch.xpu.current_stream`
 - `torch.xpu.set_stream`
 - `torch.xpu.synchronize`
 - `torch._C._xpu_getCurrentRawStream`
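
A usage sketch built only from the APIs listed above, assuming an XPU-enabled build and an available XPU device; the calls mirror their torch.cuda counterparts:

```python
import torch

# Sketch: query, set, and synchronize the current XPU stream.
s = torch.xpu.current_stream()   # stream currently used for XPU work
torch.xpu.set_stream(s)          # make a stream the current one
torch.xpu.synchronize()          # block until all queued XPU work completes
```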

# Additional Context
We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR, which is related to `Event`.

The differences from CUDA: XPU has no default or external stream, and it lacks the APIs below:
- `torch.cuda.ExternalStream`
- `torch.cuda.default_stream`
- `torch.cuda.is_current_stream_capturing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #117611
2024-02-10 03:39:42 +00:00
f2778e3874 [vision hash update] update the pinned vision hash (#119511)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119511
Approved by: https://github.com/pytorchbot
2024-02-10 03:22:13 +00:00
42ca82dfb1 [audio hash update] update the pinned audio hash (#119612)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119612
Approved by: https://github.com/pytorchbot
2024-02-10 03:22:06 +00:00
3278b4c557 be more conservative until regression is debugged (#119583)
See internal regression: https://www.internalfb.com/diff/D53375778?transaction_fbid=953511712782168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119583
Approved by: https://github.com/Chillee
2024-02-10 03:06:58 +00:00
70a364d402 non-strict improvements: constant args and kwargs (#119529)
This PR makes a couple of improvements to non-strict to bring it closer to strict. (This lets us remove some expected failures from test_export.)

1. Support constant arguments (easy).
2. Support keyword arguments. This forces us to add kwargs to `aot_export_module`. Indeed there is no way to make this work otherwise, because some arguments in a function signature can be keyword-only and thus cannot be simulated by positional arguments alone. Adding kwargs to `aot_export_module` turns out to be fairly routine, but there is a bit of an unsatisfactory fork between how it is called by strict and non-strict: because strict calls it on a graph module, kwargs must be converted to positional arguments. So kwargs in `aot_export_module` really only come into play in non-strict.
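
A small sketch of the kind of case this enables; the module is illustrative, and `strict=False` is assumed to select non-strict export:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x, *, scale=2.0):  # keyword-only argument with a constant default
        return x * scale

# Non-strict export with a constant keyword argument; previously kwargs could
# not be threaded through aot_export_module in this mode.
ep = torch.export.export(M(), (torch.randn(3),), {"scale": 3.0}, strict=False)
print(ep)
```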

Differential Revision: D53600977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119529
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2024-02-10 02:55:40 +00:00
760056bbdc [aot_inductor] move CppWrapperCodeGen into a separate file (#119491)
This PR moves the CppWrapperCodeGen class into a separate file,
cpp_wrapper.py, to simplify wrapper.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119491
Approved by: https://github.com/desertfire, https://github.com/albanD
2024-02-10 02:15:56 +00:00
095f471307 make flash_attn_bw impl correct w.r.t. meta when k and v have different strides (#119500)
`dv = at::empty_like(k)` and `dv = at::empty_like(v)` can be materially different, because `empty_like` tries to preserve the strides of the input when possible. So if `k` is contiguous but `v` is transposed, then before this PR, `dv` would be computed to be contiguous.

Alternatively, we could change the meta implementation of `aten._scaled_dot_product_flash_attention` to this:
```
    grad_q = torch.empty_like(query.transpose(1, 2)).transpose(1, 2)
    grad_k = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
    grad_v = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
    return grad_q, grad_k, grad_v
```

But (I think?) the logic in the sdpa backward impl was a typo.

I noticed this because changing the meta formula as above was enough to fix the issue with the `aot_eager` backend in this [link](https://github.com/pytorch/pytorch/issues/116935#issuecomment-1914310523).

A minimal repro that I made looks like this:
```
import torch

# in this repro, "grad_out" and "value" are transposed tensors,
# but "key" and "value" are contiguous
a = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
b = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
c = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
d = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
e = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
f = torch.randn(2, 16, 513, device='cuda')
g = None
h = None
i = 513
j = 513
k = 0.0
l = False
m = torch.tensor(1, dtype=torch.int64)
n = torch.tensor(1, dtype=torch.int64)

out1_ref, out2_ref, out3_ref = torch.ops.aten._scaled_dot_product_flash_attention_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)

from torch._meta_registrations import meta__scaled_dot_product_flash_backward
out1_test, out2_test, out3_test = meta__scaled_dot_product_flash_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)

# prints True True
print(out1_ref.is_contiguous())
print(out1_test.is_contiguous())

# prints True True
print(out2_ref.is_contiguous())
print(out2_test.is_contiguous())

# prints True False
print(out3_ref.is_contiguous())
print(out3_test.is_contiguous())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119500
Approved by: https://github.com/drisspg, https://github.com/ezyang, https://github.com/Skylion007
2024-02-10 02:04:56 +00:00
e1c1b8c2b2 [dynamo] Improve support for backwards hooks (#119525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119525
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-02-10 01:14:03 +00:00
cyy
05602915f5 Link torch_cpu to cudart only if CUPTI is enabled (#118232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118232
Approved by: https://github.com/ezyang
2024-02-10 00:53:51 +00:00
44796682d0 [torch][ao] Fix module name filter for pytorch2 quantization for underscores (#119344)
Summary:
There was a bug in the module name filter for modules that already had an underscore
in their name: the underscore was replaced with dot notation.
This happened because underscores were assumed to always mean a module separator,
but this isn't the case for modules whose name contains an underscore.

Test Plan:
Added a unit test. Before this change, that test failed (due to applying the wrong
qscheme). Now it passes.

Differential Revision: D53502771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119344
Approved by: https://github.com/jerryzh168
2024-02-10 00:29:08 +00:00
34f7dc9eba [ONNX] Support op consistency error reproduction (#119512)
Fixes #119472

Introduce the debugging tool in onnxscript: https://github.com/microsoft/onnxscript/blob/main/onnxscript/tests/function_libs/torch_lib/error_reproduction.py

This tool can help us quickly find the inputs leading to mismatched errors.

NOTE: this produces `error_reports` folder where there are different markdown reports for each mismatched test cases.

For example - CREATE_REPRODUCTION_REPORT=1 python -m pytest onnxscript/tests/function_libs/torch_lib/ops_test.py -k test_output_match_fft_fft_cpu_bool

### Summary

The output of ONNX Runtime does not match that of PyTorch when executing test
`test_fx_op_consistency.TestOnnxModelOutputConsistency_opset_version_18_model_type_TorchModelType.TORCH_NN_MODULECPU.test_output_match_fft_fft_cpu_bool`, `sample 3` in ONNX Script `TorchLib`.

To recreate this report, use

```bash
CREATE_REPRODUCTION_REPORT=1 python -m pytest onnxscript/tests/function_libs/torch_lib/ops_test.py -k test_output_match_fft_fft_cpu_bool
```

### ONNX Model

```
<
   ir_version: 8,
   opset_import: ["pkg.onnxscript.torch_lib" : 1, "" : 18, "pkg.onnxscript.torch_lib.common" : 1],
   producer_name: "pytorch",
   producer_version: "2.2.0"
>
main_graph (bool[31] l_args_0_) => (float[31,2] _fft_r2c)
   <bool[31] l_args_0_, float[31] _to_copy, float[31,2] _fft_r2c>
{
   _to_copy = Cast <to: int = 1> (l_args_0_)
   _val_2 = Constant <value: tensor = int64[1] {-1}> ()
   _val_3 = Unsqueeze (_to_copy, _val_2)
   _val_4 = Constant <value: tensor = int64[1] {0}> ()
   _val_5 = Unsqueeze (_val_3, _val_4)
   _val_6 = DFT <axis: int = 1, inverse: int = 0, onesided: int = 0> (_val_5)
   _val_7 = Constant <value: tensor = int64[1] {0}> ()
   _val_8 = Squeeze (_val_6, _val_7)
   _fft_r2c = pkg.onnxscript.torch_lib._fftn_onnx_normalization <dims: ints = [0], forward: int = 1, normalization: int = 0> (_val_3, _val_8)
}
<
  domain: "pkg.onnxscript.torch_lib",
  opset_import: ["" : 18]
>
_fftn_onnx_normalization <normalization,forward,dims>(self, transformed) => (result_15)
{
   self_shape = Shape (self)
   dims = Constant <value_ints: ints = @dims> ()
   self_shape_subscripted = Gather <axis: int = 0> (self_shape, dims)
   total_sample_count = ReduceProd <keepdims: int = 0> (self_shape_subscripted)
   total_sample_count_0 = CastLike (total_sample_count, transformed)
   normalization = Constant <value_int: int = @normalization> ()
   int64_1 = Constant <value: tensor = int64 int64_1 {1}> ()
   cond = Equal (normalization, int64_1)
   result_15 = If (cond) <then_branch: graph = thenGraph_21 () => ( result_3) {
      forward = Constant <value_int: int = @forward> ()
      forward_as_bool = Cast <to: int = 9> (forward)
      result_3 = If (forward_as_bool) <then_branch: graph = thenGraph_23 () => ( result) {
         tmp = Sqrt (total_sample_count_0)
         result = Div (transformed, tmp)
      }, else_branch: graph = elseGraph_23 () => ( result_2) {
         tmp_1 = Sqrt (total_sample_count_0)
         result_2 = Mul (transformed, tmp_1)
      }>
   }, else_branch: graph = elseGraph_21 () => ( result_14) {
      normalization_4 = Constant <value_int: int = @normalization> ()
      int64_2 = Constant <value: tensor = int64 int64_2 {2}> ()
      cond_5 = Equal (normalization_4, int64_2)
      result_14 = If (cond_5) <then_branch: graph = thenGraph_27 () => ( result_9) {
         forward_6 = Constant <value_int: int = @forward> ()
         forward_6_as_bool = Cast <to: int = 9> (forward_6)
         result_9 = If (forward_6_as_bool) <then_branch: graph = thenGraph_29 () => ( result_7) {
            result_7 = Div (transformed, total_sample_count_0)
         }, else_branch: graph = elseGraph_29 () => ( result_8) {
            result_8 = Identity (transformed)
         }>
      }, else_branch: graph = elseGraph_27 () => ( result_13) {
         forward_10 = Constant <value_int: int = @forward> ()
         forward_10_as_bool = Cast <to: int = 9> (forward_10)
         result_13 = If (forward_10_as_bool) <then_branch: graph = thenGraph_35 () => ( result_11) {
            result_11 = Identity (transformed)
         }, else_branch: graph = elseGraph_35 () => ( result_12) {
            result_12 = Mul (transformed, total_sample_count_0)
         }>
      }>
   }>
}
<
  domain: "pkg.onnxscript.torch_lib.common",
  opset_import: ["" : 18]
>
Rank (input) => (return_val)
{
   tmp = Shape (input)
   return_val = Size (tmp)
}
<
  domain: "pkg.onnxscript.torch_lib.common",
  opset_import: ["" : 18]
>
IsScalar (input) => (return_val)
{
   tmp = Shape (input)
   tmp_0 = Size (tmp)
   tmp_1 = Constant <value_int: int = 0> ()
   return_val = Equal (tmp_0, tmp_1)
}
```

### Inputs

Shapes: `['Tensor<torch.Size([31]), dtype=torch.bool>']`

<details><summary>Details</summary>
<p>

```python
kwargs = {}
inputs = (tensor([False, False,  True,  True, False,  True, False,  True, False, False,
         True, False, False, False, False, False,  True,  True,  True,  True,
         True,  True,  True,  True, False, False, False, False,  True,  True,
         True]),)
```

</p>
</details>

### Expected output

Shape: `torch.Size([31, 2])`

<details><summary>Details</summary>
<p>

```python
expected = tensor([[16.0000,  0.0000],
        [-0.2369,  2.6590],
        [ 0.7336, -4.9670],
        [ 2.2093,  2.9865],
        [-0.7166,  1.0928],
        [-3.0614,  3.0015],
        [-1.8945, -0.9677],
        [-2.1538,  0.2513],
        [-2.2432,  1.3978],
        [-0.3429,  1.9494],
        [-0.6495, -1.5423],
        [-0.6005,  2.2398],
        [ 2.2639,  2.6430],
        [ 1.7609,  0.2033],
        [-1.3829, -2.3365],
        [-1.6854, -0.0311],
        [-1.6854,  0.0311],
        [-1.3829,  2.3365],
        [ 1.7609, -0.2033],
        [ 2.2639, -2.6430],
        [-0.6005, -2.2398],
        [-0.6495,  1.5423],
        [-0.3429, -1.9494],
        [-2.2432, -1.3978],
        [-2.1538, -0.2513],
        [-1.8945,  0.9677],
        [-3.0614, -3.0015],
        [-0.7166, -1.0928],
        [ 2.2093, -2.9865],
        [ 0.7336,  4.9670],
        [-0.2369, -2.6590]])
```

</p>
</details>

### Actual output

Shape: `torch.Size([31, 2])`

<details><summary>Details</summary>
<p>

```python
actual = tensor([[ 1.6000e+01, -9.1791e-06],
        [-2.3695e-01,  2.6590e+00],
        [ 7.3355e-01, -4.9670e+00],
        [ 2.2093e+00,  2.9865e+00],
        [-7.1663e-01,  1.0928e+00],
        [-3.0614e+00,  3.0015e+00],
        [-1.8946e+00, -9.6773e-01],
        [-2.1538e+00,  2.5126e-01],
        [-2.2432e+00,  1.3978e+00],
        [-3.4294e-01,  1.9494e+00],
        [-6.4946e-01, -1.5423e+00],
        [-6.0044e-01,  2.2398e+00],
        [ 2.2639e+00,  2.6430e+00],
        [ 1.7609e+00,  2.0326e-01],
        [-1.3829e+00, -2.3365e+00],
        [-1.6854e+00, -3.1130e-02],
        [-1.6854e+00,  3.1161e-02],
        [-1.3829e+00,  2.3365e+00],
        [ 1.7609e+00, -2.0327e-01],
        [ 2.2639e+00, -2.6430e+00],
        [-6.0047e-01, -2.2398e+00],
        [-6.4945e-01,  1.5423e+00],
        [-3.4294e-01, -1.9494e+00],
        [-2.2432e+00, -1.3978e+00],
        [-2.1538e+00, -2.5129e-01],
        [-1.8945e+00,  9.6773e-01],
        [-3.0615e+00, -3.0015e+00],
        [-7.1663e-01, -1.0928e+00],
        [ 2.2093e+00, -2.9865e+00],
        [ 7.3354e-01,  4.9670e+00],
        [-2.3695e-01, -2.6589e+00]])
```

</p>
</details>

### Difference

<details><summary>Details</summary>
<p>

```diff
--- actual
+++ expected
@@ -1,31 +1,31 @@
-tensor([[ 1.6000e+01, -9.1791e-06],
-        [-2.3695e-01,  2.6590e+00],
-        [ 7.3355e-01, -4.9670e+00],
-        [ 2.2093e+00,  2.9865e+00],
-        [-7.1663e-01,  1.0928e+00],
-        [-3.0614e+00,  3.0015e+00],
-        [-1.8946e+00, -9.6773e-01],
-        [-2.1538e+00,  2.5126e-01],
-        [-2.2432e+00,  1.3978e+00],
-        [-3.4294e-01,  1.9494e+00],
-        [-6.4946e-01, -1.5423e+00],
-        [-6.0044e-01,  2.2398e+00],
-        [ 2.2639e+00,  2.6430e+00],
-        [ 1.7609e+00,  2.0326e-01],
-        [-1.3829e+00, -2.3365e+00],
-        [-1.6854e+00, -3.1130e-02],
-        [-1.6854e+00,  3.1161e-02],
-        [-1.3829e+00,  2.3365e+00],
-        [ 1.7609e+00, -2.0327e-01],
-        [ 2.2639e+00, -2.6430e+00],
-        [-6.0047e-01, -2.2398e+00],
-        [-6.4945e-01,  1.5423e+00],
-        [-3.4294e-01, -1.9494e+00],
-        [-2.2432e+00, -1.3978e+00],
-        [-2.1538e+00, -2.5129e-01],
-        [-1.8945e+00,  9.6773e-01],
-        [-3.0615e+00, -3.0015e+00],
-        [-7.1663e-01, -1.0928e+00],
-        [ 2.2093e+00, -2.9865e+00],
-        [ 7.3354e-01,  4.9670e+00],
-        [-2.3695e-01, -2.6589e+00]])
+tensor([[16.0000,  0.0000],
+        [-0.2369,  2.6590],
+        [ 0.7336, -4.9670],
+        [ 2.2093,  2.9865],
+        [-0.7166,  1.0928],
+        [-3.0614,  3.0015],
+        [-1.8945, -0.9677],
+        [-2.1538,  0.2513],
+        [-2.2432,  1.3978],
+        [-0.3429,  1.9494],
+        [-0.6495, -1.5423],
+        [-0.6005,  2.2398],
+        [ 2.2639,  2.6430],
+        [ 1.7609,  0.2033],
+        [-1.3829, -2.3365],
+        [-1.6854, -0.0311],
+        [-1.6854,  0.0311],
+        [-1.3829,  2.3365],
+        [ 1.7609, -0.2033],
+        [ 2.2639, -2.6430],
+        [-0.6005, -2.2398],
+        [-0.6495,  1.5423],
+        [-0.3429, -1.9494],
+        [-2.2432, -1.3978],
+        [-2.1538, -0.2513],
+        [-1.8945,  0.9677],
+        [-3.0614, -3.0015],
+        [-0.7166, -1.0928],
+        [ 2.2093, -2.9865],
+        [ 0.7336,  4.9670],
+        [-0.2369, -2.6590]])
```

</p>
</details>

### Full error stack

```
Tensor-likes are not close!

Mismatched elements: 21 / 62 (33.9%)
Greatest absolute difference: 3.719329833984375e-05 at index (26, 1) (up to 1e-05 allowed)
Greatest relative difference: 0.0005033136694692075 at index (15, 1) (up to 1.3e-06 allowed)
  File "/home/titaiwang/pytorch/test/onnx/test_fx_op_consistency.py", line 1763, in _compare_onnx_and_torch_exported_program
    torch.testing.assert_close(
  File "/home/titaiwang/pytorch/torch/testing/_comparison.py", line 1523, in assert_close
    raise error_metas[0].to_error(msg)

```

### Environment

```
OS: Linux-5.15.135.1-2.cm2-x86_64-with-glibc2.35
Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
onnx==1.15.0
onnxruntime==1.17.0
onnxscript==0.1.0.dev20240207
numpy==1.26.0
torch==2.2.0a0+git684ce1b
```
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119512
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2024-02-09 23:24:01 +00:00
bb287d73ec [ONNX] Apply modularization to exported program exporting (#119498)
Apply the modularization pass to exported-program exporting. The only two things that need to be taken care of are (1) the extra call stack generated by `torch.export.export` and (2) the fact that lifted placeholders have a call stack (unlike the original placeholders).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119498
Approved by: https://github.com/thiagocrepaldi
2024-02-09 22:57:42 +00:00
3372aa51b4 Integrate swap_tensors into nn.Module.load_state_dict (#117913)
Added a `torch.Tensor` method that defines how to transform `other`, a value in the state dictionary, to be loaded into `self`, a param/buffer in an `nn.Module` before swapping via `torch.utils.swap_tensors`
* `param.module_load(sd[key])`

This method can be overridden using `__torch_function__`.
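
A sketch of the load path described above; the local variable names are illustrative:

```python
import torch

# Illustrative: a single parameter being loaded from a state-dict value.
param = torch.nn.Parameter(torch.zeros(3))
incoming = torch.ones(3)

# module_load (overridable via __torch_function__) decides how `incoming` is
# transformed before it replaces `param` via torch.utils.swap_tensors.
new_value = torch.nn.Parameter(param.module_load(incoming))
torch.utils.swap_tensors(param, new_value)
print(param)  # now holds ones
```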

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117913
Approved by: https://github.com/albanD
2024-02-09 22:32:29 +00:00
a7f82b7d62 [fix] tmp fix for import issue in dtensor (#119582)
A temporary fix for S394053, which is likely caused by a backward-incompatible `import` introduced in D53437243. It is not yet understood why this causes an issue, but let's forward-"fix" it first and then draft a follow-up diff with a proper fix.

Differential Revision: [D53621345](https://our.internmc.facebook.com/intern/diff/D53621345/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119582
Approved by: https://github.com/tianyu-l
2024-02-09 20:50:27 +00:00
bf8db86a19 [FSDP] Added deprecation msg for NO_SHARD (#119553)
This only includes the warning for world size >1 since we clamp to `NO_SHARD` for world size 1. We mainly do not want `NO_SHARD` to proliferate anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119553
Approved by: https://github.com/Skylion007
2024-02-09 20:32:03 +00:00
f3e7d80993 [c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)
Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421
Approved by: https://github.com/shuqiangzhang
2024-02-09 20:23:20 +00:00
0597dab523 [aot_inductor] move CudaWrapperCodeGen into a separate file (#119448)
wrapper.py is getting more complex. Let's first split it
into smaller pieces. Will have another PR to move CppWrapperCodeGen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119448
Approved by: https://github.com/desertfire
2024-02-09 20:18:04 +00:00
9a1df7cfd7 ReduceLROnPlateau init _last_lr (#119366) (#119556)
Fixes #119366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119556
Approved by: https://github.com/janeyx99
2024-02-09 19:35:02 +00:00
bf8a5a11be Fix Inductor CSE Across Separate Reductions (#119410)
We were CSE'ing a load across two separate reduction loop bodies. This is because we were examining an indirect indexing expression that did not have an explicit rindex in its load. I've commented with more details and other potential fixes.

Tried using the minifier unsuccessfully and hand-minified somewhat, but could do more.

Fix for https://github.com/pytorch/pytorch/issues/119327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119410
Approved by: https://github.com/shunting314, https://github.com/jansel
2024-02-09 19:34:57 +00:00
f208795182 Improve TORCHDYNAMO_EXTENDED_DEBUG for GuardOnDataDependentSymNode (#119412)
This PR substantially improves the error reporting for GuardOnDataDependentSymNode in the following ways:

* The GuardOnDataDependentSymNode error message is rewritten for clarity, and contains a link to a new doc on how to resolve these issues https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit#heading=h.44gwi83jepaj
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL`, which lets you specify a symbol name to get detailed debug information when it is logged (e.g., the full backtrace and user backtrace of the symbol creation). The exact symbols that you may be interested in our now explicitly spelled out in the error message.
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CPP` which enables reporting C++ backtraces whenever we would report a backtrace.
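
A sketch of setting the new variables from Python; the symbol name `"u2"` is a placeholder for whatever name the error message reports:

```python
import os

# Set before compilation runs; the variable names are the ones described above.
os.environ["TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL"] = "u2"  # placeholder symbol name
os.environ["TORCHDYNAMO_EXTENDED_DEBUG_CPP"] = "1"             # include C++ backtraces
```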

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119412
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #117356
2024-02-09 19:15:28 +00:00
01e248d6f1 Fix FallbackKernel behavior on mutable ops (#118649)
FallbackKernel wasn't handling mutable ops correctly: it would not report
them in get_mutation_names or get_alias_names. This would lead to silent
incorrectness -- Inductor would incorrectly reorder the mutable op with other
mutable ops.

This PR fixes that:
- we only support mutable operations that are "auto_functionalizable".
  That is, they mutate inputs and do not return aliases of any inputs.
- Following the Triton kernel work, any mutated inputs must be specified
  in get_alias_names and processed via mark_node_as_mutating
- We also do some minor cleanup by killing dead code (FallbackKernel no
  longer processes OpOverloadPacket) and adding some handling around
  HOPs.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118649
Approved by: https://github.com/eellison, https://github.com/oulgen
2024-02-09 19:01:54 +00:00
25a0fa6d13 Revert "[dynamo] Improve support for backwards hooks (#119525)"
This reverts commit b1f4b2a63c038f0090886d7d213825f39c283ea5.

Reverted https://github.com/pytorch/pytorch/pull/119525 on behalf of https://github.com/clee2000 due to broke test_autograd.py::TestAutograd::test_post_accumulate_grad_hook_gets_cleaned_up on dynamo https://github.com/pytorch/pytorch/actions/runs/7847212828/job/21416215820 b1f4b2a63c.  The failure exists on the PR as well, but got masked by the other test.  Putting this as no signal? ([comment](https://github.com/pytorch/pytorch/pull/119525#issuecomment-1936447169))
2024-02-09 18:58:55 +00:00
4b9568a360 Add Accelerator device and shell hooks (#119329)
This adds a concept of Accelerator that points to one of our devices. See DeviceAccelerator.h in this PR for details https://github.com/pytorch/pytorch/pull/119329/files#diff-83cc748bed5df1a453c272cc5ecc7e572d4eb694c5125384d8fbd17a0b5f50c8
It also adds scaffolding for shared C++ API to allow generic feature implementation. This PR in particular updates the autograd engine to use this generic API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119329
Approved by: https://github.com/ezyang
2024-02-09 18:54:28 +00:00
d5a6762263 Reify view_func() closures as ViewFuncs (#118404)
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.

```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
  virtual ~ViewFunc() {}
  /// Returns any SymInts in the saved state.
  virtual std::vector<c10::SymInt> get_symints() const { return {}; }
  /// Returns the number of SymInts in the saved state.
  virtual size_t num_symints() const { return 0; }
  /// Returns any tensors in the saved state.
  virtual std::vector<at::Tensor> get_tensors() const { return {}; }
  /// Returns the number of tensors in the saved state.
  virtual size_t num_tensors() const { return 0; }
  /// Reapplies the view on the given base using the saved state.
  virtual at::Tensor operator()(const at::Tensor&) const = 0;
  /// Returns a clone of this ViewFunc, optionally with the specified saved state.
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;

protected:
  /// Sets the values of any SymInts in the saved state. The input vector size must
  /// match the number of SymInts in the saved state (i.e. the size of the list
  /// returned by get_symints()).
  virtual void set_symints(std::vector<c10::SymInt>) {}
  /// Sets the values of any Tensors in the saved state. The input vector size must
  /// match the number of Tensors in the saved state (i.e. the size of the list
  /// returned by get_tensors()).
  virtual void set_tensors(std::vector<at::Tensor>) {}
};
```

New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`

The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd.

Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
  SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
  {};
  virtual ~SliceTensorViewFunc() override {};
  virtual std::vector<c10::SymInt> get_symints() const override;
  virtual size_t num_symints() const override;
  virtual std::vector<at::Tensor> get_tensors() const override;
  virtual size_t num_tensors() const override;
  virtual at::Tensor operator()(const at::Tensor&) const override;
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;

protected:
  virtual void set_symints(std::vector<c10::SymInt>) override;
  virtual void set_tensors(std::vector<at::Tensor>) override;

private:
  int64_t dim;
  c10::optional<c10::SymInt> start;
  c10::optional<c10::SymInt> end;
  c10::SymInt step;
};
...

// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
  ::std::vector<c10::SymInt> symints;
  symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
  if(start.has_value()) symints.insert(symints.end(), *(start));
  if(end.has_value()) symints.insert(symints.end(), *(end));
  symints.push_back(step);
  return symints;
}

size_t SliceTensorViewFunc::num_symints() const {
  return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}

void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
  TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
  auto i = 0;
  if(start.has_value()) start = symints[i];
  i += (start.has_value() ? 1 : 0);
  if(end.has_value()) end = symints[i];
  i += (end.has_value() ? 1 : 0);
  step = symints[i];
}

std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
  ::std::vector<at::Tensor> tensors;
  return tensors;
}

size_t SliceTensorViewFunc::num_tensors() const {
  return static_cast<size_t>(0);
}

void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
  TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());

}

at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
  return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}

std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
    std::optional<std::vector<c10::SymInt>> symints,
    std::optional<std::vector<at::Tensor>> tensors) const {
  auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
  if (symints.has_value()) {
    output->set_symints(std::move(*(symints)));
  }
  if (tensors.has_value()) {
    output->set_tensors(std::move(*(tensors)));
  }
  return output;
}
```

The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.
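
A small sketch of the replay/hot-swapping hooks described here; the exact keyword names follow this description and should be treated as an assumption:

```python
import torch

base = torch.randn(4, 6, requires_grad=True)
view = base.narrow(1, 1, 3)   # a differentiable view; its ViewFunc saves dim/start/length

# Replay the view on a new base, visiting any saved SymInts / tensors along the
# way; identity visitors are shown, fake-ification would substitute values here.
new_base = torch.randn(4, 6)
replayed = view._view_func(
    new_base,
    symint_visitor_fn=lambda s: s,
    tensor_visitor_fn=lambda t: t,
)
print(replayed.shape)  # torch.Size([4, 3])
```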

For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
2024-02-09 18:51:36 +00:00
261f0138a2 [easy] Fix pass_manager type annotation (#119499)
Summary: passes are str not callable here.

Test Plan: lint

Reviewed By: frank-wei

Differential Revision: D53592166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119499
Approved by: https://github.com/22quinn, https://github.com/Skylion007
2024-02-09 18:39:43 +00:00
suo
5747ec24b4 [export] fix canonicalization for input mutations (#119533)
The comparison was off: user_input_mutation and buffer_mutation had the same numeric value, which led the comparison to move to the next element of the tuple and try to compare `None` to `spec.buffer_mutation.buffer_name`, which doesn't work. So make them different numbers.

Differential Revision: [D53601300](https://our.internmc.facebook.com/intern/diff/D53601300/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119533
Approved by: https://github.com/zhxchen17
2024-02-09 18:30:39 +00:00
cf42dd09ca [FSDP2] Replaced version-ctx with no_grad; removed no_grad (#119550)
This PR replaces the `_unsafe_preserve_version_counters` context with a simple `torch.no_grad()` context instead. This decreases CPU overhead from (1 context enter/exit + an `N`-element loop over tensors) to just (1 context enter/exit).

This PR also removes a `torch.no_grad()` from `init_unsharded_param` as it helps compiling but does not affect eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119550
Approved by: https://github.com/Skylion007
2024-02-09 18:24:19 +00:00
f3a2094065 [Dynamo][Export] Mitigate legacy issue that aten op as export entrance function (#119528)
This is going to fix a legacy issue like:
```
torch._dynamo.export(torch.ops.aten.scaled_dot_product_attention, ...)(*inputs,)
```
This is not supported any more; now the top-level ```torch.export``` only supports ```nn.Module```, but there are still some tests using the internal APIs that caused the ```trace_rules.check``` assertion error. This PR mitigates such cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119528
Approved by: https://github.com/ydwu4
2024-02-09 18:24:09 +00:00
5356b5d1f0 [Dynamo][16/N] Move skipfiles to trace_rules.py (#119432)
This is follow-up-1 for https://github.com/pytorch/pytorch/pull/118971#issue-2114082018. Only code motion and doc update in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119432
Approved by: https://github.com/jansel
2024-02-09 18:18:23 +00:00
7082e24ce8 [quant][pt2e][bc-breaking] Set fold_quantize to True in convert_pt2e (#119425)
Summary: This is a follow up to https://github.com/pytorch/pytorch/pull/118605 to set `fold_quantize` flag to True in `convert_pt2e`
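
A sketch of the call-site implication, following the standard pt2e flow (the quantizer choice and module are illustrative):

```python
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

example_inputs = (torch.randn(1, 4),)
m = capture_pre_autograd_graph(M(), example_inputs)
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m = prepare_pt2e(m, quantizer)
m(*example_inputs)   # calibration
m = convert_pt2e(m)  # fold_quantize now defaults to True (weights quantized at convert time)
# convert_pt2e(m, fold_quantize=False) would preserve the old, unfolded behavior
```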

Test Plan: CI

Differential Revision: D53550237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119425
Approved by: https://github.com/andrewor14
2024-02-09 18:13:43 +00:00
3f82e435eb Fix delete branches (#119399)
Due to PR_WINDOW, if the magic string exists in the body but the pr was not updated recently, the query wouldn't find it and would delete the branch.  Instead, query separately for branches with the no-delete-branch label, which I created recently.

Might as well query for branches with open PRs while we're at it so PRs with the stale label won't get their branches deleted either
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119399
Approved by: https://github.com/huydhn
2024-02-09 17:28:00 +00:00
c6f39740c7 Revert "Fix delete branches (#119399)"
This reverts commit e1fc7e1ebcf4b87d5c34bf276806212c38ca00f0.

Reverted https://github.com/pytorch/pytorch/pull/119399 on behalf of https://github.com/clee2000 due to has a bug ([comment](https://github.com/pytorch/pytorch/pull/119399#issuecomment-1936291560))
2024-02-09 17:14:23 +00:00
53a6ab3fda [BE] Update Pillow to 10.2.0 (#119517)
As older versions have arbitrary code execution vulnerabilities reported by Dependabot, documented in https://nvd.nist.gov/vuln/detail/CVE-2023-50447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119517
Approved by: https://github.com/kit1980, https://github.com/seemethere
2024-02-09 17:05:28 +00:00
b1f4b2a63c [dynamo] Improve support for backwards hooks (#119525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119525
Approved by: https://github.com/yanboliang
2024-02-09 17:02:40 +00:00
5d6e323549 No TD (test removal) option in CI (#118808)
It currently doesn't do anything, but I will want these env vars later.  Maybe I should start using ghstack

Intention: --enable-td actually gets rid of tests

I am open to better names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118808
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-02-09 16:42:27 +00:00
e1fc7e1ebc Fix delete branches (#119399)
Due to PR_WINDOW, if the magic string exists in the body but the pr was not updated recently, the query wouldn't find it and would delete the branch.  Instead, query separately for branches with the no-delete-branch label, which I created recently.

Might as well query for branches with open PRs while we're at it so PRs with the stale label won't get their branches deleted either
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119399
Approved by: https://github.com/huydhn
2024-02-09 16:40:32 +00:00
5d81ade484 [Inductor max autotune] Multithreaded Precompilation (#119386)
When using the Cutlass backend, the compilation
of CUDA source files can totally dominate the runtime required for the benchmarking done
as part of Autotuning.

This change adds a multithreaded precompilation phase, which serves to pre-populate the compilation cache (both in-memory and a possible on-disk sccache).

It also ensures that unnecessary compilation
and benchmarking steps are no longer performed, as they previously were.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119386
Approved by: https://github.com/aakhundov
2024-02-09 16:11:30 +00:00
173256424a Update setuptools to 68.2.2 (#119456)
Follow-up after itself: Anaconda does not have setuptools v65, but it does have v68
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119456
Approved by: https://github.com/Skylion007
2024-02-09 15:38:25 +00:00
eff93fbd86 Revert "[Dynamo][16/N] Move skipfiles to trace_rules.py (#119432)"
This reverts commit 56364124af8fe148ba8b0c935571ebae6500f33b.

Reverted https://github.com/pytorch/pytorch/pull/119432 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/119432#issuecomment-1936122795))
2024-02-09 15:25:25 +00:00
90dabff260 Avoid COW materialize in various operations (#119506)
Operations affected include dot, cross, scatter/gather, shape, sort,
triangular, unary, scalar, pad, complex, to_list, fft

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119506
Approved by: https://github.com/ezyang
ghstack dependencies: #119501, #119502, #119503, #119504
2024-02-09 14:47:19 +00:00
8a09f1320c Avoid COW materialize in index, reduce, compare, unique, and copy ops (#119504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119504
Approved by: https://github.com/ezyang
ghstack dependencies: #119501, #119502, #119503
2024-02-09 14:47:19 +00:00
0e6b314fc2 Avoid performing replacements when it would unrefine ranges (#117356)
Fixes https://github.com/pytorch/pytorch/issues/117268; check this issue for background.

This PR does the following:

* Do not perform a replacement if the expression we're replacing the symbol with has a less refined value range than the original. There's a little bit of trickiness around the handling for values close to INT64_MAX; when checking if a range refines another, I *only* consider the range representable in 64-bit integers. This is enough to prevent us from doing a substitution like `i0 = 10 - i1`, but it appears to still let us do the other substitutions we like, such as `i0 = i1` or `i0 = 12 * i1`
* The test above is order dependent: if we assert an equality BEFORE we have refined a range, we might be willing to do the replacement because there isn't a meaningful range. This means that it's important to mark things as sizes, before you start doing other error checking. `split_with_sizes` is adjusted accordingly. It would be good to raise an error if you get the ordering wrong, but I leave this to future work.
* It turns out this is not enough to fix AOTAutograd, because we lose the size-ness of unbacked SymInts when AOTAutograd retraces the Dynamo graph. So update deferred runtime assert insertion to also insert size-ness and value ranges annotations. Note that, in principle, it shouldn't be necessary to explicitly do the latter; these should just show up as deferred runtime asserts. That's some extra refactoring for a later day.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117356
Approved by: https://github.com/lezcano
2024-02-09 14:43:58 +00:00
064610d8ac Don't guard if there are unbacked SymInts (#119312)
Fixes https://github.com/pytorch/pytorch/issues/119309

Not sure how to write the test.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119312
Approved by: https://github.com/lezcano
2024-02-09 11:02:47 +00:00
a13bb9f6a8 Add symbol_guard_limit_before_specialize (#119347)
Add a flag setting that controls a threshold of guards involving a symbol, after which we force a symbol to be specialized. The roll out plan is to enable this on OSS but not fbcode, and then roll out to fbcode after we get some telemetry from the previous PR.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119347
Approved by: https://github.com/lezcano
2024-02-09 08:44:37 +00:00
a050d146b7 [Inductor] Add Int8 data type into Inductor CPP backend vectorized code generation (#119179)
**Summary**
Part 1 of fixing https://github.com/pytorch/pytorch/issues/119141, which needs vectorized code generation for per-channel quantization and the int8 data type.
In the current implementation for quantization, the vectorized code generation only supports the `uint8` data type. In this PR, we introduce support for the `int8` data type within the vectorized code generation.

**TestPlan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_decomposed_dequant_relu_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_dequant_quant_lowering_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_dequant_maxpool2d_lowering_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_tile2d_load_decomposed_dequant_add_relu_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_per_tensor_fake_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_non_contiguous_load_buf_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_tile2d_store_channel_shuffle_cl_quant_output_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_dequant_relu_quant_dequant_relu_quant_lowering_int8
```

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119179
Approved by: https://github.com/peterbell10, https://github.com/jgong5, https://github.com/jansel
2024-02-09 07:33:12 +00:00
5918622d72 Avoid COW materialize in pooling, batch linalg, upsample, softmax ops (#119503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119503
Approved by: https://github.com/ezyang
ghstack dependencies: #119501, #119502
2024-02-09 06:52:16 +00:00
53deddd66d Avoid COW materialization for TensorInfo with const type (#119502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119502
Approved by: https://github.com/ezyang
ghstack dependencies: #119501
2024-02-09 06:51:43 +00:00
fba5b7f7c8 Avoid COW materialization for TensorAccessors with const type (#119501)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119501
Approved by: https://github.com/ezyang
2024-02-09 06:46:00 +00:00
fa071a2e1b Clarifying windows cosine behaviour in the documentation (#119444)
After following the discussion, I've created a PR to update the documentation clarifying the function's behaviour (@tqbl solution 1).

Fixes #110541

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119444
Approved by: https://github.com/malfet
2024-02-09 05:57:44 +00:00
0f2fbbff10 Enable fake tensor caching in fbcode by default (#118555)
Summary: Enabled by default in OSS; this switches the default to "on" in fbcode too.

Test Plan: Ran torchbench benchmarks in fbcode

Differential Revision: [D53189048](https://our.internmc.facebook.com/intern/diff/D53189048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118555
Approved by: https://github.com/eellison
2024-02-09 05:42:16 +00:00
2cdf9b7674 [BE] Update requests to 2.31.0 (#119516)
Fixes a potential memory leak detected by Dependabot and reported in https://nvd.nist.gov/vuln/detail/CVE-2023-32681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119516
Approved by: https://github.com/kit1980, https://github.com/seemethere
2024-02-09 05:10:16 +00:00
458e83b5b3 Revert "Add FakeTensor support to torch._utils._rebuild_tensor (#108186)"
This reverts commit 113506d2d4a0120e912c8f36e70a621f55378f81.

Reverted https://github.com/pytorch/pytorch/pull/108186 on behalf of https://github.com/atalman due to Reverted Internally ([comment](https://github.com/pytorch/pytorch/pull/108186#issuecomment-1935310344))
2024-02-09 04:19:20 +00:00
930b60f5aa Add Debug Utility To Generate Inputs for AOT Graphs (#119409)
```
    Takes in a function which has been printed with print_readable() and constructs kwargs to run it.
    Currently only handles Tensor inputs and a graph module which might have tensor constants.
    Example:
        Consider a function `forward` defined as follows:
        >>> def forward(self, primals_1: "f32[1001, 6]"):
        ...     _tensor_constant0: "i64[4190]" = self._tensor_constant0
        ...     # Further implementation
        >>> kwargs = aot_graph_input_parser(forward)
        >>> forward(**kwargs)
    """
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119409
Approved by: https://github.com/shunting314
2024-02-09 03:55:19 +00:00
2d474e17cb Don't log canonicalized expressions (#119471)
Fixes https://github.com/pytorch/pytorch/issues/119467
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119471
Approved by: https://github.com/ezyang
2024-02-09 02:46:11 +00:00
8994f2367d Revert "Fix jagged NT softmax semantics (#119459)"
This reverts commit 6adadbaf7943f760ea2375619b1783020b69d4e6.

Reverted https://github.com/pytorch/pytorch/pull/119459 on behalf of https://github.com/malfet due to broke dynamo, see https://github.com/pytorch/pytorch/actions/runs/7835402753/job/21386634602 ([comment](https://github.com/pytorch/pytorch/pull/119459#issuecomment-1935246413))
2024-02-09 02:31:49 +00:00
88429a8084 [inductor] Add split scan kernel (#117992)
This PR adds a new type of Triton kernel in which data is persistent but the
reduction dimension is split over multiple blocks (up to the entire kernel).
Though this is called a reduction dimension, in actuality we only support scans.
Because of this limitation, I have to be able to block fusions of split scan
operations with reductions, so I chose to add a new `ir.SplitScan` node which
is identical but allows for differentiation in the scheduler.

The split scan kernel is also the first to require an additional workspace buffer
which is used to communicate between CUDA blocks. This is slightly tricky as
the exact scratch space requirement isn't known until the grid size is calculated.
Here I work around the issue by setting a minimum rblock size and always allocating
to the maximum possible grid size for a given input tensor.
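
For illustration only (an assumption about what exercises the new path, not a test from this PR; requires a CUDA device), a long 1-D cumulative sum is the kind of scan the split-scan kernel targets:

```python
import torch

def prefix_sum(x):
    return torch.cumsum(x, dim=0)

compiled = torch.compile(prefix_sum)
x = torch.randn(1 << 22, device="cuda")  # too long for a single block to hold
torch.testing.assert_close(compiled(x), prefix_sum(x))
```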

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117992
Approved by: https://github.com/jansel
ghstack dependencies: #117991
2024-02-09 01:56:00 +00:00
01edb8a559 [inductor] Refactor triton range_tree handling (#117991)
Currently the dimension handling in triton kernels has various special cases e.g.
- handling "r" for non-reduction vs persistent reduction vs non-persistent reduction.
- handling "x" when `no_x_dim` is set

This adds three new properties to the range tree objects which capture the
same information in a more generic way:
- `is_loop`: true for the "r" dimension of a non-persistent reduction
- `tensor_dim`: Optional index of the triton tensor dimension
- `grid_dim`: Optional index of the triton grid dimension

The motivation here is I want to add a new split scan kernel type which is:
- not a persistent reduction, yet has `is_loop=False` for the "r" dimension
- Has a `grid_dim` for the "r" dimension

These flags now only need to be set once in `initialize_range_trees`, instead of having
to infer them throughout the code based on the tree prefix and various other kernel flags.
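
A minimal sketch of the shape of those properties (field names come from the list above; the class name and example values are illustrative, not the real range-tree objects):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RangeTreeProps:
    prefix: str                # e.g. "x" or "r"
    is_loop: bool              # True only for the "r" dim of a non-persistent reduction
    tensor_dim: Optional[int]  # index of the triton tensor dimension, if any
    grid_dim: Optional[int]    # index of the triton grid dimension, if any

# The planned split-scan "r" tree: not a loop, but mapped to a grid dimension.
split_scan_r = RangeTreeProps(prefix="r", is_loop=False, tensor_dim=1, grid_dim=1)
```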

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117991
Approved by: https://github.com/lezcano
2024-02-09 01:56:00 +00:00
6efda849b5 Update chunk_dtensor to support HYBRID_SHARD (#119481)
Fixes https://github.com/pytorch/pytorch/issues/118639.

Adds support to replicate across HSDP dimensions instead of sharding for shard placement

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119481
Approved by: https://github.com/Skylion007, https://github.com/wz337
2024-02-09 01:30:53 +00:00
454abb6b99 Disable tests that use bfloat 16 for SM < 80 (#118449)
```
`torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Internal Triton PTX codegen error:
ptxas /tmp/compile-ptx-src-83b319, line 51; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 51; error   : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 59; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 59; error   : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 65; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 65; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
ptxas fatal   : Ptx assembly aborted due to errors
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

To execute this test, run the following from the base repo dir:
     python test/inductor/test_torchinductor.py -k test_bfloat16_to_int16_cuda`
```

Fixed a test failure that uses bfloat16 on pre-SM80 GPUs (V100 is where the failure is seen for this test).

See also #113384
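
A hedged sketch of the kind of guard this implies (the decorator name is made up and the actual test change may gate things differently): skip bfloat16 tests unless the device reports compute capability SM80 or newer, so e.g. V100 (SM70) is skipped:

```python
import unittest
import torch

def requires_sm80(test_fn):
    # Skip unless the current CUDA device reports compute capability >= (8, 0).
    has_sm80 = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 0)
    return unittest.skipUnless(has_sm80, "bfloat16 requires SM80 or newer")(test_fn)
```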

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118449
Approved by: https://github.com/eqy, https://github.com/peterbell10
2024-02-09 01:27:22 +00:00
915f9db03c [Dynamo] Support kwargs for lazy module (#119445)
Summary:
Seems like `kwargs` is already supported in `_infer_argument`, so we don't need the extra assertion `len(kwargs) == 0`.

This optimization ensures compatibility with torch.compile() for LazyModules with kwargs inputs, preventing graph breaks.

Test Plan: Unit tests and CI

Differential Revision: D53558778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119445
Approved by: https://github.com/yanboliang
2024-02-09 00:46:41 +00:00
45c4a0ce9d Update setup tools to 65.5.1 (#119456)
Should resolve some Dependabot alerts by:
- Updating setuptools to 65.5.1
- Updating jinja2 to 3.3.1

TODO:
 - Update jinja2 and sphinx for the docs builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119456
Approved by: https://github.com/Skylion007
2024-02-08 23:34:41 +00:00
a8d1645f15 Revert "Add lowering for logcumsumexp (#118753)"
This reverts commit 5a77ee65879b58e99911fd53d92ddb55a1c234eb.

Reverted https://github.com/pytorch/pytorch/pull/118753 on behalf of https://github.com/jeffdaily due to broke ROCm CI, but not seen until trunk job ([comment](https://github.com/pytorch/pytorch/pull/118753#issuecomment-1935074235))
2024-02-08 23:10:33 +00:00
cyy
560c92c324 [DeviceIndex] Use DeviceIndex instead of int in CUDA wrappers (#119142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119142
Approved by: https://github.com/ezyang
2024-02-08 23:00:56 +00:00
e98dbae0a0 [ROCm] enable hipsolver backend for linalg.eigh (#115177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115177
Approved by: https://github.com/lezcano
2024-02-08 22:03:27 +00:00
suo
0f12c0af44 [export] allow user input mutation in aot_export (#119356)
This PR enables input mutation in aot_export by removing the guard and ensuring that the GraphSignature is properly wired up.

This allows us to undo the gross hack in torch.export where we lift user inputs to buffers in order to get around the lack of upstream support in aot_export. It also makes input mutation work properly for non-strict mode.

Mutations on inputs that require_grad are still banned (I added a test for a non-parameter input as well, just to make sure).

Differential Revision: [D53507440](https://our.internmc.facebook.com/intern/diff/D53507440/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119356
Approved by: https://github.com/bdhirsh, https://github.com/zhxchen17, https://github.com/titaiwangms
2024-02-08 22:02:24 +00:00
9f8ade04cc [aot_inductor] replace TORCH_CHECK with AOTI_CHECK in the generate cpp code (#119220)
In some cases where we have TORCH_CHECK in loops, it may cause the host
compiler to spend hours optimizing the run_impl function. This PR
mitigated the issue by replacing TORCH_CHECK with a custom AOTI_CHECK,
where we force the underneath assert function to be noinline.

If forcing noinline caused any serious perf regression, we could
either add an option to turn noinline on/off, or we could add
another option to just turn AOTI_CHECK into a no-op, similar
to the `assert` macro from cassert.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119220
Approved by: https://github.com/hl475, https://github.com/desertfire
2024-02-08 21:57:27 +00:00
71e772f827 Update logging.cpp for explicit chrono import (#119469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119469
Approved by: https://github.com/davidberard98
2024-02-08 21:57:23 +00:00
45e7af5818 Windows Dynamo Error Removal CI Check (#115969)
Rebase of #111313 onto `main`, for CI validation

Co-authored-by: Stella Laurenzo <stellaraccident@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115969
Approved by: https://github.com/ezyang
2024-02-08 21:23:45 +00:00
0827510fd3 [export] Remove torch._export.export (#119095)
XLA changes: https://github.com/pytorch/xla/pull/6486

Test Plan: CI

Differential Revision: D53316196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119095
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17, https://github.com/tugsbayasgalan, https://github.com/avikchaudhuri, https://github.com/jerryzh168
2024-02-08 21:22:04 +00:00
a7754b2b60 [dtensor] switch softmax backward ops to OpStrategy (#119255)
As titled. This is a followup to PR #117723 on softmax forward ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119255
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2024-02-08 21:18:39 +00:00
d9a1b25807 Fixed an issue where nn.Linear would cause an internal int underflow … (#119221)
…when trying to reshape a scalar input.

Fixes #119161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119221
Approved by: https://github.com/albanD
2024-02-08 21:06:34 +00:00
7fd6b1c558 s/print/warn in arch choice in cpp extension (#119463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119463
Approved by: https://github.com/malfet
2024-02-08 20:38:51 +00:00
db1a4dcb5a [BE] Add dtypesIfMPS to ModuleInfo enabling float16 tests for MPS and remove all skipIfMPS for float64 (#119039)
Right now, `ModuleInfo.dtypes` defaults to `torch.testing._internal.common_dtype.floating_types()`, almost no ModuleInfos override this (so only `float32` and `float64` are tested).

This is the first step to clean up/improve dtype testing for `ModuleInfos` and fix #116626.

Follow-up PRs will update `dtypes=` (and perhaps `dtypesIf{Device}`, if it makes sense) for each `ModuleInfo`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119039
Approved by: https://github.com/janeyx99
2024-02-08 20:35:32 +00:00
4e93b00b69 [Inductor] Setting kernel launch and exit callbacks for inductor generated triton kernels (#119450)
`CompiledKernel.launch_enter_hook` and `CompiledKernel.launch_exit_hook` are hooks that allow external tools to monitor the execution of Triton kernels and read each kernel's metadata. Initially, these hooks have a value of `None`.

Triton's kernel launcher passes hooks and kernel metadata by default, while Inductor's launcher doesn't. This PR could unify the parameters passed to both launchers so that tools can get information from both handwritten Triton kernels and Inductor-generated Triton kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119450
Approved by: https://github.com/jansel
2024-02-08 20:19:18 +00:00
6adadbaf79 Fix jagged NT softmax semantics (#119459)
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
2024-02-08 20:13:12 +00:00
278a0e1600 [NestedTensor] Support binary pointwise ops with >2 inputs (if inputs are non-tensors) (#119419)
It should usually be safe to run pointwise binary ops with >2 inputs. e.g. threshold_backward(tensor, tensor, scalar): we just operate on the values of the nested tensors, and pass in the other args as-is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119419
Approved by: https://github.com/soulitzer
2024-02-08 20:06:40 +00:00
cd9a1934fb [ONNX] Bump to onnx1.15.0 and ort1.17.0 in CI (#119106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119106
Approved by: https://github.com/thiagocrepaldi, https://github.com/titaiwangms
2024-02-08 19:26:13 +00:00
91f038161a [FSDP2] Used split_with_sizes_copy for all-gather copy-out (#119451)
This switches to using @yifuwang's `split_with_sizes_copy.out` fast path!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119451
Approved by: https://github.com/yifuwang
ghstack dependencies: #118017, #118118
2024-02-08 19:04:30 +00:00
suo
def572929b [export/nonstrict] always create FakeTensorMode (#119446)
Previously in non-strict mode we would source a FakeTensorMode from existing tensors if available.

It turns out this is problematic, as it means we can't directly control the behavior of this FakeTensorMode. For example, if the user-provided FakeTensorMode does not set `allow_non_fake_inputs=True`, then we get into trouble with constant tensors, etc.

At the moment, we still have to explicitly re-fakify the module state. @ezyang has recommended against this, but it's necessary because `create_aot_dispatcher_function` calls `detect_fake_mode` on all the inputs, which will error if not all the FakeTensors are on the same mode. We should straighten this out, but leaving for the future.

Differential Revision: [D53559043](https://our.internmc.facebook.com/intern/diff/D53559043/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119446
Approved by: https://github.com/ezyang, https://github.com/zhxchen17
2024-02-08 18:54:18 +00:00
7ec6ac89e8 Add lowering to special.modified_bessel_i0 (#118993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118993
Approved by: https://github.com/peterbell10
2024-02-08 18:42:40 +00:00
9242523ad5 [ET-Vulkan] aten.pow.Tensor_Tensor (#119423)
Summary:
This wires the eager-mode operation to the Vulkan shader. We only cover the case where both inputs are Tensor type, which is on par with the existing operators: add, sub, mul, div, floor_div.

It doesn't seem like we can cover [any of the other 8 cases](https://www.internalfb.com/code/fbsource/[e45c04564445b5e67ebb61e6ba53995729686526]/xplat/caffe2/torch/distributed/_tensor/ops/pointwise_ops.py?lines=310-317) right now. We categorize them below and explain what's missing for each.

## Category 1
The other 2/3 "standard" cases require one of the values to be a scalar,
```
z = torch.pow(x, y)
```
```
aten.pow.Scalar,
aten.pow.Tensor_Scalar,
aten.pow.Tensor_Tensor,
```
which is not currently supported.
```
F 00:00:01.746228 executorch:aten_bridge.cpp:21] In function check_tensor_meta(), assert failed (b.sizes().data() != nullptr): ETensor must have valid sizes array
```

## Category 2
IIUC, these operators require an out argument in the declaration. However, when they are traced they collapsed into Category 1, e.g., we obtain `aten.pow.Tensor_Tensor` not `aten.pow.Tensor_Tensor_out`.

This appears in line with current PT-Vulkan, which only [implements the other two categories](https://www.internalfb.com/code/fbsource/[f148c22604b8e409696fd64f814cda89d091fe7a]/xplat/caffe2/aten/src/ATen/native/vulkan/ops/BinaryOp.cpp?lines=533-558).
```
torch.pow(x, y, out=z)
```
```
aten.pow.Scalar_out,
aten.pow.Tensor_Scalar_out,
aten.pow.Tensor_Tensor_out,
```

## Category 3
IIUC, in-place operators are written like this:
```
x.pow_(y)
```
```
aten.pow_.Scalar,
aten.pow_.Tensor,
```
They are not currently supported.
```
  File "/data/users/jorgep31415/fbsource/buck-out/v2/gen/fbcode/b007eb344207ad7d/executorch/backends/vulkan/test/__test_vulkan_delegate__/test_vulkan_delegate#link-tree/torch/_export/verifier.py", line 188, in _check_valid_op
    raise SpecViolationError(
torch._export.verifier.SpecViolationError: operator 'aten.copy_.default' is not functional
```

Test Plan:
```
[jorgep31415@devvm15882.vll0 /data/users/jorgep31415/fbsource (fd1ed5f81)]$ buck2 test fbcode//executorch/backends/vulkan/test:test_vulkan_delegate -- test_vulkan_backend_pow
File changed: fbcode//executorch/backends/vulkan/vulkan_preprocess.py
Buck UI: https://www.internalfb.com/buck2/7f9ec9e5-cbac-4618-b8ad-d94d10bb50ff
Test UI: https://www.internalfb.com/intern/testinfra/testrun/562950306906309
Network: Up: 3.2KiB  Down: 0B  (reSessionID-ea5af789-c131-4170-ba20-5c5c9718276b)
Jobs completed: 7. Time elapsed: 48.5s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D53547865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119423
Approved by: https://github.com/SS-JIA, https://github.com/malfet
2024-02-08 18:31:33 +00:00
b51b27922b Add to_empty() suggestion in the error message (#119353)
Fixes #119293, the comprehensive documentation is [here](0f478d9d61/docs/source/meta.rst (id11)).
Just added the suggestion to the error message so it is more informative to the user.

@albanD

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119353
Approved by: https://github.com/mikaylagawarecki
2024-02-08 18:30:02 +00:00
5a77ee6587 Add lowering for logcumsumexp (#118753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118753
Approved by: https://github.com/peterbell10
2024-02-08 18:29:34 +00:00
7315ec7505 Revert "Fix estimate_nccl_collective_runtime (#118986)"
This reverts commit 0dab6fb35284ed47d1c6339e9d71e4ca3b50dc51.

Reverted https://github.com/pytorch/pytorch/pull/118986 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/118986#issuecomment-1934680463))
2024-02-08 18:11:53 +00:00
1d61011c11 [MPS] Add support for complex scalars (#119318)
- Switch to native complex support if running on MacOS Monterey or newer for binary ops.
- Python complex scalars are always represented in PyTorch as ComplexDouble, but MPS does not yet support double-precision types, so we downcast them to floats
- Also add `cf`(for complex float)  and `ch`(for complex half) to MPSScalar value union
- Fix complex-scalar-to-view promotion by introducing the `legacy_complex_as_view` helper function, which converts non-float types to complex and promotes CPU complex scalars to MPS before turning them into a view.
- Add `test_tensor_scalar_binops`

Fixes https://github.com/pytorch/pytorch/issues/119088

Test plan: CI (have quite a lot of tests, see new unexpected successes) +  `python -c "import torch;x,y=torch.rand(2, 2, dtype=torch.cfloat, device='mps'),torch.tensor(2+3j,dtype=torch.chalf);print(y+x)"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119318
Approved by: https://github.com/albanD
2024-02-08 18:10:59 +00:00
2b9cba86cf Fix deadlock in ExecutionTraceObserver (#119242) (#119398)
Summary:

With the compiled PyTorch module, in execution_trace_observer.cpp, function convertIValue calls TensorImpl->storage_offset(). That function call will trigger a recursive call into recordOperatorStart. It will cause a deadlock on ob.g_mutex.

This DIFF is to fix this deadlock by replacing std::mutex with std::recursive_mutex.

Since PyTorch only has one thread for FWD and one thread for BWD, the contention is very low and the performance should NOT be a concern.

Test Plan:
Unit Test
    buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2

Differential Revision: D53533253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119398
Approved by: https://github.com/aaronenyeshi
2024-02-08 18:00:51 +00:00
896cf9d1ce [inductor][cpp] vectorization support for int32/int64 (#119001)
This pull request aims to complete most of the support for vectorizing int32 and int64 data types except for indirect indexing and masks. The basic data type support for uint32 and uint64 is also added but without vectorization. More vectorized conversion functions are added between integer and float. In order to support int64 vectors, a new VectorizedN class is introduced to handle vectors of arbitrary length. Below are the details:
1. Complete most of the int32 and int64 vectorization support including load, store, reduction, constant and conversion. The indirect indexing and masks will be addressed in follow-up PRs, after which, the legality checking logic in `CppVecKernelChecker` can be further simplified.
2. Util functions for conversion between integer and float vectors (in cpp_prefix.h and ATen vec). Ideally, we'd better move them from cpp_prefix.h to ATen vec to simplify cpp_prefix.h, will be addressed in follow-up PRs.
3. Introduced a new template class VectorizedN, designed to handle vectors of arbitrary length by encapsulating multiple Vectorized<T> instances. This class supports most of the operations of `Vectorized<T>`. It makes the support of int64 vectorization simpler. I will also apply it to bf16/fp16/int8 in the follow-up PRs for better efficiency. For example, bf16 currently only uses half of the vector lanes. With `VectorizedN`, we can use full of the lanes and map bf16 vector to `VectorizedN<float,2>` on conversion.
4. Basic data type support is added for uint32 and uint64 (in graph.py). Vectorization support will be added later but not of high priority due to fewer usages.

Next steps:

- [ ] Refactor the vector mask handling to support data types other than float. Currently vector masks are implemented with float vectors.
- [ ] Fully utilize vector lanes for bfloat16/float16/int8.
- [ ] Support indirect indexing with vectorized index via scalarization.
- [ ] Clean up `CppVecKernelChecker`.
- [ ] Simplify `cpp_prefix.h` including refactoring vector conversion logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119001
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-02-08 17:38:49 +00:00
8182fce769 Revert "Add cpp stack traces to our own reruns (#119408)"
This reverts commit fbe6f6236e25e27e5968715f824dc8bfb0e37213.

Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/malfet due to Looks like it introduced intermittent crashes see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1934589057))
2024-02-08 17:20:39 +00:00
8da2f81527 [export] Convert internal tests to using .module() (#119105)
Test Plan: CI

Differential Revision: D53091904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119105
Approved by: https://github.com/ydwu4
2024-02-08 17:19:07 +00:00
c3e0836084 [export] Remove CallSpec (#117671)
Summary: This is not really being used anywhere

Test Plan: CI

Differential Revision: D52842563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117671
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2024-02-08 17:19:03 +00:00
9436710afd Implement shallow copy functions for FunctionalTensorWrapper. (#118783)
Fix: #115792

This PR implements 2 virtual functions of `TensorImpl` that are called when setting the
`tensor.data`:

- `shallow_copy_from`: which calls `copy_tensor_metadata`; and

- `copy_tensor_metadata`: which copies all `FunctionalTensorWrapper` metadata and ~calls
`dest->value_.set_data(src->value_)`~ assigns `dest->value_ = src->value_`, so as to copy also the inner tensor using the same
method

Before this PR, the inner tensor of a `FunctionalTensorWrapper` was being ignored.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118783
Approved by: https://github.com/bdhirsh
2024-02-08 17:15:46 +00:00
6d8f192fd0 [DCP] Call os.sync if os.fsync does not work for fsspec (#119287)
Some fsspec storage may not support fileno(). In such a case, we fall back to os.sync()

It may not be necessary to call `os.sync()`, as in such a case the storage may be a remote storage that requires a special sync API call.

Differential Revision: [D53433425](https://our.internmc.facebook.com/intern/diff/D53433425/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119287
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #118888
2024-02-08 17:10:38 +00:00
b251bca205 [dynamo] inlining into __iter__ of user defined object (#119243)
Fixes #119198.

This PR makes dynamo inline `__iter__` of a user-defined object instead of creating a graph break. Also added a new test, which shows:
1. the loop is unrolled
2. the length of the loop is guarded when inlining `__iter__`
```python
class Mod:
    def __init__(self):
        self.a = [torch.randn(2, 2), torch.randn(2, 2)]

    def __iter__(self):
        return iter(self.a)

def f(mod):
    ret = []
    for x in mod:
        ret.append(x + 1)
    return ret
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119243
Approved by: https://github.com/jansel
2024-02-08 17:07:30 +00:00
b181e52a8f [export] Support non-tensor tuple hoo outputs (#119402)
There's an internal custom op which has a None output, so when it becomes auto_functionalized, the HOO's output is (None, Tensor, Tensor, ...). This PR adds support for the None output, and any int/bool outputs from HOOs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119402
Approved by: https://github.com/suo, https://github.com/avikchaudhuri
2024-02-08 16:54:40 +00:00
7f05c72864 [nccl flight recorder] record time we discover start and complete (#119249)
Some APIs like ncclCommAbort can cause nccl kernels to finish even if
they were previously stuck. Because we can gather the trace buffer after
those calls, we can end up seeing some collectives marked completed even though
the completion happened several minutes after they started and clearly after
the timeout. This changes how we record state so that we keep track of the time
we discover a state change, so even if the collective eventually gets marked complete,
we can observe that it happened minutes after it was scheduled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119249
Approved by: https://github.com/wconstab
2024-02-08 16:48:33 +00:00
3a8bf25fdd [SparseCsr] Remove triton sdpa skip after triton pin update (#109601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109601
Approved by: https://github.com/desertfire, https://github.com/amjames
2024-02-08 16:40:25 +00:00
d947534782 [DCP] Enable filesystem/fsspec auto detection (#118888)
This API enables the ability to automatically detect whether to use filesystem or fsspec based on the checkpoint_id.

Differential Revision: [D53318043](https://our.internmc.facebook.com/intern/diff/D53318043/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118888
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-02-08 16:38:04 +00:00
4f2bf7fa87 Print the value of constants in __str__ (#119276)
Not sure why we haven't been doing this really...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119276
Approved by: https://github.com/jansel
2024-02-08 16:23:36 +00:00
579999a731 [PyTorch] Back scalar value to pinned memory for .item() (#119202)
Summary: This diff optimizes the .item() call by backing the scalar value storage with pinned memory, so we don't create an implicit synchronization with the libcuda library.

Test Plan:
# Prod VDD model on H100
Vanguard runs
9.8k qps -> 10.1k qps (~3% improvement)

# .item() Benchmark
1 thread 50k iterations

consistent ~2-3% improvements

With pinned memory
item() took 1.627608060836792 seconds
item() took 1.635591983795166 seconds
item() took 1.6398141384124756 seconds
item() took 1.6378591060638428 seconds
item() took 1.618534803390503 seconds
item() took 1.6467158794403076 seconds
item() took 1.6278800964355469 seconds
item() took 1.6205573081970215 seconds
item() took 1.64951753616333 seconds
item() took 1.6286702156066895 seconds

w/o pinned memory
item() took 1.6783554553985596 seconds
item() took 1.6670520305633545 seconds
item() took 1.6748230457305908 seconds
item() took 1.6708712577819824 seconds
item() took 1.6836023330688477 seconds
item() took 1.6518056392669678 seconds
item() took 1.6769678592681885 seconds
item() took 1.661888837814331 seconds
item() took 1.6627326011657715 seconds
item() took 1.6908581256866455 seconds
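
A rough reconstruction of the micro-benchmark above (single thread, 50k `.item()` calls; the exact tensor and harness are assumptions):

```python
import time
import torch

x = torch.ones(1, device="cuda")
torch.cuda.synchronize()
start = time.time()
for _ in range(50_000):
    x.item()
print(f"item() took {time.time() - start} seconds")
```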

Differential Revision: D53431148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119202
Approved by: https://github.com/xw285cornell
2024-02-08 16:23:15 +00:00
08657b82f5 Reduce scope of dispatching in logcumsumexp_backward (#119397)
Everything inside the `AT_DISPATCH` block is being compiled 5 times,
so it makes sense to limit it to the only line that uses `scalar_t` which is
the `numeric_limits` query.

Also a small optimization, instead of computing `grad.log()` and `(-grad).log()`
we can compute `grad.abs().log()` which is 2 pointwise ops instead of 3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119397
Approved by: https://github.com/lezcano, https://github.com/albanD
2024-02-08 15:09:22 +00:00
56364124af [Dynamo][16/N] Move skipfiles to trace_rules.py (#119432)
This is follow-up-1 for https://github.com/pytorch/pytorch/pull/118971#issue-2114082018. Only code motion and doc update in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119432
Approved by: https://github.com/jansel
2024-02-08 09:41:52 +00:00
0a41ac3cf3 [1/2] Intel GPU Runtime Upstreaming for Stream (#117611)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second runtime component we would like to upstream is `Stream` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 2 PRs. This is one of the 2 PRs and covers the changes under `c10`.

# Design
Intel GPU stream is a wrapper of sycl queue which schedules kernels on a sycl device. In our design, we will maintain a sycl queue pool containing 32 queues per priority per device. And when a queue is requested one of these queues is returned round-robin. The corresponding C++ files related to `Device` will be placed in `c10/xpu` folder. We provide the `c10::xpu::XPUStream` APIs, like
 - `XPUStream getStreamFromPool`
 - `XPUStream getCurrentXPUStream`
 - `void setCurrentXPUStream`
 - `void device_synchronize`

# Additional Context
In our plan, 2 PRs should be submitted to PyTorch for `Stream`:
1. for c10
2. for python frontend.

The differences from CUDA:
XPU has no default or external stream, and it lacks the APIs below:
- `getDefaultCUDAStream`
- `getStreamFromExternal`

For CUDA, `cuda::device_synchronize` can sync all streams on the device, but for XPU, `xpu::sync_streams_on_device` only syncs all reserved streams on the device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117611
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-02-08 09:07:23 +00:00
cyy
7d516bbd5f Update MacOS deployment target to OS version 11.1 (#119373)
To avoid the following error:
```
2024-02-07T12:49:51.8306390Z ld: warning: dylib (/Users/runner/work/_temp/anaconda/envs/wheel_py38/lib/libomp.dylib) was built for newer macOS version (11.1) than being linked (11.0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119373
Approved by: https://github.com/huydhn
2024-02-08 08:19:42 +00:00
5f6b35915a [executorch hash update] update the pinned executorch hash (#119336)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119336
Approved by: https://github.com/pytorchbot
2024-02-08 03:38:53 +00:00
f579c65ef6 Release GIL for torch::autograd::clear_autocast_cache (#119416)
Fixes #119262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119416
Approved by: https://github.com/albanD
2024-02-08 03:22:48 +00:00
9d6bf20022 [FSDP2] Added backward prefetching (#118118)
This PR adds explicit backward prefetching to overlap communication and computation in backward (namely, needed for `reshard_after_forward=True` or `reshard_after_forward: int`). We do this by recording the post-forward order and using its reverse to approximate the backward order.

This works for the typical 1 forward / 1 backward training. However, for more complex schedules, this can run into some gaps:
- We need to know the _true end of backward_.
    - At the true end of backward, we can clear our recorded post-forward order and pre-backward hook state, and we should wait on gradient reductions.
    - There is no easy way to know whether the current backward marks the true end of backward. Therefore, we introduce an API for the user to set this: `fsdp_module.set_is_last_backward(bool)`. For example, for pipeline parallelism's DFS cooldown backward, we can call `fsdp_module.set_is_last_backward(is_last_microbatch)` (see the sketch after this list).
- When the user runs backward through only part of the model, our reverse-post-forward-order heuristic risks _mistargeted prefetches_ for unused modules, which would mean the module's parameters are all-gathered and not freed until the end of backward.
    - To err on the side of less memory usage (but no overlap), this PR introduces logic to check whether a module will need its unshard in the current backward (by recording the module's `forward` outputs' `grad_fn`s and querying the autograd engine).
    - Note that there may be _no_ overlap in backward for some parts due to no prefetching.
    - Note further that when running multiple backwards, if the user does not use `set_is_last_backward`, we may not be able to provide a meaningful error message, as the pre-backward hook could be erroneously cleared on the 1st backward.
    - In the future, we may expose more APIs from the autograd engine (similar to `_current_graph_task_execution_order`) to make the prefetching exact. (Currently, `_current_graph_task_execution_order` requires the `with torch.autograd.set_multithreading_enabled(False)`, which is too hard of a constraint as we cannot easily modify users' training loops. We can replace the multi-threading check with a device check. Moreover, in the partial backward case in this PR's unit test, I still hit an [internal assertion](b816760a2f/torch/csrc/autograd/engine.cpp (L476)), so some follow-up is required.)
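
A hedged sketch of the `set_is_last_backward` usage described above, for a microbatched loop; the method name comes from this PR, while the surrounding loop, argument names, and optimizer handling are assumptions:

```python
import torch

def run_microbatch_backwards(model, microbatches, loss_fn, optimizer):
    for i, microbatch in enumerate(microbatches):
        loss = loss_fn(model(microbatch))
        # Only the final microbatch marks the true end of backward, so FSDP
        # knows when to wait on gradient reductions and clear its recorded
        # post-forward order.
        model.set_is_last_backward(i == len(microbatches) - 1)
        loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```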

<details>
<summary> Old Discussion </summary>

For discussion:
- The PR includes a counter `expected_backward_unshard_count` to mitigate mistargeted prefetches in backward. However, it can be seen as a necessary but not sufficient solution.
    - If a module's outputs do not require gradient, then we certainly do not need to unshard the module in backward.
    - However, if a module's outputs do require gradient, then we still may not need to unshard the module for _this_ backward (e.g. if the module did not contribute to `loss` for the current `loss.backward()`).
    - This counter will only address the first case but not the second. If we want to address the second, then we may need more info from the autograd engine.
- For now, I did not include any unit test to cover these behaviors, as I do not have a good example yet.
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118118
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118017
2024-02-08 03:17:45 +00:00
1d2382f141 [DDP] Use compiled_autograd to trace DDP backward allreduce (#110662)
**Summary**
The reducer of `DistributedDataParallel`  is implemented with C++ and it is not easy to trace the allreduce launched in the reducer. This PR modifies `DistributedDataParallel` to launch one allreduce per gradient when `compiled_autograd` is enabled. The changes allow us to use `compiled_autograd` to trace the allreduce and later be optimized (fused) in the Inductor.

**Key Logic**
1. If `ddp_python_hook` is True, we assume `compiled_autograd` is used. `DistributedDataParallel` registers `compiled_accum_grad_hook` for all parameters.
2. In the first forward() call, if `DistributedDataParallel` is not compiled, all  `compiled_accum_grad_hook` are deregistered. If `DistributedDataParallel` is compiled, all `compiled_accum_grad_hook` will be compiled by `compiled_autograd`.
3.  `compiled_accum_grad_hook` launches an allreduce to reduce the gradient of the parameter.

**Bucketing**
The compiled backward is slow because there is no bucketing for the allreduces. We rely on Inductor to bucket the allreduces.

The bucketing is done in a separate PR.
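
As a rough sketch of the "one allreduce per gradient" idea (this is not the actual reducer or hook-registration code; the post-accumulate-grad hook mechanism shown here is an assumption used only for illustration):

```python
import torch
import torch.distributed as dist

def install_per_grad_allreduce(module: torch.nn.Module) -> None:
    # Average each parameter's gradient as soon as it is accumulated, instead
    # of relying on the C++ reducer's bucketed allreduce.
    for param in module.parameters():
        if param.requires_grad:
            param.register_post_accumulate_grad_hook(
                lambda p: dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
            )
```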

Differential Revision: [D49428482](https://our.internmc.facebook.com/intern/diff/D49428482/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110662
Approved by: https://github.com/wconstab
2024-02-08 03:03:15 +00:00
113506d2d4 Add FakeTensor support to torch._utils._rebuild_tensor (#108186)
Partially fixes https://github.com/pytorch/pytorch/issues/105077

Repro:

```python
import tempfile
import torch
from torch._subclasses import fake_tensor

class TheModelClass(torch.nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.fc1 = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.fc1(x)

with tempfile.NamedTemporaryFile() as state_dict_file:
    # Create state_dict to be loaded later
    model = TheModelClass()
    torch.save(model.state_dict(), state_dict_file.name)

    fake_mode = fake_tensor.FakeTensorMode()
    with fake_mode:
        # This is where the bug is triggered
        state_dict = torch.load(state_dict_file.name)
```

Error:

```bash
Traceback (most recent call last):
  File "issue_gh_torch_105077.py", line 22, in <module>
    state_dict = torch.load(state_dict_file.name)
  File "/opt/pytorch/torch/serialization.py", line 1014, in load
    return _load(opened_zipfile,
  File "/opt/pytorch/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/opt/pytorch/torch/_utils.py", line 205, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/_utils.py", line 184, in _rebuild_tensor
    return t.set_(storage._untyped_storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1288, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1468, in dispatch
    self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1733, in invalidate_written_to_constants
    _, new_kwargs = normalize_function(
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 297, in normalize_function
    torch_op_schemas = get_signature_for_torch_op(target)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in get_signature_for_torch_op
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in <listcomp>
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 70, in _torchscript_schema_to_signature
    arg_type = _torchscript_type_to_python_type(arg.type)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 64, in _torchscript_type_to_python_type
    return eval(ts_type.annotation_str, _type_eval_globals)
  File "<string>", line 1, in <module>
NameError: name 'Storage' is not defined
```

This PR adds the ability to create fake tensors during `torch.load` by wrapping the `torch.tensor.set_` call around a `torch.utils._mode_utils.no_dispatch()` to skip fake mode dispatcher for it and thus create a real tensor. It later calls `fake_mode.from_tensor(t)` to finally create the fake tensor.

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108186
Approved by: https://github.com/ezyang
2024-02-08 03:01:34 +00:00
9a992b0918 [4/4] Intel GPU Runtime Upstreaming for Device (#116869)
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this last PR  covers the changes under lazy initialization.

# Design
This PR primarily offers the support of multi-processing via lazy initialization. We lazily initialize our runtime avoiding initializing XPU until the first time it is accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability.

# Additional Context
We adopt a similar design to CUDA. So we share some code with CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116869
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
ghstack dependencies: #119248
2024-02-08 03:01:21 +00:00
3cb7ec312c [PT-Vulkan] aten::conv1d - opt: width-pack weight tensor (>2x speedup) (#118835)
## This diff
This optimization reduces calls to `texelFetch(uKernel, ...)` by 4.

We borrow MatMul's work to do the re-packing:

https://www.internalfb.com/code/fbsource/[7e8ef1b8adeda224a736f8cc4bf870e0a659df95]/xplat/caffe2/aten/src/ATen/native/vulkan/ops/Mm.cpp?lines=20%2C50

## Future optimizations

We are already batching reads from input/weight tensors, and writes to output tensor.

Here are other ideas, which I won't pursue for now. (2) is the most doable.
1. **Batch reads/writes along the dimension that is most commonly > 1.** For weights, the length dimension is definitely correct here, but input/outputs could potentially leverage the length dimensions too. However, `stride != 1` would complicate this optimization.
2. **Batch an optimal number of reads/writes.** Instead of defaulting to 4 elements (since that corresponds to 1 texel), consider more elements such as MatMul's 4x4 texel tile.
3. **Obscure shader compiler optimizations.** Since MatMul seemed to benefit from several seemingly equivalent ways to write code.

Differential Revision: [D53204674](https://our.internmc.facebook.com/intern/diff/D53204674/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118835
Approved by: https://github.com/SS-JIA, https://github.com/liuk22
2024-02-08 02:23:51 +00:00
2349e473f1 Forward fix for same_shape oblivious guard (#119383)
Fixes internal test

```
buck2 test '@fbcode//mode/opt' fbcode//accelerators/workloads/models/slimdsnn:slimdsnn_test -- --exact 'accelerators/workloads/models/slimdsnn:slimdsnn_test - test_generate (accelerators.workloads.models.slimdsnn.test_slimdsnn.SlimDSNN)'
```

And I added an OSS test that approximates the internal situation.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D53544208](https://our.internmc.facebook.com/intern/diff/D53544208)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119383
Approved by: https://github.com/atalman, https://github.com/albanD
2024-02-08 02:11:46 +00:00
64aaa8f508 Fix typo on Contribution Guide (#119428)
Fixes #119427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119428
Approved by: https://github.com/awgu, https://github.com/kit1980
2024-02-08 01:07:27 +00:00
fbe6f6236e Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we have both pytest rerunning the failing test twice via 81abc2b249/test/run_test.py (L966) and our own logic retrying it as well.

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-08 00:54:16 +00:00
33761969a4 Remove parent device mesh check (#118620)
Removes the error raised if a device_mesh has a parent.

The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are:
- this check
- https://github.com/pytorch/pytorch/pull/118618
- a series of PRs related to checkpointing with 3D meshes that I will open
We currently monkeypatch for the above which I am slowly upstreaming.

I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118620
Approved by: https://github.com/Skylion007
2024-02-08 00:49:28 +00:00
029a16c41f [c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119099
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-02-07 22:29:29 +00:00
6fe5a3adaf release GIL for cudaEventDestroy (#119393)
cudaEventDestroy can become blocking under some circumstances, and then holding GIL will lead to deadlocks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119393
Approved by: https://github.com/Skylion007
2024-02-07 22:16:18 +00:00
ad75d9e2ca [easy] Fix test_triton_kernel_reinterpret_view_mem_leak by cloning fwd input (#119219)
```

$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_reinterpret_view_mem_leak

# Before
RuntimeError:
Found following user inputs located at [0] are mutated. This is currently banned in the aot_export workflow.
If you need this functionality, please file a github issue.

fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=True, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutates_storage_metadata=False, requires_grad=False, mutation_type=<MutationType.MUTATED_OUT_GRAPH: 3>),...)

# Now
Ran 6 tests in 13.851s
OK (skipped=4)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119219
Approved by: https://github.com/oulgen
2024-02-07 21:30:16 +00:00
81abc2b249 Revert "[quant][pt2e][bc-breaking] Remove fold_quantize flag (#118701)"
This reverts commit 482d952e880cf78c103a06f2d483556ab0a89138.

Reverted https://github.com/pytorch/pytorch/pull/118701 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/118701#issuecomment-1932866964))
2024-02-07 20:56:16 +00:00
a6e16fe202 Fix global in header warning (#119380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119380
Approved by: https://github.com/janeyx99
2024-02-07 20:35:21 +00:00
35aa353c48 Change watchdog log from "NCCL" to "Process group" (#118121)
This PR changes the watchdog log.
In order to avoid confusion that NCCL creates a watchdog thread and reports the error log, it is better to change "NCCL" to "Process group" to better indicate the source of the log.

@wconstab

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118121
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2024-02-07 20:14:49 +00:00
892a7bf674 [BE]: Add filelock typing to mypy stubs (#119390)
Realized we used filelock in some places, but didn't have a mypy type stub for it. Noticed it in this PR: https://github.com/pytorch/pytorch/pull/119386
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119390
Approved by: https://github.com/albanD, https://github.com/malfet
2024-02-07 20:14:28 +00:00
d0db80126e [EZ][CI] Fetch full history for MPS jobs (#119401)
Otherwise emitting TD stats will fail with following warning:
```
Emiting td_test_failure_stats
/Users/ec2-user/runner/_work/pytorch/pytorch/tools/testing/target_determination/heuristics/edited_by_pr.py:37: UserWarning: Can't query changed test files due to Command '['git', 'merge-base', 'origin/main', 'HEAD']' returned non-zero exit status 1.
  warn(f"Can't query changed test files due to {e}")
```

Test plan: Observe that MPS jobs finish without those warnings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119401
Approved by: https://github.com/atalman, https://github.com/huydhn
2024-02-07 19:29:30 +00:00
51fb99250b Fix missing MAST log when there is Unicode non-decodable text in logs (#119298)
Summary:
## Issue
When there is Unicode non-decodable text in logs, `tail_logger` will stop working afterwards, i.e. f527390102

In the example, the process stopped producing Python logs after 17:20:21 until the job finished
```
[0]:I0201 17:20:21.338000 3429 gen_ai/genie_projects/llm/metaformers/reward_model_score.py:335] Progress: 118 batches out of 512 total batches. 23.05 % | (gpu mem: 25.8GB, free CPU mem: 1387.8GB)
I0201 17:39:14 Stopping twtask-main.service with Service Result: [success] Exit Code: [exited] Exit Status: [0]
```
At the end, `UnicodeDecodeError` was thrown at the end with no call stack.

## Fix
Use `errors="replace"` to avoid throwing exception when `UnicodeDecodeError` happens.
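
A minimal illustration of the fix (not the actual `tail_logger` code): decoding with `errors="replace"` turns undecodable bytes into U+FFFD instead of raising `UnicodeDecodeError` and killing the tailer:

```python
raw = b"Progress: 118 batches \xff\xfe done\n"

# Strict decoding raises and would stop the log tailer.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decode failed:", e)

# errors="replace" keeps the tailer alive.
print(raw.decode("utf-8", errors="replace"))
```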

Test Plan: f528854819

Differential Revision: D53483644

Co-authored-by: Jack Zhang <jackzh@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119298
Approved by: https://github.com/XilunWu
2024-02-07 19:25:43 +00:00
02c24b0b5e Add Python binding resizable to class {Untyped,Typed}Storage (#119286)
This PR exposes `resizable` method of `StorageImpl` to Python frontend to make it accessible for users.
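
A hedged usage sketch (the call site is an assumption; the method name matches this PR):

```python
import torch

s = torch.empty(4).untyped_storage()
print(s.resizable())  # expected True for an ordinary CPU storage that owns its memory
```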

Fixes #119233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119286
Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki
2024-02-07 19:15:55 +00:00
d054cd3e44 [FSDP2] Added reshard_after_forward (#118017)
This PR adds the `reshard_after_forward: Union[bool, int]` arg and a `reshard()` method. The `reshard_after_forward` argument trades off communication and memory.
- `reshard_after_forward=True`: reshard parameters after forward; unshard (all-gather) in backward
- `reshard_after_forward=False`: no reshard of parameters after forward; no unshard (all-gather) in backward
- `reshard_after_forward: int`: reshard parameters to a smaller world size; unshard (all-gather) over small world size in backward

In comparison with DeepSpeed and existing FSDP:
- `reshard_after_forward=True` == `FULL_SHARD` == ZeRO-3
- `reshard_after_forward=False` == `SHARD_GRAD_OP` == ZeRO-2
- `reshard_after_forward=8` == ZeRO++

ZeRO-1 is `reshard_after_forward=False` without gradient reduction (implemented in a later PR). If we need gradient reduction on an iteration, then ZeRO-2 supersedes ZeRO-1.

We prefer a simple state transition between `SHARDED` / `SHARDED_POST_FORWARD` and `UNSHARDED`, where the state directly defines what tensors are registered to the module. In particular, we _do not_ have a state where the sharded parameters are registered but the unsharded parameters are still in GPU memory. This greatly simplifies our state transitions, but it means that parameters may be non-intuitively registered to the module (e.g. if only the root does not reshard after forward, then the root will be the only without sharded parameters registered). To address this, we introduce a simple `reshard()` method that can force-reshard the parameters. This makes sense to me because the typical case does not care about the registered parameters after forward (in fact, for existing FSDP with `use_orig_params=False`, the unsharded parameters are still registered and are dangling tensors without storage.)

I plan to expose a complementary `unshard(async_op: bool = True)` method in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118017
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-02-07 19:14:20 +00:00
482d952e88 [quant][pt2e][bc-breaking] Remove fold_quantize flag (#118701)
Summary:
This is a follow up to https://github.com/pytorch/pytorch/pull/118605 to remove `fold_quantize` flag from
`convert_pt2e`

Test Plan: CI

Differential Revision: D53247301

BC Breaking Note:

The flag `fold_quantize` now defaults to True in `convert_pt2e`, so we'll fold the quantize op into the weight by default, and users will see a model size reduction by default after pt2e quantization.
2.2
```
folded_model = convert_pt2e(model, fold_quantize=True)

non_folded_model = convert_pt2e(model)
```

2.3
```
folded_model = convert_pt2e(model)

non_folded_model = convert_pt2e(model, fold_quantize=False)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118701
Approved by: https://github.com/andrewor14, https://github.com/leslie-fang-intel
2024-02-07 19:10:51 +00:00
0e2330d84c fix lint (#119395)
Summary: as title

Test Plan: lint

Differential Revision: D53532399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119395
Approved by: https://github.com/tugsbayasgalan, https://github.com/malfet
2024-02-07 19:06:41 +00:00
23b030a79c [easy] Add testing utilities for torch.nn.utils.set_swap_module_params_on_conversion (#118023)
For the above PR, to parametrize existing `load_state_dict` tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118023
Approved by: https://github.com/albanD
ghstack dependencies: #118028, #117167
2024-02-07 18:55:44 +00:00
d5a718d27b Add swap_tensors path to nn.Module._apply (#117167)
Added `torch.__future__.{get/set}_swap_module_params_on_conversion`, which defaults to `False` for now, but we probably want to modify `nn.Module._apply` to override this and default to `True` if the input is a tensor subclass.

From offline discussion, for now we are **not** allowing `swap_tensor` after the first module forward has been run*** if the autograd graph is still alive. The reason being that `torch.utils.swap_tensors(t1, t2)` requires the `use_count` of both `TensorImpl`s associated with `t1` and `t2` to be 1.  The first forward pass will install `AccumulateGrad` nodes on each param, which [bump the refcount of the associated TensorImpl](6cf1fc66e3/torch/csrc/autograd/variable.cpp (L307)). **Future work might be to swap the refs that the `AccumulateGrad` nodes hold if it is necessary.**

***From this, it might seem like we don't need to handle gradients. However, I still handle the grads for the edge case that the grads are set via `p.grad = grad` OR the autograd graph is no longer alive because the output has been garbage collected.

If any `swap_tensors` fails on any of the parameters in the `nn.Module` we raise an error.

**`RNNBase` overrides `nn.Module._apply()` and installs weakrefs on some parameters. As a result, all modules that inherit from `RNNBase` (`RNN`, `GRU` and `LSTM`) cannot use the`swap_tensors` path as of now**
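
A small sketch of the new future flag in action (illustrative only; behavior beyond what is described above is an assumption):
```python
import torch
import torch.nn as nn

# Opt in to the new behavior described above (defaults to False for now).
torch.__future__.set_swap_module_params_on_conversion(True)

m = nn.Linear(2, 2)
m.to(torch.float64)  # _apply now uses torch.utils.swap_tensors instead of setting .data
print(m.weight.dtype)  # torch.float64

torch.__future__.set_swap_module_params_on_conversion(False)
```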

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117167
Approved by: https://github.com/albanD
ghstack dependencies: #118028
2024-02-07 18:55:44 +00:00
91d1d2c421 Make MHA Query Scaling Behaviors Consistent (#119323)
The multi-head attention (MHA) query scaling behaviors are not consistent when [`need_weights`](8ac9b20d4b/torch/nn/modules/activation.py (L1073)) values are different.

On the current main, when `need_weights = True`, the query scaling was performed using a [division](8ac9b20d4b/torch/nn/functional.py (L5434)) and it will be exported as a `Div` operator in ONNX. When `need_weights = False`, the query scaling was performed using a [multiplication](422b4271ae/aten/src/ATen/native/transformers/attention.cpp (L711)) and it will be exported as a `Mul` operator in ONNX defined in the [PyTorch ONNX Symbolics](422b4271ae/torch/onnx/symbolic_opset14.py (L177)).

We should make the query scaling behaviors consistent. On most of the platforms, multiplication performs no worse than division. Therefore, we should use multiplication consistently for both `need_weights = True` and `need_weights = False`.
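
For reference, the two formulations are numerically interchangeable up to floating-point rounding (a standalone illustration, not the library code):
```python
import math
import torch

q = torch.randn(2, 4, 8)
scale = 1.0 / math.sqrt(q.size(-1))
# Division (the need_weights=True path today) vs. multiplication by the reciprocal
# (the path this PR standardizes on) produce the same scaled queries.
torch.testing.assert_close(q / math.sqrt(q.size(-1)), q * scale)
```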
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119323
Approved by: https://github.com/mikaylagawarecki, https://github.com/albanD
2024-02-07 18:42:57 +00:00
5eda355e54 [inductor, test] remove cast for test_pow2_cpu (#114912)
Verifies https://github.com/pytorch/pytorch/issues/94010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114912
Approved by: https://github.com/angelayi
2024-02-07 18:32:30 +00:00
0dab6fb352 Fix estimate_nccl_collective_runtime (#118986)
`estimate_nccl_collective_runtime` has been broken and the errors have been silently swallowed by inductor. This PR:
- Fixes the issues described in https://github.com/pytorch/pytorch/issues/118497.
- Adds white-box testing so future issues can be surfaced in tests.
- Add support for native funcol IRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118986
Approved by: https://github.com/yf225
ghstack dependencies: #118910, #118911, #118437
2024-02-07 18:02:51 +00:00
088d538a8d Revert "[Inductor] GEMM shape padding improvements (#118522)"
This reverts commit cc46829f96dba05b9b46bae31a1e6d2a053f667e.

Reverted https://github.com/pytorch/pytorch/pull/118522 on behalf of https://github.com/eellison due to regresses HF ~4/5% ([comment](https://github.com/pytorch/pytorch/pull/118522#issuecomment-1932557670))
2024-02-07 17:42:14 +00:00
f6bf7d26e1 Print full exception info in Graph break log (#119292)
This is a little awkward, so I don't mind more thoughts on how best to do this.

Let's suppose that you have a graph break inside of an inlined function call. We are not actually going to print this graph break yet; instead, we are going to restart analysis so that we can run up until the inlined function call. When this happens, the only log message we ever get is the log to `graph_break` (seen here) reporting that a graph break has occurred.

In the current code, we don't print the fully formatted exception if you are only using `graph_breaks` logging. So the exception that induced the graph break has its traceback lost forever. For some classes of errors, esp., guard on data-dependent SymInt, this is quite bad.

With this change, we do print the traceback. On this sample program:

```
import torch
import torch._dynamo.config

torch._dynamo.config.capture_scalar_outputs = True

def g(x, y):
    y = x.item()
    if y < 3:
        return x + 2
    else:
        return x + 3

@torch.compile()
def f(x, y):
    y = y * y
    return g(x, y)

f(torch.tensor(4), torch.randn(4))
```

It looks like this:

```
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Graph break: Traceback (most recent call last):
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/tensor.py", line 878, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return guard_scalar(self.sym_num)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 414, in guard_scalar
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return guard_bool(a)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 663, in guard_bool
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return a.node.guard_bool("", 0)  # NB: uses Python backtrace
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/sym_node.py", line 366, in guard_bool
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     r = self.shape_env.evaluate_expr(self.expr, self.hint, fx_node=self.fx_node)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/recording.py", line 227, in wrapper
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return fn(*args, **kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3670, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     concrete_val = self.size_hint(orig_expr)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3403, in size_hint
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     raise self._make_data_dependent_error(result_expr, expr)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.)  The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3).  For more information, run with TORCH_LOGS="+dynamic".
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] During handling of the above exception, another exception occurred:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Traceback (most recent call last):
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return inner_fn(self, inst)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     self.call_function(fn, args, {})
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     self.push(fn.call_function(self, args, kwargs))
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/functions.py", line 279, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return super().call_function(tx, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/functions.py", line 87, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return tx.inline_user_function_return(
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 2262, in inline_call
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return cls.inline_call_(parent, func, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 2372, in inline_call_
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     tracer.run()
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     and self.step()
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     getattr(self, inst.opname)(inst)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 431, in inner
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     eval_result = value.evaluate_expr(self.output)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/tensor.py", line 880, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     raise UserError(  # noqa: TRY200
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] torch._dynamo.exc.UserError: Consider annotating your code using torch._constrain_as_*(). It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.)  The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3).  For more information, run with TORCH_LOGS="+dynamic".
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#constrain-as-size-example
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] From user code at:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/b.py", line 16, in f
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return g(x, y)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/b.py", line 8, in g
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     if y < 3:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
```

The end of the log at the restarted computation could maybe be improved too. Right now it looks like this:

```
[2024-02-06 10:32:24,338] [0/0_1] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION 2 [UserFunctionVariable(), LazyVariableTracker(), TensorVariable()]
[2024-02-06 10:32:24,338] [0/0_1] torch._dynamo.output_graph: [DEBUG] COMPILING GRAPH due to GraphCompileReason(reason='Consider annotating your code using torch._constrain_as_*(). It appears that you\'re trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.)  The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3).  For more information, run with TORCH_LOGS="+dynamic".\n\nFor more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#constrain-as-size-example', user_stack=[<FrameSummary file /data/users/ezyang/b/pytorch/b.py, line 16 in f>, <FrameSummary file /data/users/ezyang/b/pytorch/b.py, line 8 in g>], graph_break=True)
```

An alternative to doing it this way is that I could make symbolic shapes print a warning log when guarding on an unbacked SymInt itself, so we don't have to worry about Dynamo generating the backtrace well. If, for the most part, the backtrace for other graph breaks is irrelevant, then this would seem to be a more expedient solution.

PTAL and submit your opinions.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119292
Approved by: https://github.com/yanboliang
2024-02-07 17:20:31 +00:00
f79ae7599a [export] fakify module state in nonstrict (#119297)
Summary:
Previously, we were not fakifying module state explicitly in the nonstrict path.

This led to errors when modules were constructed under a fake mode, since the user-provided fake mode was clashing with the one that we had constructed internally to fakify the inputs.

This fixes things to use a single fake mode for everything.

As a side effect, this raised the question of how we ought to serialize state_dicts/constants that might be fake tensors. Naively calling torch.save understandably explodes, so this diff piggybacks on our infra for doing this on meta["val"]. Open to revising this; I have low confidence that it's the best way to do it.

Test Plan: unit tests

Differential Revision: D53484942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119297
Approved by: https://github.com/tugsbayasgalan
2024-02-07 17:12:22 +00:00
40ec155e58 [AOTI][refactor] Split common aoti_runtime utils into a separate header (#119066)
Summary: Split common utils from aoti_runtime/model.h into a separate header file, because when turning on ABI-compatible mode for JIT Inductor we won't need AOTInductorModel, but we do need some common utils, e.g. RAIIAtenTensorHandle.

Differential Revision: [D53478809](https://our.internmc.facebook.com/intern/diff/D53478809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119066
Approved by: https://github.com/khabinov
2024-02-07 16:54:00 +00:00
059994d2b7 Migrate load_state_dict hook tests to OptimizerInfo (#119310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119310
Approved by: https://github.com/albanD
ghstack dependencies: #119283, #119288, #119299, #119308
2024-02-07 16:00:01 +00:00
0320e62255 Migrate test_state_dict hooks to OptimizerInfo (#119308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119308
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283, #119288, #119299
2024-02-07 16:00:01 +00:00
5c46600f84 [RELAND] refactor lazy init to device-agnostic (#119248)
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and to change `maybe_initialize_cuda` to `maybe_initialize_device` so that lazy initialization still works for CUDA while remaining extensible to other backends.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.

# Additional Context
No additional UTs are needed.
This is a reland PR, the original PR is [refactor lazy init to device-agnostic](https://github.com/pytorch/pytorch/pull/118846).
This is a common PR, and does not trigger xpu ciflow.

Differential Revision: [D53478332](https://our.internmc.facebook.com/intern/diff/D53478332)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119248
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/atalman
2024-02-07 15:58:51 +00:00
3625ccfbea Move step global hooks test to OptimizerInfo (#119299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119299
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283, #119288
2024-02-07 15:50:31 +00:00
7b3762e6bc Move step pre/post hook tests to OptimizerInfo (#119288)
Note that this increases coverage from 1 config (vanilla SGD) to all the configs (13 optimizers at around 6-7 each). The test time seems fine though!

With the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b6093c03)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 13.680s

OK
```

Excluding the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 1.038s

OK
```

The old tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_pre_hook -k test_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..
----------------------------------------------------------------------
Ran 2 tests in 0.518s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119288
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283
2024-02-07 15:50:31 +00:00
99ddfaf572 Add symbol guard counts instrumentation (#119290)
This helps us understand if there are symbols which are extremely hot
(i.e., have a lot of guards mentioning them).  Extremely hot symbols are
candidates for being turned static.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119290
Approved by: https://github.com/bdhirsh
2024-02-07 14:35:14 +00:00
7c95cc5e03 Add basic reference documentation for symbolic_shapes.py (#118997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118997
Approved by: https://github.com/albanD
2024-02-07 14:33:42 +00:00
1435cfecfa Increase accumulate_grad_ gradient's expected refcount to account for pybind (#119068)
Account for the pybind binding of the op holding 1 ref when torch.ops.inductor.accumulate_grad_.default is called at runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119068
Approved by: https://github.com/jansel
ghstack dependencies: #118817, #119334
2024-02-07 10:25:43 +00:00
326dcf9dc8 Never reuse accumulated gradients' buffers (#119334)
Since accumulate grad may steal the gradient's `c10::Storage`, we can't reuse the op; otherwise the gradient will get overwritten. From benchmarks, using inductor's codegen'd _empty_strided_cpu/cuda and assigning to it has lower overhead than deep copying the gradient and reusing its buffer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119334
Approved by: https://github.com/jansel
ghstack dependencies: #118817
2024-02-07 10:25:42 +00:00
8e14e1d514 Fix gradient refcounts in pybind and compiled autograd (#118817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118817
Approved by: https://github.com/jansel
2024-02-07 10:25:42 +00:00
d85631b721 Revert "Fix deadlock in ExecutionTraceObserver (#119242)"
This reverts commit 6fc775ae13b675f8d02f7f85bc4348bba3ae3dd3.

Reverted https://github.com/pytorch/pytorch/pull/119242 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/119242#issuecomment-1931445631))
2024-02-07 07:37:22 +00:00
dfdbd73360 add Half support for flash attention (#119247)
Re-open for https://github.com/pytorch/pytorch/pull/118368.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119247
Approved by: https://github.com/drisspg, https://github.com/malfet
2024-02-07 05:57:41 +00:00
0f478d9d61 [Dynamo][15/N] Merge allow_in_graph/inline/skip trace rules check into trace_rule.lookup (#118971)
Finally we have this PR to merge allow_in_graph/inline/skip trace rules into ```trace_rules.lookup_inner```, where we can define and look up trace rules at both the function level and the file level. Going forward, this is the central place where we define and consult Dynamo trace rules for any function.
* ```trace_rules.lookup``` is the API that can return allow_in_graph, inline or skip.
* ```skipfiles.check``` is the API that can return inline or skip, since we have multiple places that only do an inline/skip check.
  *  I'll move ```skipfiles.check``` to ```trace_rules.check``` as one of the follow-ups.
* Both functions consult ```trace_rules.lookup_inner``` to get the tracing rule.

To avoid a single big PR, I left a few items as the follow-ups:
* Remove ```skipfiles.py``` and merge the code into ```trace_rules.py```.
* We do a double check in ```symbolic_convert.check_inlineable```; we will refactor and simplify it. We should only do the inline/skip check before generating ```SkipFilesVariable``` and ```UserFunctionVariable```.
* Rename ```SkipFilesVariable``` as ```SkipFunctionVariable```, since we only handle functions.
* The inline/skip reasons are not logged for some cases, since the new lookup framework doesn't always return inline/skip reasons. I'll refactor loggings to record the inline/skip reason in next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118971
Approved by: https://github.com/jansel
2024-02-07 05:15:39 +00:00
284b0b5f44 Add --local-ranks-filter to torchrun: allow logs filtering by rank (#118562)
Addresses issue https://github.com/pytorch/pytorch/issues/117383

The implementation exposes `--local-ranks-filter`, which filters by rank the files we pass to `TailLog` (used in torchrun to determine which logs to output to stdout/stderr).

## Behavior
### with --tee
Currently --tee is implemented as --redirect to a file whose contents are streamed to the console using `tail`. When --tee is specified, file logs are unaffected and we only filter the output to the console.

### with --redirect
When --redirect is specified without --tee, nothing is logged to console, so we no-op.

### with neither
When neither --tee nor --redirect is specified, torchrun uses the empty string "" to indicate logging to the console. We intercept this empty string and redirect it to "/dev/null" so nothing is printed to the console.

The api also allows a per-rank configuration for --tee and --redirect, and is also supported by this filter implementation.

## Usage
### without --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --local_rank_filter=0 t.py
hello from rank 0 python
DEBUG: TRACED GRAPH
 __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
-------------  ------  -----------------------  ---------  --------
placeholder    l_x_    L_x_                     ()         {}
call_function  mul     <built-in function mul>  (l_x_, 5)  {}
output         output  output                   ((mul,),)  {}
...
```
### with --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --tee 3 --local_rank_filter=0 t.py
[rank0]:hello from rank 0 python
[rank0]:DEBUG: TRACED GRAPH
[rank0]: __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
[rank0]:-------------  ------  -----------------------  ---------  --------
[rank0]:placeholder    l_x_    L_x_                     ()         {}
[rank0]:call_function  mul     <built-in function mul>  (l_x_, 5)  {}
[rank0]:output         output  output                   ((mul,),)  {}
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118562
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-02-07 04:29:54 +00:00
6c3600d008 Enable optional tensorList fallback to cpu. (#119273)
Add an optional TensorList fallback to CPU.
Add test cases; the old PR is: https://github.com/pytorch/pytorch/pull/106449

@bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119273
Approved by: https://github.com/bdhirsh
2024-02-07 03:54:13 +00:00
53ee47ca32 [vision hash update] update the pinned vision hash (#119337)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119337
Approved by: https://github.com/pytorchbot
2024-02-07 03:43:26 +00:00
ee1c2449f7 [dynamo] delete dynamo cache entry when guard function is invalidated [attempt 2] (#119107)
Attempt #2 for https://github.com/pytorch/pytorch/pull/117875 to fix https://github.com/pytorch/pytorch/issues/112090.

Summary of changes:
- ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119107
Approved by: https://github.com/jansel
2024-02-07 03:32:42 +00:00
fcc36de9d6 [ONNX][dynamo_export] Turn off opmath type promotion for div (#119112)
Skip opmath promotion for `_prims_common.ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT` as well.
Fixes https://github.com/pytorch/pytorch/issues/118941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119112
Approved by: https://github.com/thiagocrepaldi
2024-02-07 03:27:00 +00:00
45a79323fe Add torch.dtype instances to the public API (#119307)
Fixes #91908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119307
Approved by: https://github.com/albanD
2024-02-07 02:57:49 +00:00
8c2fde1fcf [EZ][BE] [CMake] Remove checks for GCC-7 (#119306)
As PyTorch now uses C++17 and needs gcc-9.4+ to compile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119306
Approved by: https://github.com/Skylion007
2024-02-07 01:24:01 +00:00
e9907a3446 [PyTorch] Free up 8 bytes per intrusive_ptr_target (#117986)
We don't need 64-bit reference and weak counts. (We also probably don't need a full 32 bits, but we'll deal with that later.)

Differential Revision: [D52851891](https://our.internmc.facebook.com/intern/diff/D52851891/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117986
Approved by: https://github.com/ezyang
2024-02-07 00:48:00 +00:00
5f2ad407a9 Fix typo on torch.frombuffer() documentation (#119214)
Fixes #114345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119214
Approved by: https://github.com/albanD
2024-02-07 00:41:51 +00:00
5ae6f6cffe Test seo torch cuda (#119324)
Testing if this will help improve SEO of this page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119324
Approved by: https://github.com/albanD
2024-02-07 00:39:51 +00:00
728228a7c7 LazyGraphModule: improve the fix for the FakeTensorMode mismatch issue (#119311)
The previous fix https://github.com/pytorch/pytorch/pull/118981 misses some corner cases. It works when both LazyGraphModule and compiled-autograd are enabled, but it fails with a FakeTensorMode mismatch error again if LazyGraphModule+CompiledAutograd+DynamicShape are all enabled. Note that disabling any of the three does not trigger the issue.

The reason why enabling DynamicShape causes the previous fix to stop working is that we call the bw_compiler here before running the backward pass if there are symints saved for backward: 73f0fdea5b/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L382)

The bw_compiler may cause an extra GraphModule recompilation on the bw_module, which causes its forward method to become the lazy one again. The fix is simply to delay applying the previous fix until after the potential extra call to the bw_compiler.

Repro on hf_Whisper:
```
CUDA_VISIBLE_DEVICES=1 time benchmarks/dynamo/torchbench.py -dcuda --training --backend=inductor --disable-cudagraphs --accuracy --only hf_Whisper --repeat 1 --compiled-autograd  --dynamic-batch-only
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119311
Approved by: https://github.com/xmfan, https://github.com/jansel
2024-02-07 00:35:39 +00:00
e868a7fedd [AOTI] Rename config.aot_inductor.abi_compatible (#119065)
Summary: Rename config.aot_inductor.abi_compatible to config.abi_compatible, since the cpp_wrapper mode in JIT Inductor will share the same flag.

Differential Revision: [D53478752](https://our.internmc.facebook.com/intern/diff/D53478752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119065
Approved by: https://github.com/khabinov
2024-02-07 00:14:33 +00:00
c814d8e5c2 Fix handling random() calls encountered inside inlined code. (#119218)
Fix https://github.com/pytorch/pytorch/issues/118787

In the compiled function, calls to random() are replaced with a single call
to a function that generates all the random variables.
The random calls encountered during compilation used to be tracked inside a variable
stored inside the instruction translator. When there are nested translators, the tracked
calls used to get lost when the inner instruction translator popped out.

This diff fixes that by moving the tracked calls to the output graph, which is shared across translators that are generating the same function.

More details about the issue and why this solution is picked are in the github issue above.
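
A repro-style sketch of the situation being fixed (the exact shape of the original failing case is an assumption; see the linked issue for the real repro):
```python
import random
import torch

def helper(x):
    # random() encountered while this function is inlined by Dynamo
    return x * random.random()

@torch.compile(backend="eager", fullgraph=True)
def f(x):
    return helper(x) + helper(x)

print(f(torch.ones(2)))
```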

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119218
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-02-06 23:48:21 +00:00
5e78c4b0f4 [dynamo] Functools partial reconstruct (#118583)
Replaces #117721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118583
Approved by: https://github.com/yanboliang
ghstack dependencies: #118901, #118616
2024-02-06 23:42:43 +00:00
62cc1053d8 [dynamo] Fix missing guards in FunctoolsPartialVariable (#118616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118616
Approved by: https://github.com/yanboliang
ghstack dependencies: #118901
2024-02-06 23:42:43 +00:00
6fc775ae13 Fix deadlock in ExecutionTraceObserver (#119242)
Summary:
With the compiled PyTorch module, in execution_trace_observer.cpp, function convertIValue calls TensorImpl->storage_offset(). That function call will trigger a recursive call into recordOperatorStart. It will cause a deadlock on ob.g_mutex.

This diff fixes the deadlock by replacing std::mutex with std::recursive_mutex.

Since PyTorch only has one thread for FWD and one thread for BWD, contention is very low and the performance impact should NOT be a concern.

Test Plan:
Unit Test
    buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2

Differential Revision: D53299183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119242
Approved by: https://github.com/aaronenyeshi
2024-02-06 23:36:22 +00:00
d0ca849fdf Refactor Symint Deduping to separate pass (#118938)
Previously Symint Deduping was done during proxy tracing which made it more difficult to reason about. This refactors the deduping to a separate pass.

We only dedupe symints which are resolvable from input symint nodes so as to avoid inducing a dependency on the backward in the forward.

Potential fix for: https://github.com/pytorch/pytorch/issues/118224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118938
Approved by: https://github.com/ezyang
2024-02-06 23:07:31 +00:00
dea15c9fdc Revert "Add meta registration for _foreach_norm (#118604)"
This reverts commit b8bb12cd454b716da6a98db826fcc45fd7c0db05.

Reverted https://github.com/pytorch/pytorch/pull/118604 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/118604#issuecomment-1930849491))
2024-02-06 22:20:44 +00:00
6c1cca153e [quant][pt2e] Allow users to override train/eval behavior (#119091)
Summary: This commit adds a util for PT2E quantization users
to call `model.train()` and `model.eval()` without error.
Instead, these will automatically call the equivalent
`move_exported_model_to_train/eval` for the user, which only
switch behavior for special ops like dropout and batchnorm.
This enables users to onboard to the PT2E flow more easily.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_allow_exported_model_train_eval

Reviewers: jerryzh168, tugsbayasgalan, zhxchen17

Subscribers: jerryzh168, tugsbayasgalan, zhxchen17, supriyar

Differential Revision: [D53426636](https://our.internmc.facebook.com/intern/diff/D53426636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119091
Approved by: https://github.com/jerryzh168, https://github.com/tugsbayasgalan, https://github.com/zhxchen17
2024-02-06 22:19:58 +00:00
9d46fe603d Revert "[c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)"
This reverts commit 4ab852b6c558a0b8e9fea0c863c782fe65f00be0.

Reverted https://github.com/pytorch/pytorch/pull/119099 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/119099#issuecomment-1930839754))
2024-02-06 22:14:36 +00:00
0f68bcaa5c Make filename optional in update_failures.py (#119289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119289
Approved by: https://github.com/zou3519
2024-02-06 21:56:09 +00:00
422b4271ae Change PrivateUse1's resize_bytes to PrivateUse1HooksInterface (#117839)
Reopen from https://github.com/pytorch/pytorch/pull/117211
Modify the logic for entering the registration branch so that existing UTs are not affected.
Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117839
Approved by: https://github.com/albanD
2024-02-06 20:51:56 +00:00
ae4e866bba [dynamo] refactor CacheEntry and ExtraState to eval_frame.c to C++ (#118438)
Part of implementing CacheEntry invalidation to fix https://github.com/pytorch/pytorch/issues/112090.

Changes:
- Move CacheEntry and ExtraState to C++
- Use pybind to control reference counting
- Use std::list instead of manually implementing a linked list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118438
Approved by: https://github.com/jansel
2024-02-06 20:48:11 +00:00
73f0fdea5b [fix] accounting for dilation in pool padding assertion (#118897)
Fixes https://github.com/pytorch/pytorch/issues/7541

It is a copy of https://github.com/pytorch/pytorch/pull/111427; I failed to fix all of its issues in time, and it got closed.
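
A hedged illustration of the kind of case the corrected assertion is meant to allow (the specific numbers are assumptions based on the linked issue, not taken from this PR):
```python
import torch
import torch.nn as nn

# With dilation=2 the effective kernel extent is dilation * (kernel_size - 1) + 1 = 5,
# so padding=2 stays within "at most half the (effective) kernel size".
pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=2, dilation=2)
print(pool(torch.randn(1, 1, 10, 10)).shape)
```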

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118897
Approved by: https://github.com/mikaylagawarecki
2024-02-06 20:32:58 +00:00
ec31d11580 [dynamo] Skip dynamo when inside a functorch context (#118901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118901
Approved by: https://github.com/zou3519
2024-02-06 20:22:24 +00:00
f3645fc38b Update auto_functionalize docs (#119228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119228
Approved by: https://github.com/zou3519
2024-02-06 19:50:54 +00:00
f85b0ea8bb Migrate last lbfgs test over to OptimizerInfo (#119283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119283
Approved by: https://github.com/Skylion007, https://github.com/mikaylagawarecki
2024-02-06 19:49:05 +00:00
3f0fd36835 Introduce size oblivious guards (#118579)
Fixes https://github.com/pytorch/pytorch/issues/117361

The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one.

This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds.

The infra pieces of this PR are:

* Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv
* When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`.
* Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way.

The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises.

As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.)

When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a guard on unbacked SymInt problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them where ever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!)
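
A small sketch of what the Python-side size-oblivious guard looks like in framework-style code (the helper name follows the API this PR describes; treat the exact call as illustrative):
```python
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

def squeeze_last_if_singleton(t: torch.Tensor) -> torch.Tensor:
    # Wrapping the comparison makes the branch decidable even when t.shape[-1] is an
    # unbacked size-like SymInt: size-oblivious analysis assumes it is >= 2, so the
    # condition statically evaluates to False instead of raising a data-dependent error.
    if guard_size_oblivious(t.shape[-1] == 1):
        return t.squeeze(-1)
    return t

# In plain eager mode the shapes are concrete and the wrapper just returns the bool.
print(squeeze_last_if_singleton(torch.ones(3, 1)).shape)  # torch.Size([3])
```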

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579
Approved by: https://github.com/eellison, https://github.com/lezcano
2024-02-06 19:45:32 +00:00
5410385c42 [dynamo] support comparing stream with constant (#119199)
Before the PR, we have a graph break for:
```python
def f():
    if torch.cuda.current_stream() is not None:
        return torch.randn(2, 2)
torch.compile(f, backend="eager", fullgraph=True)()
```
This PR supports comparison ops of StreamVariable and ConstantVariable by returning a constant.

It's safe to return a constant in this case because the StreamVariable is guarded by ID_MATCH when created.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119199
Approved by: https://github.com/yifuwang, https://github.com/anijain2305, https://github.com/jansel
2024-02-06 19:26:03 +00:00
fa157af69c [mypy] declare type for DynamoTestCase._exit_stack (#119084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119084
Approved by: https://github.com/Skylion007
2024-02-06 18:26:07 +00:00
238d87f74d Add a short code snippet in the RNN doc (#119150)
Fixes #109443;
also removes a duplicated comment line `# Efficient implementation equivalent to the following:` in the scaled_dot_product_attention doc.

@mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119150
Approved by: https://github.com/malfet
2024-02-06 17:41:51 +00:00
169c070076 Move catch_errors_wrapper to convert_frame (#119253)
With this change, we now have the invariant that eval_frame only
contains "hot" functions that are called at runtime, as opposed to
cold functions which are only called at compile time.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119253
Approved by: https://github.com/yanboliang
ghstack dependencies: #119251
2024-02-06 17:40:07 +00:00
790858afa9 Make start compiling stack trace omit framework frames (#119251)
Fixes https://github.com/pytorch/pytorch/issues/119238

Here's what it looks like now:

```
$ TORCH_LOGS=+torch._dynamo.convert_frame python a.py
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] torchdynamo start compiling f /data/users/ezyang/b/pytorch/a.py:3, stack (elided 5 frames):
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]   File "/data/users/ezyang/b/pytorch/a.py", line 7, in <module>
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]     f(torch.randn(2))
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]     return fn(*args, **kwargs)
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]
$ cat a.py
import torch

@torch.compile
def f(x):
    return x * 2

f(torch.randn(2))
```

The eval_frame frame is intentionally present, since what happens is you run the torch.compile wrapper, and then you actually hit the user frame to be compiled.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119251
Approved by: https://github.com/yanboliang, https://github.com/mlazos
2024-02-06 17:40:07 +00:00
22669843c2 Reserve sizes in c10::VaryingShape::concrete_sizes(), c10::TensorType::computeStrideProps() (#119189)
Summary: Costly reallocs.

Test Plan: CI

Reviewed By: efiks

Differential Revision: D53264908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119189
Approved by: https://github.com/Skylion007
2024-02-06 17:13:37 +00:00
8ee9f26ce8 [Dynamo] Remove build_checkpoint_variable from call_getattr (#119236)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119236
Approved by: https://github.com/jansel
2024-02-06 16:59:40 +00:00
2ad3599a71 Add torch.backends.mha.get_fastpath_enabled to FUNC_INLINELIST (#118979)
Summary: Add torch.backends.mha.get_fastpath_enabled to FUNC_INLINELIST

Test Plan: See the one in D53154041
Reviewed By: yjhao, yanboliang, Yuzhen11

Differential Revision: D53154041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118979
Approved by: https://github.com/yanboliang
2024-02-06 16:25:33 +00:00
a77be631e0 Bugfix to MixtureSameFamily's _pad_mixture_dimension (#118947)
Fixes Issue #73792

This is a duplicate of pull request #73864. It's a small bugfix that should have happened a long time ago, but it didn't because I didn't actually follow up with the pull request after originally submitting. That's my bad. Trying to remedy the error.

This contains a fix to _pad_mixture_dimension, which intends to count the number of dimensions in its referent tensors, but accidentally counts the number of elements (and can thus end up creating tensors with potentially thousands of dimensions by mistake). Also contains a single test for the fixed behavior.
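
For context, a minimal mixture that exercises the padding logic (illustrative usage only):
```python
import torch
from torch.distributions import Categorical, MixtureSameFamily, Normal

# A 5-component 1D Gaussian mixture; _pad_mixture_dimension should pad by the number
# of *dimensions* in the relevant tensors, not by their number of elements.
mix = Categorical(torch.ones(5))
comp = Normal(torch.randn(5), torch.rand(5) + 0.1)
gmm = MixtureSameFamily(mix, comp)
print(gmm.sample((3,)).shape)            # torch.Size([3])
print(gmm.log_prob(torch.zeros(3)).shape)  # torch.Size([3])
```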

Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118947
Approved by: https://github.com/soulitzer
2024-02-06 16:24:22 +00:00
499040ac32 Revert "Add FakeTensor support to torch._utils._rebuild_tensor (#108186)"
This reverts commit 426339e4de2efc0cbd501e2bff947ba890ec9817.

Reverted https://github.com/pytorch/pytorch/pull/108186 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/108186#issuecomment-1929978008))
2024-02-06 15:04:48 +00:00
1e4b408b02 [decomp] Add tests for different dtypes to SDPA decomposition (#119239)
Summary: As titled. Skipping torch.bfloat16 because for some reason the
difference is 0.01.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119239
Approved by: https://github.com/drisspg
2024-02-06 11:17:07 +00:00
85033759d6 Update scatter_reduce_ test with parallel backend check (#118708)
**Summary**
Follow-up of https://github.com/pytorch/pytorch/pull/118278, in which the newly added UT `test_scatter_using_atomic_add` failed with the `native parallel backend`, as reported in https://github.com/pytorch/pytorch/issues/118518.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118708
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano
2024-02-06 09:43:40 +00:00
7d7a3f0b37 [inductor] Support sympy.expr in user-defined Triton kernel grid fn (#119165)
## Problem

A user-defined Triton kernel grid may use a sympy magic method like `Max`. This comes in the form of a `sympy.Expr`, namely a `sympy.core.function.FunctionClass`.

Handling this is not trivial since `user_defined_kernel_grid_fn_code` is used in Eager & Inductor. Eager usage below.

## Approach

Pass in the wrapper when Inductor codegens the grid with ints/sympy.Expr, so we can utilize wrapper functions such as `codegen_shape_tuple()`.

Differential Revision: D53367012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119165
Approved by: https://github.com/aakhundov
2024-02-06 08:39:55 +00:00
8a8e70477e Fix type hints on nn.attention.sdpa_kernel (#119140)
Fixes #119133
Altered the type hint and assert to include SDPBackend; disallowed None in the assert.
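
After the change, passing a single `SDPBackend` (rather than a list) should type-check; a short sketch:
```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(2, 4, 8, 16)
# A single backend and a list of backends are both accepted by the updated hint.
with sdpa_kernel(SDPBackend.MATH):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
print(out.shape)
```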

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119140
Approved by: https://github.com/mikaylagawarecki, https://github.com/cpuhrsch, https://github.com/drisspg
2024-02-06 07:33:22 +00:00
720f781160 [CPU] Optimize softmax as flash attention v2 (#118957)
### Descriptions
Following flash attention v2, optimize softmax by moving the division by the sum out of the KV inner loop.
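
A pure-Python sketch of the idea (not the C++ kernel): keep unnormalized accumulators across KV blocks and divide by the softmax sum once after the loop.
```python
import torch

torch.manual_seed(0)
q = torch.randn(8)                    # one query row, head_dim = 8
k_blocks = torch.randn(4, 16, 8)      # keys split into 4 blocks of 16
v_blocks = torch.randn(4, 16, 8)      # values split the same way

m = torch.tensor(float("-inf"))       # running max of scores
s = torch.tensor(0.0)                 # running softmax denominator
acc = torch.zeros(8)                  # unnormalized output accumulator
for kb, vb in zip(k_blocks, v_blocks):
    scores = kb @ q                   # (16,)
    m_new = torch.maximum(m, scores.max())
    correction = torch.exp(m - m_new) # rescale the previous accumulators
    p = torch.exp(scores - m_new)
    s = s * correction + p.sum()
    acc = acc * correction + p @ vb
    m = m_new
out = acc / s                         # single division after the loop (flash-attention-v2 style)

# Matches the straightforward softmax(q @ K^T) @ V computed in one shot.
k_full, v_full = k_blocks.reshape(-1, 8), v_blocks.reshape(-1, 8)
ref = torch.softmax(k_full @ q, dim=0) @ v_full
torch.testing.assert_close(out, ref, rtol=1e-4, atol=1e-5)
```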

### Performance
Stable Diffusion V2.1 on GNR

| Version | Kernel time (s) | Speedup |
|---------|----------------|----------------|
| BF16 Before | 28.67 | - |
| BF16 After | 23.55 | 17.86% |
| FP32 Before | 54.20 | - |
| FP32 After | 49.47 | 8.73% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118957
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-02-06 07:06:36 +00:00
4ab852b6c5 [c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119099
Approved by: https://github.com/wconstab
2024-02-06 06:59:47 +00:00
884b6d2a67 [inductor] Implementing missing magic methods on IR values. (#118933)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118933
Approved by: https://github.com/peterbell10
2024-02-06 05:50:26 +00:00
e47f571da7 Revert "Update scatter_reduce_ test with parallel backend check (#118708)"
This reverts commit d670dfb7ae0a88cf010455301eb1d0ef91950f1a.

Reverted https://github.com/pytorch/pytorch/pull/118708 on behalf of https://github.com/leslie-fang-intel due to Test Case still fail ([comment](https://github.com/pytorch/pytorch/pull/118708#issuecomment-1928767568))
2024-02-06 04:37:08 +00:00
12ac3ba383 [executorch hash update] update the pinned executorch hash (#118936)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118936
Approved by: https://github.com/pytorchbot
2024-02-06 03:41:33 +00:00
3497388b9f [export] Fix serialization for auto_functionalization (#118810)
- Added support for serializing the auto_functionalization op, which
  required adding the functions `serialize_arbitrary_inputs` and
  `serialize_arbitrary_outputs` which will serialize the inputs/outputs
  without needing a schema, since HOOs do not have a schema.
- Added support for serializing user input mutations
- Added support for serializing operator inputs. They just get turned
  into strings.

Differential Revision: [D53331039](https://our.internmc.facebook.com/intern/diff/D53331039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118810
Approved by: https://github.com/suo
2024-02-06 03:41:05 +00:00
03db96c248 [Dynamo] Enhance autograd.Function strict mode test (#119237)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119237
Approved by: https://github.com/zou3519
2024-02-06 02:54:19 +00:00
074f2bb5ce Fix dynamo benchmark runner for torchbench skip sets (#118615)
Fix the dynamo benchmark runner for torchbench skip sets, which was introduced by PR #118032.

This runner.py script is still used in the regular [Inductor CPU Performance Dashboard](https://github.com/pytorch/pytorch/issues/93531) test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118615
Approved by: https://github.com/jgong5, https://github.com/ysiraichi, https://github.com/ezyang
2024-02-06 02:06:54 +00:00
9250965f8b [ez] Lower windows timeout limit for trunk, set test step timeout (#119234)
Lower the Windows timeout to be the same as Linux.

Add a test step timeout for Windows (see the Linux version and the details for why at https://github.com/pytorch/pytorch/pull/93084).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119234
Approved by: https://github.com/huydhn
2024-02-06 01:54:31 +00:00
86d5d1650b [dynamo] support dict.clear() (#119197)
For code like the following:
```python
import torch
def f():
    a = {"a": torch.randn(2, 2)}
    a.clear()
    return a
torch.compile(f, backend="eager", fullgraph=True)()
```

Before the PR, we have a graph break:
```
torch._dynamo.exc.Unsupported: call_method ConstDictVariable() clear [] {}
```

Test Plan:
Added new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119197
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-02-06 01:17:55 +00:00
c0164f2393 Revert "[BE] Add dtypesIfMPS to ModuleInfo enabling float16 tests for MPS and remove all skipIfMPS for float64 (#119039)"
This reverts commit 04d52d5399ad4abb8af9e8405be79e2a7f8b4c7a.

Reverted https://github.com/pytorch/pytorch/pull/119039 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing MPS test in trunk 04d52d5399,  may be a landrace ([comment](https://github.com/pytorch/pytorch/pull/119039#issuecomment-1928595240))
2024-02-06 01:13:28 +00:00
3829b55416 [inductor] Support ProxyExecutor argument codegen for sympy.Expr (#119166)
Differential Revision: D53398312

## Problem
Currently, if a sympy expression that uses a magic method like `Max` is passed as an argument to ProxyExecutor, then C++ compilation will fail. We need to use the std::max method instead.

```
# What we see
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{Max(1025, u1)}.data(), ...);

# What we want
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{std::max(1025L, u1)}.data(), ...)
```

## Approach
Use C++ wrapper's expression printer to handle this conversion
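
As a rough illustration of what such an expression printer does (a minimal sketch with a made-up class name, not Inductor's actual C++ wrapper printer):

```python
# A toy sympy expression printer that maps Max to std::max when emitting
# C++-style text; the real printer handles many more operations.
import sympy
from sympy.printing.str import StrPrinter

class ToyCppExprPrinter(StrPrinter):
    def _print_Max(self, expr):
        args = ", ".join(self._print(a) for a in expr.args)
        return f"std::max({args})"

u1 = sympy.Symbol("u1", integer=True)
print(ToyCppExprPrinter().doprint(sympy.Max(1025, u1)))  # e.g. std::max(1025, u1)
```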

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119166
Approved by: https://github.com/aakhundov
2024-02-06 00:33:25 +00:00
781f7c9080 [BE] Use OptimizerInfo step_requires_closure, only_supports_sparse_grads (#119230)
So I had planned ahead of time to use these but forgot to actually use them when migrating tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119230
Approved by: https://github.com/albanD
2024-02-06 00:13:43 +00:00
69344fe987 c10d: Don't add NCCL backend by default without CUDA (#119149)
The NCCL backend requires CUDA (including devices) to be available, so don't add that backend by default when that isn't the case, to avoid the following error when creating a CPU-only device mesh:
> RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
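
A minimal sketch of the intended behavior (the helper below is hypothetical, not the actual c10d code path):

```python
import torch

# Only advertise NCCL as part of the default backend when CUDA devices are
# actually available; otherwise stay CPU-only with gloo.
def pick_default_backend() -> str:
    if torch.cuda.is_available() and torch.cuda.device_count() > 0:
        return "cpu:gloo,cuda:nccl"
    return "gloo"
```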

Fixes #117746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119149
Approved by: https://github.com/kwen2501
2024-02-05 23:55:07 +00:00
fd0bf96c2b [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work with each other initially.

Thanks Jason for suggesting a neat way to integrate these two. cpp-wrapper does two codegen passes right now. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. The multi-kernel Python code is not generated for the second pass since it should not be needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-05 23:35:41 +00:00
04d52d5399 [BE] Add dtypesIfMPS to ModuleInfo enabling float16 tests for MPS and remove all skipIfMPS for float64 (#119039)
Right now, `ModuleInfo.dtypes` defaults to `torch.testing._internal.common_dtype.floating_types()`, almost no ModuleInfos override this (so only `float32` and `float64` are tested).

This is the first step to clean up/improve dtype testing for `ModuleInfos` and fix #116626.

Follow-up PRs will update `dtypes=` (and perhaps `dtypesIf{Device}`, if it makes sense) for each `ModuleInfo`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119039
Approved by: https://github.com/janeyx99
2024-02-05 23:19:01 +00:00
d9d8c2b79f Remove HSDP validation check (#112435)
Currently, HSDP validates that all intra/inter node PGs are the same. This makes sense if you are only using HSDP with no other forms of parallelism and is a nice but not necessary sanity check.

However, if you want to mix HSDP with other forms, say tensor parallelism on the FFN of a transformer block, the intra/inter node PGs will be different for that layer. This check raises errors in this scenario, so we need to remove this assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112435
Approved by: https://github.com/wz337, https://github.com/Skylion007
2024-02-05 22:27:53 +00:00
966db82c9d Revert "Remove extra graph breaks (#118987)"
This reverts commit 9a8e3b07d75e3e9bb902f81b4b6e1042bbe06b58.

Reverted https://github.com/pytorch/pytorch/pull/118987 on behalf of https://github.com/eellison due to reverting because it causes regression ([comment](https://github.com/pytorch/pytorch/pull/118987#issuecomment-1928224447))
2024-02-05 22:19:37 +00:00
b8bb12cd45 Add meta registration for _foreach_norm (#118604)
This PR also fixes the discrepancy between _foreach_norm fast path and slow path, where storage_offsets will be different between the lists of tensors. Here are some profile results showing that we aren't significantly slower. Do note that we're replacing N `as_strided`/`select` calls with N `empty` calls.

For script:
```
import torch

ts = [torch.rand(32, 16, device="cuda") for _ in range(128)]

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    res = torch._foreach_norm(ts)
print(p.key_averages().table(sort_by="cpu_time_total"))
```

OG baseline:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7cf98987)]$ python playground2.py
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                    aten::_foreach_norm        25.36%       4.209ms        99.94%      16.586ms      16.586ms       8.000us        88.89%       9.000us       9.000us             1
                                       cudaLaunchKernel        61.21%      10.159ms        61.21%      10.159ms       2.540ms       0.000us         0.00%       0.000us       0.000us             4
                                            aten::zeros         0.43%      71.000us        58.35%       9.683ms       9.683ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::zero_         0.33%      55.000us        57.35%       9.517ms       9.517ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::fill_         0.42%      69.000us        57.01%       9.462ms       9.462ms       1.000us        11.11%       1.000us       1.000us             1
                                           aten::select         8.04%       1.335ms        11.29%       1.873ms      14.633us       0.000us         0.00%       0.000us       0.000us           128
                                       aten::as_strided         3.24%     538.000us         3.24%     538.000us       4.203us       0.000us         0.00%       0.000us       0.000us           128
                                            aten::empty         0.90%     150.000us         0.90%     150.000us      75.000us       0.000us         0.00%       0.000us       0.000us             2
                                  cudaDeviceSynchronize         0.06%      10.000us         0.06%      10.000us      10.000us       0.000us         0.00%       0.000us       0.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        11.11%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us        66.67%       6.000us       3.000us             2
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        22.22%       2.000us       2.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 16.596ms
Self CUDA time total: 9.000us
```

And here's after this PR:
```
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                    aten::_foreach_norm        30.95%       4.653ms        99.95%      15.026ms      15.026ms       9.000us        90.00%      10.000us      10.000us             1
                                       cudaLaunchKernel        52.41%       7.879ms        52.41%       7.879ms       1.970ms       0.000us         0.00%       0.000us       0.000us             4
                                            aten::zeros         0.39%      58.000us        48.29%       7.260ms       7.260ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::zero_         0.35%      53.000us        47.25%       7.103ms       7.103ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::fill_         0.43%      65.000us        46.90%       7.050ms       7.050ms       1.000us        10.00%       1.000us       1.000us             1
                                            aten::empty        15.42%       2.318ms        15.42%       2.318ms      17.969us       0.000us         0.00%       0.000us       0.000us           129
                                  cudaDeviceSynchronize         0.05%       7.000us         0.05%       7.000us       7.000us       0.000us         0.00%       0.000us       0.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        10.00%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us        60.00%       6.000us       3.000us             2
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us        30.00%       3.000us       3.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 15.033ms
Self CUDA time total: 10.000us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118604
Approved by: https://github.com/albanD
2024-02-05 22:01:01 +00:00
51e096114b Increase recommended logging in DEFAULT_LOGGING (#119207)
For long running batch jobs, it is best to opt for logs that are too
spammy rather than not spammy enough.  This lines up DEFAULT_LOGGING
with our current internal guidance at Meta.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119207
Approved by: https://github.com/bdhirsh
2024-02-05 21:59:10 +00:00
5086e1cf3f Remove distributed/c10d/Functional.hpp (#119138)
This file is useless and was accidentally checked in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119138
Approved by: https://github.com/Skylion007
2024-02-05 21:58:08 +00:00
200108c6e6 Delete old branches (#117079)
Example https://github.com/pytorch/pytorch/actions/runs/7562281351/job/20592425611?pr=117079 (The code to delete branches isn't being run, it's just listing the branches it wants to delete)

Internal code: https://fburl.com/code/hdvvbfkj

The threshold for a branch with a PR is 30 days, regardless of whether or not the PR is merged (compared to 3 days if merged and 30 days if closed).  The threshold for a branch without a PR is 1.5 years (same internally).

Threshold of ~400 queries to github so it doesn't hit token usage limits.  Currently this leads to about 350 branches deleted per run.

Only query for the last 90 days of updated PRs to reduce token usage, so if a branch has a PR but it was updated 90+ days ago, it will think it doesn't have a PR and will wait for the 1.5 years branch update check instead, regardless of whether the PR is open or closed.

I tested that it could delete my own branch and it worked.

labeled with test-config/crossref because I just want the smallest test config possible to reduce CI usage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117079
Approved by: https://github.com/malfet
2024-02-05 20:50:05 +00:00
b816760a2f More progress on type checking ValueRanges (#118870)
Type checking Python is a pain. Here are my learnings:

* The types for heavily polymorphic code are going to be verbose; there is no way around it. I originally was hoping I could lean on polymorphism with a bounded TypeVar to compactly write signatures for many of the ValueRanges methods, but I ran into some unworkaroundable mypy bugs. Writing out all the types explicitly and using `@overload` liberally works pretty well, so I recommend people do that instead of trying to do fancy things.
* Sympy is missing annotations for assumptions, because they are all metaprogrammed. I don't really relish maintaining a typeshed for sympy, so I wrote a small mypy plugin to add them in.
* GADT-style refinement is... just not a good idea in practice. Mypy easily gets confused about whether a return value from a refined section is allowed for the outer return type. So many of these have been replaced with less informative implementation types and more informative external types via overloads. Hopefully this is good for use sites.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118870
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-02-05 20:29:25 +00:00
b92819a039 Move nn.Module.load_state_dict tests from test_nn.py to separate file (#118028)
Move these tests out so that in https://github.com/pytorch/pytorch/pull/117913 we can run these tests with both `torch.nn.utils.set_swap_module_params_on_conversion({True/False})`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118028
Approved by: https://github.com/albanD
2024-02-05 20:17:28 +00:00
71655bccbe Fix wrong mobile build Docker image (#119213)
It turns out that the Docker image name hasn't been updated yet and still refers to a non-existing name. Maybe we could update `calculate-docker-image` to fail in this case if there is a way to distinguish a non-existing-name failure from a missing-tag failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119213
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/malfet
2024-02-05 19:48:10 +00:00
962fca6839 [storage][perf] Reduce _get_device_from_module overhead. (#119144)
Using `rsplit` with maxsplit=1 is more efficient since it 1) stops traversal as soon as the first `.` from the right side is encountered and 2) creates a list of no more than two elements.

This change also reuses `last_part` to avoid unnecessarily repeating the split.
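
A minimal sketch of the pattern (the helper name is made up for illustration, not the actual `_get_device_from_module` code):

```python
# rsplit with maxsplit=1 stops at the first "." from the right and returns
# at most two parts, so the last attribute name can be reused directly.
def last_attr(qualified_name: str) -> str:
    *_, last_part = qualified_name.rsplit(".", maxsplit=1)
    return last_part

assert last_attr("encoder.layers.0.weight") == "weight"
assert last_attr("weight") == "weight"
```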
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119144
Approved by: https://github.com/Skylion007, https://github.com/mikaylagawarecki
2024-02-05 19:33:18 +00:00
b964a1222c Revert "[inductor] make multi-kernel work with cpp-wrapper (#117813)"
This reverts commit c24ffc3f66b2270dfc65a404687b91b55ed580e9.

Reverted https://github.com/pytorch/pytorch/pull/117813 on behalf of https://github.com/atalman due to Failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/117813#issuecomment-1927877102))
2024-02-05 19:25:39 +00:00
b2e0f8d82d [mypy] added type annotations to codegen_nodes methods (#119080)
added correct type annotations to scheduler and backends'
codegen_nodes methods

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119080
Approved by: https://github.com/eellison
2024-02-05 18:33:52 +00:00
88e346680b Patch all_gather to support HSDP + TP (#118638)
Update all_gather to support HSDP + TP.

Currently, the `_all_gather_dtensor` function for dtensors only replaces the first dimension with replicate (the FSDP dimension) and does not touch the second dimension (which is assumed to be the TP dimension). With HSDP, we have two dimensions ahead of the TP dimension as opposed to one. This PR updates the function to replace all other dimensions with replicate before running the all-gather, as sketched below.
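
A rough sketch of the placement change (assumed helper name and simplified logic, not the exact `_all_gather_dtensor` implementation):

```python
from torch.distributed._tensor import Replicate

# Replace every mesh dimension except the last (assumed TP) one with
# Replicate so both HSDP dimensions get all-gathered, not just the first.
def all_gather_placements(placements):
    return tuple(Replicate() for _ in placements[:-1]) + (placements[-1],)
```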

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118638
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wz337
2024-02-05 18:29:23 +00:00
f481835115 Revert "add Half support for flash attention on CPU (#118368)" (#119204)
This reverts commit a5a63db3bf937a6eff993d1222fab18cc63f9cb2.

Fixes #ISSUE_NUMBER

Reverts #118368

Got reverted internally, but the branch got deleted so the automation didn't work

Mildly edited stack trace
```

...
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "torch/_dynamo/eval_frame.py", line 453, in _fn
    return fn(*args, **kwargs)
  File "torch/_dynamo/external_utils.py", line 25, in inner
    return fn(*args, **kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 635, in dispatch_trace
    graph = tracer.trace(root, concrete_args)
  File "torch/fx/experimental/proxy_tensor.py", line 995, in trace
    res = super().trace(root, concrete_args)
  File "torch/_dynamo/eval_frame.py", line 453, in _fn
    return fn(*args, **kwargs)
  File "torch/_dynamo/external_utils.py", line 25, in inner
    return fn(*args, **kwargs)
  File "torch/fx/_symbolic_trace.py", line 793, in trace
    (self.create_arg(fn(*args)),),
  File "torch/fx/experimental/proxy_tensor.py", line 665, in wrapped
    out = f(*tensors)
  File "<string>", line 1, in <lambda>
  File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 357, in _functionalized_f_helper
    f_outs = fn(*f_args)
  File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 68, in inner_fn
    outs = fn(*args)
  File "torch/_functorch/_aot_autograd/utils.py", line 161, in flat_fn
    tree_out = fn(*args, **kwargs)
  File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 618, in functional_call
    out = PropagateUnbackedSymInts(mod).run(
  File "torch/fx/interpreter.py", line 145, in run
    self.env[node] = self.run_node(node)
  File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 593, in run_node
    result = super().run_node(n)
  File "torch/fx/interpreter.py", line 202, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
  File "torch/fx/interpreter.py", line 274, in call_function
    return target(*args, **kwargs)
  File "torch/_ops.py", line 571, in __call__
    return self_._op(*args, **kwargs)
  File "torch/_subclasses/functional_tensor.py", line 380, in __torch_dispatch__
    outs_unwrapped = func._op_dk(
  File "torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 744, in __torch_dispatch__
    return self.inner_torch_dispatch(func, types, args, kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 779, in inner_torch_dispatch
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 423, in proxy_call
    r = maybe_handle_decomp(proxy_mode, func, args, kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 1225, in maybe_handle_decomp
    return CURRENT_DECOMPOSITION_TABLE[op](*args, **kwargs)
  File "torch/_decomp/decompositions.py", line 4322, in scaled_dot_product_flash_attention_for_cpu
    torch._check(
  File "torch/__init__.py", line 1133, in _check
    _check_with(RuntimeError, cond, message)
  File "torch/__init__.py", line 1116, in _check_with
    raise error_type(message_evaluated)
RuntimeError: query must be FP32, FP64, BF16 but got torch.float16

While executing %_scaled_dot_product_flash_attention_for_cpu : [num_users=1] = call_function[target=torch.ops.aten._scaled_dot_product_flash_attention_for_cpu.default](args = (%l_q_, %l_k_, %l_v_), kwargs = {attn_mask: %l_attn_mask_})
Original traceback:
  File "executorch/backends/xnnpack/partition/graphs/sdpa.py", line 34, in forward
    return torch.nn.functional.scaled_dot_product_attention(
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119204
Approved by: https://github.com/kit1980
2024-02-05 18:24:53 +00:00
ab613a4019 Revert "refactor lazy init to device-agnostic (#118846)"
This reverts commit 520771d7b35034c96c5b4604ecf8960e6aab856f.

Reverted https://github.com/pytorch/pytorch/pull/118846 on behalf of https://github.com/atalman due to Failing, tests https://github.com/pytorch/torchdistx/blob/main/src/python/torchdistx/_C/fake.cc#L11  ([comment](https://github.com/pytorch/pytorch/pull/118846#issuecomment-1927651305))
2024-02-05 18:06:30 +00:00
124a54ef16 [jit][perf] Reduce lookupInModule overhead. (#119145)
It's inefficient to split the remaining parts of the module name by '.' just to join them back again. Instead it's more idiomatic and efficient to use `maxsplit=1` to ensure that all remaining parts stay intact. This improves the best-case time and space complexity since the scan can terminate at the first encountered `.` and only two parts are returned in a list.
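
For illustration only (hypothetical helper, not the actual `lookupInModule` code), the same idea from the left side:

```python
# split with maxsplit=1 separates the head attribute and leaves the rest of
# the dotted path intact for the next lookup step.
def split_head(qualified_name: str):
    head, *rest = qualified_name.split(".", maxsplit=1)
    return head, rest[0] if rest else ""

assert split_head("layers.0.weight") == ("layers", "0.weight")
assert split_head("weight") == ("weight", "")
```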

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119145
Approved by: https://github.com/Skylion007
2024-02-05 18:01:00 +00:00
fa8d97776c [aotinductor] Migrate fuse_split_linear_add from dper_pass to AOTI based on predispatch IR (#118983)
Summary: As titled. Added support for fuse_split_linear_add in pre-grad passes based on the predispatch IR.

Test Plan: TORCH_LOGS=inductor,aot   buck2 run  mode/opt mode/inplace caffe2/test/inductor/fb:test_split_cat_fx_passes_aten_fb

Differential Revision: D53302168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118983
Approved by: https://github.com/kflu, https://github.com/chenyang78
2024-02-05 17:58:42 +00:00
5f9f771711 [DeviceMesh][Test] Remove test_raises_mesh_dim_less_than_2 (#119172)
The test is no longer applicable after we allow 1D slice from 1D mesh. https://github.com/pytorch/pytorch/pull/118895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119172
Approved by: https://github.com/awgu, https://github.com/atalman
2024-02-05 17:34:51 +00:00
d444a3b443 [MPS] fix float32 error on mps, in linalg.matrix_rank and linalg.pinv (#114771)
Fixes #114285

(However, we still get a NotImplementedError:
```NotImplementedError: The operator 'aten::_linalg_svd.U' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.```)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114771
Approved by: https://github.com/lezcano
2024-02-05 15:36:55 +00:00
a72190fd51 make nanogpt work with both compiled autograd and _LazyGraphModule (#118981)
@xmfan and @fegin reported that _LazyGraphModule ( https://github.com/pytorch/pytorch/pull/117911 ) makes nanogpt training fail with compiled autograd.

We have a repro:  ``` python benchmarks/dynamo/torchbench.py --training --backend=inductor --disable-cudagraphs --accuracy --only nanogpt --repeat 1 --compiled-autograd ```
but it's still mysterious how to trigger the issue with a toy model.

The error message for the failure is https://gist.github.com/shunting314/6402a6388b3539956090b6bc098952fb . In compile_fx we will call `detect_fake_mode`. This function will look for an active FakeTensorMode from both the TracingContext and the example inputs. The error is triggered because we find different FakeTensorModes from these two sources.

Although I don't know what really causes the discrepancy of FakeTensorMode above, the fix here is to force _LazyGraphModule recompilation if we have compiled autograd enabled. This does not hurt compilation time most of the time because we will call the graph module here anyway in the backward pass when compiled autograd is enabled: 855d5f144e/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L705)

Let me know if we can have a better fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118981
Approved by: https://github.com/jansel
2024-02-05 10:40:06 +00:00
d670dfb7ae Update scatter_reduce_ test with parallel backend check (#118708)
**Summary**
Follow-up of https://github.com/pytorch/pytorch/pull/118278, in which the newly added UT `test_scatter_using_atomic_add` failed with the `native parallel backend` as reported in https://github.com/pytorch/pytorch/issues/118518.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118708
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-02-05 08:48:45 +00:00
0348975a87 Set up new logging artifact for SymNode (#119158)
Fixes #113876

Hi, I updated various logging configs and the SymNode module to use the new dedicated logging artifact. This is my first pytorch PR, mirrored my changes off of https://github.com/pytorch/pytorch/pull/111808.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119158
Approved by: https://github.com/ezyang
2024-02-05 07:34:54 +00:00
0245000be8 [DeviceMesh] Temporarily disable re-use subgroup (#118940)
Summary:
The subgroup-reuse logic is causing GLOO to time out on two internal modelstore tests (relevant tests in the test plan).
We are temporarily disabling subgroup reuse while root-causing, to allow the internal tests to run again, as they are currently omitted as shown in T176426987.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118940
Approved by: https://github.com/wanchaol
2024-02-05 06:30:00 +00:00
0c3a1c893e [dynamo] Setup the globals for guard_fn without a reference to f_locals (#118447)
UPDATE - I changed the PR because, from a discussion with @jansel, it was clear that someone else was holding on to a reference to f_locals. This PR now solves that problem first. I removed the eval_frame.c part because it was failing tests that use `exec` or `eval` with a weird error like `no no locals found when storing 'math'`. I would debug that in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118447
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #118975, #118420
2024-02-05 05:39:39 +00:00
b8307513e5 [torchelastic][rendezvous] Add option to enable libuv for TCPStore based rendezvous backend (#118944)
Summary:
Expose an option to enable libuv in the TCPStore-based rendezvous backend, which will allow better scaling.

Libuv support was added recently and allows scaling to more than 2K nodes.

Test Plan: Unit tests

Differential Revision: D53335860

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118944
Approved by: https://github.com/wconstab
2024-02-04 23:11:32 +00:00
5ebed6f1c3 [torch] fix comment typo (#118656)
Summary: as title

Differential Revision: D49841787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118656
Approved by: https://github.com/Skylion007, https://github.com/zhxchen17
2024-02-04 22:20:09 +00:00
0d5f53a2f9 fix forward test_memory_planning.py (#119109)
Summary: fixes a broken test, also makes it run in fbcode correctly

Test Plan: test

Reviewed By: angelayi

Differential Revision: D53373709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119109
Approved by: https://github.com/angelayi
2024-02-04 21:45:07 +00:00
052e824467 improve CUDACachingAllocator lock contention (#118550)
Summary: NativeCachingAllocator has a global lock which shows lock contention when one process uses multiple GPUs. The lock is required to look up a Block from a pointer. We can make the lock more fine-grained to reduce the lock contention.
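
A toy Python illustration of the fine-grained locking idea (the real change is in the C++ allocator; the names and sharding scheme here are made up):

```python
import threading

NUM_SHARDS = 64  # one lock per shard instead of a single global lock
_locks = [threading.Lock() for _ in range(NUM_SHARDS)]
_blocks = [{} for _ in range(NUM_SHARDS)]

def _shard(ptr: int) -> int:
    return ptr % NUM_SHARDS

def insert_block(ptr: int, block) -> None:
    i = _shard(ptr)
    with _locks[i]:
        _blocks[i][ptr] = block

def lookup_block(ptr: int):
    i = _shard(ptr)
    with _locks[i]:
        return _blocks[i].get(ptr)
```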

Test Plan: existing unittests, verified on prod models using eight GPUs showing double digits improvements

Differential Revision: D52493091

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118550
Approved by: https://github.com/albanD
2024-02-04 16:45:25 +00:00
b41f3e8df1 [AOTI] Make abi_compatible as default for OSS CI (#119126)
Summary: Introduce an environment variable AOT_INDUCTOR_ABI_COMPATIBLE to control the ABI-compatible mode, and turn it on for OSS CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119126
Approved by: https://github.com/chenyang78
ghstack dependencies: #119125
2024-02-04 15:48:58 +00:00
79b20aec76 [AOTI] Support copy_, _fft_c2c and view_as_real in C shim (#119125)
Summary: These ops exist in GoogleFnet. Also add a Complex fallback for convert_element_type. After this PR, we can enable ABI-compatible for AOTInductor OSS CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119125
Approved by: https://github.com/chenyang78
2024-02-04 15:48:58 +00:00
cee16353db [Dynamo][autograd.Function] Should graph break on stride accesses in backward (#119137)
Fixes #118399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119137
Approved by: https://github.com/oulgen
2024-02-04 09:08:45 +00:00
8f82a44a5b Run device mesh tests with native funcol enabled (#118437)
### Summary

Run the relevant tests in `test/distributed/_tensor/test_dtensor_compile.py` and `test/distributed/test_device_mesh.py` with native funcol enabled, in addition to with them being disabled.

All tests excepts `test_tp_compile_comm_reordering` pass. This is expected because the native funcols have slightly different IRs, so the reordering pass needs to be adjusted. This test is disabled for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118437
Approved by: https://github.com/LucasLLC
ghstack dependencies: #118910, #118911
2024-02-04 04:11:11 +00:00
e3371ff739 Use correct type of indices in ForeachUtils.h (#119116)
Fix a type mismatch detected by MSVC:
```
C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Tools\MSVC\14.39.33519\include\xutility(255): warning C4267: 'initializing': conversion from 'size_t' to '_Ty', possible loss of data
        with
        [
            _Ty=int
        ]
C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Tools\MSVC\14.39.33519\include\xutility(255): note: the template instantiation context (the oldest one) is
pytorch/aten/src\ATen/native/ForeachUtils.h(363): note: see reference to function template instantiation '_Ty &std::vector<_Ty,std::allocator<_Ty>>::emplace_back<const I&>(const I &)' being compiled
        with
        [
            _Ty=int,
            I=size_t
        ]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119116
Approved by: https://github.com/Skylion007
2024-02-04 04:03:54 +00:00
6620176da7 Add documentation for meta device (#119119)
Fixes https://github.com/pytorch/pytorch/issues/119098

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119119
Approved by: https://github.com/bdhirsh
2024-02-04 01:05:22 +00:00
dab16b6b8e s/supress/suppress/ (#119132)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119132
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-02-04 00:54:14 +00:00
abc09b27b9 Some minor type stub improvements (#118529)
I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete", but in particular I wanted to get feedback on whether people liked making ValueRanges Generic; it seems that distinguishing whether you have an Expr ValueRange or a SympyBoolean ValueRange is a lot of trouble for downstream code. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRange[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529
Approved by: https://github.com/Skylion007
2024-02-04 00:19:00 +00:00
3ed9df36a9 Clean up some obsolete TODOs in run_test and several test files (#119113)
* The TODOs in `test/test_nestedtensor.py` have been mitigated; I keep the issue for reference.
* ~~The TODOs in `test/test_ops_fwd_gradients.py` don't apply anymore~~
* The TODOs in `run_test.py` to support disabling C++ tests are probably not going to happen.  I have never seen a flaky C++ test that needed to be disabled before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119113
Approved by: https://github.com/kit1980
2024-02-03 23:54:30 +00:00
26a2743162 Fix placeholder tensor is empty for relu in mps (#118965)
Fixes #118845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118965
Approved by: https://github.com/malfet
2024-02-03 23:50:35 +00:00
0ddcb5c3ca Include the documentation on scale arg being a keyword only arg (#119129)
Fixes #117240
@drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119129
Approved by: https://github.com/drisspg
2024-02-03 23:41:06 +00:00
ffae20e594 [BE][MPS] Add dictionaryFromPlaceholders (#119077)
These are convenience methods that create a dictionary from placeholders, making the code more compact.
Also added an overloaded `runMPSGraph` function that takes a Placeholder instead of an output dictionary, as the majority of the operators have just one output.
A typical change looks as follows:
```patch
-    NSDictionary<MPSGraphTensor*, MPSGraphTensorData*>* feeds = @{
-      selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(),
-    };
-    NSDictionary<MPSGraphTensor*, MPSGraphTensorData*>* results =
-        @{outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData()};
-    runMPSGraph(stream, cachedGraph->graph(), feeds, results);
+    auto feeds = dictionaryFromPlaceholders(selfPlaceholder);
+    runMPSGraph(stream, cachedGraph->graph(), feeds, outputPlaceholder);
   }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119077
Approved by: https://github.com/kit1980, https://github.com/albanD
2024-02-03 22:07:02 +00:00
2d64fddd48 [dtensor] add op support for nll_loss_forward (#118917)
This is part of the work to support cross entropy in dtensor.

This PR doesn't support nll_loss computation with input sharded on the channel dimension yet. In that case, redistribution to Replicate is needed in sharding propagation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118917
Approved by: https://github.com/wanchaol
2024-02-03 20:08:10 +00:00
4c397e6ec6 [Dynamo] Add correct guards for tracable tensor subclasses (#119110)
Fixes #118896
```
(pt) [ybliang@devgpu002.ash8 ~/local/pytorch (subclass)]$ TORCH_LOGS="+guards" python test/dynamo/test_subclasses.py -k test_torch_dispatch_subclass_guard_recompile
/home/ybliang/local/miniconda3/envs/pt/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
[2024-02-02 16:43:02,186] [0/0] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2024-02-02 16:43:02,186] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['w'], 110557008)                           # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['w'].a, '_dynamo_dynamic_indices') == False         # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['w'].b, '_dynamo_dynamic_indices') == False         # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] utils_device.CURRENT_DEVICE == None                           # _dynamo/output_graph.py:388 in init_ambient_guards
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_current_backend(139704947520224)                     # _dynamo/output_graph.py:394 in init_ambient_guards
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] check_tensor(L['w'].a, Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[2, 2], stride=[2, 1])  # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] check_tensor(L['w'].b, Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[2, 2], stride=[2, 1])  # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,206] [0/1] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2024-02-02 16:43:02,207] [0/1] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['w'], '_dynamo_dynamic_indices') == False           # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,207] [0/1] torch._dynamo.guards.__guards: [DEBUG] utils_device.CURRENT_DEVICE == None                           # _dynamo/output_graph.py:388 in init_ambient_guards
[2024-02-02 16:43:02,207] [0/1] torch._dynamo.guards.__guards: [DEBUG] ___check_current_backend(139704947520224)                     # _dynamo/output_graph.py:394 in init_ambient_guards
[2024-02-02 16:43:02,207] [0/1] torch._dynamo.guards.__guards: [DEBUG] check_tensor(L['w'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[2, 2], stride=[2, 1])  # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119110
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh, https://github.com/yoyoyocmu
2024-02-03 18:12:51 +00:00
7a52455102 [dynamo] Refactor TensorVariable method handling (#119111)
This should slightly improve compile times and be easier to maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119111
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-02-03 17:18:19 +00:00
fcf22a853d Enable test_ellipsis_index_2 with Torch dynamo (#118773)
Fix issue #118819

test_ellipsis_index_2 is specifically testing properties of torch._numpy.array()
and that a field tensor is being added, hence the imports are overridden.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118773
Approved by: https://github.com/anijain2305, https://github.com/lezcano
2024-02-03 10:33:48 +00:00
1adedc3c86 [decomp] Remove pixel_shuffle from core aten decomps (#118921)
pixel_shuffle is a core aten op
(https://pytorch.org/docs/main/torch.compiler_ir.html#core-aten-ir) so we should not decompose it.

https://github.com/pytorch/pytorch/pull/118239 added a decomp for it, which is causing an internal test failure
(https://www.internalfb.com/intern/test/281475090561210/) that cases on the pixel_shuffle operator.
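
For reference, a quick shape check of what `pixel_shuffle` does (standard PyTorch behavior, independent of the removed decomp):

```python
import torch
import torch.nn.functional as F

# pixel_shuffle rearranges (N, C*r^2, H, W) -> (N, C, H*r, W*r) with r=upscale_factor.
x = torch.randn(1, 8, 3, 3)
y = F.pixel_shuffle(x, upscale_factor=2)
assert y.shape == (1, 2, 6, 6)
```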

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118921
Approved by: https://github.com/SherlockNoMad, https://github.com/lezcano
2024-02-03 08:21:32 +00:00
4dc53f777b Fix dynamo failure w/ astype (#117952)
The torch "fake" ndarray had some mismatches vs numpy.ndarray which caused test_sparse_to_sparse_compressed to fail under dynamo.

This also fixes (because the test now hits it) a problem where unpacking a sequence with the incorrect number of args would assert in dynamo instead of graph breaking (because it would throw an exception). Added a unit test for this condition.

Fixed:
- torch._numpy._ndarray.astype() (actually used by the test)
- torch._numpy._ndarray.put() (drive-by discovery)
- torch._numpy._ndarray.view() (drive-by discovery)

(burndown item 7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117952
Approved by: https://github.com/yanboliang
ghstack dependencies: #117951
2024-02-03 08:10:15 +00:00
c6c851102f Fix test_compressed_layout_conversions_coverage to check BSC format (#117951)
test_compressed_layout_conversions_coverage verifies torch's conversions between different memory layouts using numpy as a reference. Since numpy doesn't support the BSC format, the test just skipped it. Instead, fake it by using a transposed BSR format.
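
A small sketch of the faking trick, under the assumption that scipy provides the numpy-side sparse reference (the values are just an example):

```python
import numpy as np
import scipy.sparse as sp

# scipy has no BSC class, so build the reference from the transposed matrix
# in BSR form and transpose it back; that is layout-equivalent to BSC.
dense = np.arange(16.0).reshape(4, 4)
bsc_like = sp.bsr_matrix(dense.T, blocksize=(2, 2)).T
assert np.array_equal(bsc_like.toarray(), dense)
```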

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117951
Approved by: https://github.com/zou3519
2024-02-03 08:10:15 +00:00
6c8faf4680 [executorch] Run llama in xplat (#118831)
Summary:
Error running llama in xplat, where the half type isn't part of c10_mobile targets. See: D53158320

This diff:
- Creates a `torch_mobile_all_ops_et` target, which is the same as `torch_mobile_all_ops`, except with a preprocessor flag (C10_MOBILE_HALF) to support Half type
- Check C10_MOBILE_HALF in LinearAlgebra.cpp and include it
- Use `torch_mobile_all_ops_et` for executorch, instead of `torch_mobile_all_ops`.

Considerations:
- Using `torch_mobile_all_ops_et` across executorch means that our runtime binary size for xplat aten increases (see test plan for increase amount, thanks tarun292 for the pointer). This may be okay, as aten mode isn't used in production.

Test Plan:
Run language llama in xplat:
```
buck2 run xplat/executorch/examples/models/llama2:main_aten -- --model_path llama-models/very_new_checkpoint_h.pte --tokenizer_path llama-models/flores200sacrebleuspm.bin --prompt 'fr Hello' --eos
```
And in fbcode:
```
buck2 run fbcode//executorch/examples/models/llama2:main_aten -- --model_path llama-models/very_new_checkpoint_h.pte --tokenizer_path llama-models/flores200sacrebleuspm.bin --prompt 'fr Hello' --eos
```

Test executor_runner size increase with:
```
buck2 build fbcode//executorch/sdk/fb/runners:executor_runner_aten
```
||original|this diff (+half dtype)|diff|
|---|---|---|---|
|unstripped|214975784|214976472|+688|
|stripped|71373488|71373808|+320|

Differential Revision: D53292674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118831
Approved by: https://github.com/larryliu0820
2024-02-03 08:07:19 +00:00
a64b03a58e Move lr tensor to cuda if needed (#119073)
Fixes https://github.com/pytorch/pytorch/issues/119026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119073
Approved by: https://github.com/eellison
2024-02-03 07:34:33 +00:00
41b63b26c2 [dynamo] Fix incorrect docstring placements in _guards.py. (#119114)
This made them unavailable when using `help()` and other tools that access them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119114
Approved by: https://github.com/kit1980
2024-02-03 06:25:54 +00:00
9a8e3b07d7 Remove extra graph breaks (#118987)
Fixes https://github.com/pytorch/pytorch/issues/104053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118987
Approved by: https://github.com/janeyx99
2024-02-03 05:55:09 +00:00
ce40ee8ecd [FSDP] Fixed device_mesh and auto wrap (#119064)
If the user passes `device_mesh`, then we should not forward the process groups to the children during auto wrap and instead just rely on the `device_mesh` argument. This should fix https://github.com/pytorch/pytorch/issues/118906.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119064
Approved by: https://github.com/wz337
2024-02-03 03:57:29 +00:00
18fc1ca7d9 [MPS][BE] Add native lerp support (#119036)
By implementing `out = self + weight * (end - self)` as an MPS graph.

LERP is tested by `test_output_match_lerp_cpu_float[32|16]` based on OpInfo and 10+ tests from `test_optim.py`
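
A quick numerical check of the identity used for the graph (standard `torch.lerp` semantics):

```python
import torch

a, b, w = torch.randn(4), torch.randn(4), 0.3
# lerp(self, end, weight) == self + weight * (end - self)
assert torch.allclose(torch.lerp(a, b, w), a + w * (b - a))
```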
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119036
Approved by: https://github.com/albanD
2024-02-03 02:58:50 +00:00
30d3ff1659 Inline gradcheck functions since they don't have C bindings (#119047)
Gradcheck functions are in python, so they shouldn't be in `torch_c_binding_in_graph_functions`
fixes #118792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119047
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2024-02-03 02:46:39 +00:00
372e9550bd ProcessGroupGloo::reduce_scatter_tensor_coalesced (#118911)
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues, but I'd rather just fix the coverage.

### This PR
We already have a fallback impl for `_reduce_scatter_base`, which is composed from all-reduce + scatter. The scatter was not necessary: it introduced extra communication and a sync point, and forced the impl to fail on `asyncOp=True`. This PR does the following:
- Simulate reduce-scatter with `allreduce(inp).chunk(world_size)[rank]` (see the sketch after this list). This is still 2x the communication of a real reduce-scatter (since all-reduce = reduce-scatter + all-gather), but it's strictly better than what we have now.
- By doing the above, the comm becomes async and we don't have to fail on `asyncOp=True`.
- The general logic is implemented in `reduce_scatter_tensor_coalesced`; `_reduce_scatter_base` just calls it with a single input/output.
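
A minimal sketch of the emulation in Python (assumed helper name; the actual change lives in ProcessGroupGloo's C++ implementation and requires an initialized process group):

```python
import torch
import torch.distributed as dist

def emulated_reduce_scatter(inp: torch.Tensor) -> torch.Tensor:
    # all-reduce so every rank holds the full sum...
    dist.all_reduce(inp)
    # ...then keep only this rank's chunk, matching a reduce-scatter output.
    world_size, rank = dist.get_world_size(), dist.get_rank()
    return inp.chunk(world_size)[rank].clone()
```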

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118911
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118910
2024-02-03 02:42:47 +00:00
65314a6129 [c10d] add an unit test for unordered destruction of PGs (#119045)
Summary:
We suspected ncclCommsAbort was hung due to an NCCL 2.17 'bug'
triggered by different ranks calling the destructors of different PGs in
different orders. This can be reproed in an NCCL-level test for 2.17.
We need a test case in c10d to constantly check whether PGs can be destructed
in different orders.
Test Plan:
Run the test and verify that the printed destruction orders are as expected:
```
$ python test/distributed/test_c10d_nccl.py
ProcessGroupNCCLTest.test_close_multi_pg_unordered
NCCL version 2.19.3+cuda12.0
[rank0]:[W ProcessGroupNCCL.cpp:1128] [PG 2 Rank 0] ProcessGroupNCCL
destructor entered.
[rank0]:[W ProcessGroupNCCL.cpp:1147] [PG 2 Rank 0] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank1]:[W ProcessGroupNCCL.cpp:1128] [PG 1 Rank 1] ProcessGroupNCCL
destructor entered.
[rank1]:[W ProcessGroupNCCL.cpp:1147] [PG 1 Rank 1] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank0]:[W ProcessGroupNCCL.cpp:1151] [PG 2 Rank 0] ProcessGroupNCCL
abort finished.
[rank0]:[W ProcessGroupNCCL.cpp:1128] [PG 1 Rank 0] ProcessGroupNCCL
destructor entered.
[rank0]:[W ProcessGroupNCCL.cpp:1147] [PG 1 Rank 0] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank1]:[W ProcessGroupNCCL.cpp:1151] [PG 1 Rank 1] ProcessGroupNCCL
abort finished.
[rank1]:[W ProcessGroupNCCL.cpp:1128] [PG 2 Rank 1] ProcessGroupNCCL
destructor entered.
[rank1]:[W ProcessGroupNCCL.cpp:1147] [PG 2 Rank 1] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank0]:[W ProcessGroupNCCL.cpp:1151] [PG 1 Rank 0] ProcessGroupNCCL
abort finished.
[rank1]:[W ProcessGroupNCCL.cpp:1151] [PG 2 Rank 1] ProcessGroupNCCL
abort finished.
.
----------------------------------------------------------------------
Ran 1 test in 18.969s
OK

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119045
Approved by: https://github.com/yifuwang
2024-02-03 02:37:12 +00:00
857508fa36 Change the internal assert to torch_check in torch::nn::functional::InterpolateFuncOptions (#117831)
Fixes #117333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117831
Approved by: https://github.com/malfet
2024-02-03 02:15:11 +00:00
9ffed22391 Document file format returned by torch.save (#118719)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118719
Approved by: https://github.com/albanD
2024-02-03 02:11:44 +00:00
2eba82d122 [dynamo] decrease logging level for graph break in higher order op. (#119079)
Fixes https://github.com/pytorch/pytorch/issues/119059.

This hides both logs behind TORCH_LOGS=dynamo. Just logging the exception doesn't seem very informative, so I just put both under log.info(). For the example in the issue, the log now looks like:
```
(pytorch-3.10) ~/local/pytorch$ python test.py
(pytorch-3.10) ~/local/pytorch$
```
```
(pytorch-3.10) ~/local/pytorch$ python test.py
(pytorch-3.10) ~/local/pytorch$ TORCH_LOGS=dynamo python test.py
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing linear /home/yidi/local/pytorch/test.py:267
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO] Stack (most recent call last):
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/test.py", line 272, in <module>
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     y = linear(x)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 615, in catch_errors
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return callback(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 748, in _convert_frame
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     result = inner_convert(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return _compile(
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return func(*args, **kwds)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 650, in _compile
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     guarded_code = compile_inner(code, one_graph, hooks, transform)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     r = func(*args, **kwargs)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in compile_inner
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     out_code = transform_code_object(code, transform)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     transformations(instructions, code_options)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 155, in _fn
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 478, in transform
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     tracer = InstructionTranslator(
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2032, in __init__
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     _step_logger()(
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/logging.py", line 55, in log
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     logger.log(level, "Step %s: %s", step, msg, **kwargs)
[2024-02-02 14:08:19,001] [0/0] torch.fx.experimental.symbolic_shapes: [INFO] create_env
[2024-02-02 14:08:19,016] [0/0] torch._dynamo.variables.higher_order_ops: [INFO] speculate_subgraph: while introspecting autograd.Function, we were unable to trace function `backward` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[2024-02-02 14:08:19,016] [0/0] torch._dynamo.variables.higher_order_ops: [INFO] call_method GetAttrVariable(AutogradFunctionContextVariable(Function), needs_input_grad) __getitem__ (ConstantVariable(int),) {}
[2024-02-02 14:08:19,017] [0/0] torch._dynamo.convert_frame: [INFO] Restarting analysis due to _dynamo/symbolic_convert.py:141 in fail_and_restart_analysis
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing linear /home/yidi/local/pytorch/test.py:267
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO] Stack (most recent call last):
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/test.py", line 272, in <module>
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     y = linear(x)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 615, in catch_errors
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return callback(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 748, in _convert_frame
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     result = inner_convert(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return _compile(
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return func(*args, **kwds)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 650, in _compile
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     guarded_code = compile_inner(code, one_graph, hooks, transform)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     r = func(*args, **kwargs)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in compile_inner
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     out_code = transform_code_object(code, transform)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     transformations(instructions, code_options)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 155, in _fn
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 478, in transform
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     tracer = InstructionTranslator(
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2032, in __init__
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     _step_logger()(
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/logging.py", line 55, in log
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     logger.log(level, "Step %s: %s", step, msg, **kwargs)
[2024-02-02 14:08:19,017] [0/0_1] torch.fx.experimental.symbolic_shapes: [INFO] create_env
[2024-02-02 14:08:19,021] [0/0_1] torch.fx.experimental.symbolic_shapes: [INFO] produce_guards
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing forward /home/yidi/local/pytorch/test.py:257
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO] Stack (most recent call last):
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/test.py", line 272, in <module>
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     y = linear(x)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/test.py", line 268, in linear
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return UseNeedsInputGradFunction.apply(x)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/autograd/function.py", line 572, in apply
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return super().apply(*args, **kwargs)  # type: ignore[misc]
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 615, in catch_errors
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return callback(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 748, in _convert_frame
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     result = inner_convert(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return _compile(
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return func(*args, **kwds)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 650, in _compile
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     guarded_code = compile_inner(code, one_graph, hooks, transform)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     r = func(*args, **kwargs)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in compile_inner
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     out_code = transform_code_object(code, transform)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     transformations(instructions, code_options)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 155, in _fn
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 478, in transform
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     tracer = InstructionTranslator(
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2032, in __init__
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     _step_logger()(
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/logging.py", line 55, in log
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     logger.log(level, "Step %s: %s", step, msg, **kwargs)
[2024-02-02 14:08:19,025] [1/0] torch.fx.experimental.symbolic_shapes: [INFO] create_env
[2024-02-02 14:08:19,097] torch._dynamo.utils: [INFO] TorchDynamo compilation metrics:
[2024-02-02 14:08:19,097] torch._dynamo.utils: [INFO] Function, Runtimes (s)
[2024-02-02 14:08:19,097] torch._dynamo.utils: [INFO] _compile.<locals>.compile_inner, 0.0283
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119079
Approved by: https://github.com/zou3519
2024-02-03 02:10:13 +00:00
d91d21fd6f [submodule kineto] Enable profiler connection to daemon during init for cpu only jobs (#118320)
Fixes #112389 and https://github.com/facebookincubator/dynolog/issues/208

This PR enables profiler initialization for CPU-only use cases. The main goal is to enable on-demand profiling with a daemon when using the CPU-only mode of PyTorch.
* When CUDA is available, the profiler is initialized on first CUDA stream creation (or lazily when the profiler is run).
* Since the CUDA stream creation callback does not exist in CPU-only PyTorch, the profiler is never initialized on its own.
* Thus the job does not register with Dynolog even when the "KINETO_USE_DAEMON" env variable is set.

Part of the fix is in Kineto (https://github.com/pytorch/kineto/pull/861), which we now point to in PyTorch.
The change in PyTorch is to correctly set the `cpuOnly` argument.

## TestPlan:

Build PyTorch from source with USE_CUDA=0 so we have a CPU-only build. Git hash = `a40951defd87b9a5e582cf9112bf7a8bd0930c79`
(See instructions in PyTorch repo)

For the setup we run dynolog daemon in another terminal
```
buck2 run dynolog/src:dynolog  -- --enable_ipc_monitor &
```

Now run an example model in PyTorch - see [linear_model.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py) , and set the device to 'cpu' inside the code instead of 'cuda'.
```
export KINETO_USE_DAEMON=1
python linear_model_example.py
```
Output shows the profiler registration with dynolog
```
(pytorch) [bcoutinho@devgpu038.ftw6 ~/local/pytorch (main)]$ python linear_model_example.py
INFO:2024-01-25 11:08:53 1807792:1807792 init.cpp:122] Registering daemon config loader, cpuOnly =  1
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-01-25 11:08:53 1807792:1807792 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0dc36b8a-e14c-4260-958b-4b2e7d15e986 status = initialized
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
```

We can also collect a trace using
```
[bcoutinho@devgpu038.ftw6 ~/fbsource/fbcode (3bc85f968)]$ buck2 run dynolog/cli:dyno -- gputrace --log-file /tmp/test.json
Kineto config =
ACTIVITIES_LOG_FILE=/tmp/test.json
PROFILE_START_TIME=0
ACTIVITIES_DURATION_MSECS=500
PROFILE_REPORT_INPUT_SHAPES=false
PROFILE_PROFILE_MEMORY=false
PROFILE_WITH_STACK=false
PROFILE_WITH_FLOPS=false
PROFILE_WITH_MODULES=false
response length = 147
response = {"activityProfilersBusy":0,"activityProfilersTriggered":[1807792],"eventProfilersBusy":0,"eventProfilersTriggered":[],"processesMatched":[1807792]}
Matched 1 processes
Trace output files will be written to:
    /tmp/test_1807792.json
```
And trace file contains the trace correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118320
Approved by: https://github.com/aaronenyeshi
2024-02-03 01:40:56 +00:00
494c2ec054 [DCP][BE] Let FsspecWriter and FsspecReader inherit from FileSystemWriter and FileSystemReader (#118887)
No logic is changed. However, this PR dramatically reduces the effort to maintain filesystem-like storage backends. As we are going to enable fsspec, this is a must-do BE item.

Differential Revision: [D53318044](https://our.internmc.facebook.com/intern/diff/D53318044/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118887
Approved by: https://github.com/wz337
2024-02-03 01:14:13 +00:00
6b009aceea Enable scaled_mm on sm89 devices (#118881)
Fixes #118703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118881
Approved by: https://github.com/malfet
2024-02-03 00:44:03 +00:00
440b7d5279 [auto_functionalize] Remove mutated_args_name from args (#119050)
`auto_functionalize` currently takes a custom op, a list of mutated argument names, and inputs to the custom op as kwargs. The list of mutated argument names is computed from the schema, and gets created when we're tracing. However, it seems that having the list of mutated argument names is a little unnecessary since we can always recompute it from the schema during runtime.

This also prevents the case where users might incorrectly modify the inputs to this operator, as we will now just recompute the list at runtime. This probably won't affect things too much because inductor will decompose auto_functionalize.
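
As a rough illustration of what "recompute from the schema" looks like, here is a minimal sketch (the helper name `mutated_arg_names` is made up for illustration; this is not the actual auto_functionalize implementation) that derives the mutated argument names from an operator's schema:

```python
import torch

def mutated_arg_names(op):
    # Collect the names of arguments the schema marks as mutated
    # (alias_info with is_write=True); this is the kind of information
    # that can be recomputed at runtime instead of being passed in.
    return [
        arg.name
        for arg in op._schema.arguments
        if arg.alias_info is not None and arg.alias_info.is_write
    ]

# For an in-place aten op, the mutated argument is `self`.
print(mutated_arg_names(torch.ops.aten.add_.Tensor))  # ['self']
```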

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119050
Approved by: https://github.com/zou3519
2024-02-03 00:27:14 +00:00
3aeaa21eb0 Revert "Remove parent device mesh check (#118620)"
This reverts commit 3f1f057adfcd4cef67fff9605a894cb075c02881.

Reverted https://github.com/pytorch/pytorch/pull/118620 on behalf of https://github.com/atalman due to broke periodic linux-focal-cuda11.8-py3.9-gcc9 ([comment](https://github.com/pytorch/pytorch/pull/118620#issuecomment-1924933878))
2024-02-03 00:22:56 +00:00
de6a906093 Expose aggressive_recomputation as an inductor config (#118943)
Summary:
As title.

We found aggressive_recomputation shows memory savings (7% on APS COFFEE model) with 2% QPS loss.

It also gives very promising signal on our auto ac experiments: https://docs.google.com/document/d/1S2qgMg1CwAQ4U1Ffuk2epbEOx06ogZhioX2jKCwL7ZQ/edit

 {F1426175073}

Test Plan:
APS COFFEE from silverlakeli
- Zoom of baseline job: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=927380488801910&tab=overview
- Zoom of job with aggressive_recomputation: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=1126815608217470&tab=overview

APS 1100x shrunk version:
- baseline: https://www.internalfb.com/mast/job/aps-yuzhenhuang-afe049505a
- test: https://www.internalfb.com/mast/job/aps-yuzhenhuang-709e41bf0d
Memory from 42.98% -> 41.04%.

Reviewed By: yf225, yuxihu, silverlakeli, richqyz

Differential Revision: D53248057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118943
Approved by: https://github.com/anijain2305, https://github.com/yanboliang
2024-02-03 00:17:03 +00:00
7bbd9befed Improve example for `torch.mode()` (#115308)
Fixes #89820 and improves the documentation.

Co-authored-by: Sam Gross <colesbury@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115308
Approved by: https://github.com/colesbury
2024-02-03 00:13:26 +00:00
c24ffc3f66 [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two features did not work together initially.

Thanks Jason for suggesting a neat way to integrate these two. cpp-wrapper currently does two codegen passes. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. Multi-kernel Python code is not generated for the second pass since it should not be needed.
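
A toy sketch of the "pick the faster variant at runtime" idea (not inductor's actual multi-kernel machinery, which benchmarks compiled Triton kernels and caches the choice; the helper and variant names below are hypothetical):

```python
import timeit

def pick_faster_kernel(kernels, args, warmup=3, rep=10):
    # Benchmark each equivalent kernel variant and return the fastest one.
    def bench(fn):
        for _ in range(warmup):
            fn(*args)
        return timeit.timeit(lambda: fn(*args), number=rep)
    return min(kernels, key=bench)

# Usage (hypothetical variant names):
# fastest = pick_faster_kernel([persistent_reduction, plain_reduction], (x,))
```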

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-03 00:06:21 +00:00
576383c2eb Add torch check for dtype within bilinear (#118900)
Fixes https://github.com/pytorch/pytorch/issues/117237
Short-term fix: when the dtypes do not match, the error is surfaced by a torch check.
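
For context, a small illustrative repro of the mismatched-dtype case that the check is meant to surface up front (exact error message may differ):

```python
import torch

x1 = torch.randn(3, 5)
x2 = torch.randn(3, 4, dtype=torch.float64)  # dtype mismatch with x1 and weight
weight = torch.randn(2, 5, 4)

try:
    torch.nn.functional.bilinear(x1, x2, weight)
except RuntimeError as e:
    # With the added check this fails with a clear dtype-mismatch message
    # instead of failing deeper inside the kernel.
    print(e)
```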

@ezyang a cpp test case is added
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118900
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-02-03 00:02:00 +00:00
a4355d6b9a Revert "Add --filter-rank to torchrun: allow logs filtering by rank (#118562)"
This reverts commit 73229b4f931f8cd1799b0905d61e3d8e85157bcd.

Reverted https://github.com/pytorch/pytorch/pull/118562 on behalf of https://github.com/xmfan due to breaks MAST precheck, flag naming conflict ([comment](https://github.com/pytorch/pytorch/pull/118562#issuecomment-1924916601))
2024-02-02 23:56:21 +00:00
63fd6883fd [c10d] logging utility for cpp-python stacktrace (#118924)
Users may not know which line of code called collectives in a big code base. When debugging, we can print a combined Python/C++ stacktrace, e.g. in case the user calls ``ProcessGroup.reduce`` instead of ``torch.distributed.reduce``.

```
LOG(INFO) << "ProcessGroupNCCL::_allgather_base stacktrace: "
                       << get_python_cpp_trace();
```

output (using _allgather_base as an example): one example python-part trace is ``all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838``
```
ProcessGroupNCCL::_allgather_base stacktrace: #0 torch::unwind::unwind() from ??:0
#1 torch::CapturedTraceback::gather(bool, bool, bool) from ??:0
#2 c10d::get_python_cpp_trace[abi:cxx11]() from :0
#3 c10d::ProcessGroupNCCL::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from ??:0
#4 c10d::ops::(anonymous namespace)::_allgather_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long) from Ops.cpp:0
#5 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
#6 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
#7 c10d::ProcessGroup::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from :0
#8 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0
#9 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#10 cfunction_call from /usr/local/src/conda/python-3.10.12/Objects/methodobject.c:543
#11 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215
#12 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112
#13 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#14 all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838
#15 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#16 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#17 wrapper from /data/users/weif/pytorch/torch/distributed/c10d_logger.py:75
#18 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#19 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#20 _all_gather_flat_param from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1399
#21 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#22 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#23 unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1308
#24 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#25 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#26 _unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:332
#27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#28 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#29 _pre_forward_unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:448
#30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#31 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#32 _pre_forward from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:413
#33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#35 forward from /data/users/weif/pytorch/torch/distributed/fsdp/fully_sharded_data_parallel.py:839
#36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#37 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#38 _call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1520
#39 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#40 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#41 _wrapped_call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1511
#42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#43 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.12/Objects/call.c:431
#44 slot_tp_call from /usr/local/src/conda/python-3.10.12/Objects/typeobject.c:7494
#45 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215
#46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112
#47 inner from /data/users/weif/pytorch/run_fsdp.py:72
#48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#49 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#50 run from /data/users/weif/pytorch/run_fsdp.py:76
#51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#53 main from /data/users/weif/pytorch/run_fsdp.py:133
#54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#55 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#56 <module> from /data/users/weif/pytorch/run_fsdp.py:137
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#58 PyEval_EvalCode from /usr/local/src/conda/python-3.10.12/Python/ceval.c:1134
#59 run_eval_code_obj from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1291
#60 run_mod from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1312
#61 pyrun_file from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1208
#62 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:456
#63 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:90
#64 pymain_run_file_obj from /usr/local/src/conda/python-3.10.12/Modules/main.c:357
#65 Py_BytesMain from /usr/local/src/conda/python-3.10.12/Modules/main.c:1090
#66 __libc_start_call_main from ??:0
#67 <unwind unsupported> from ??:0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118924
Approved by: https://github.com/kwen2501
2024-02-02 23:49:18 +00:00
a3cec6a7fa [ONNX] Eliminate redundant TODOs (#119060)
Remove titaiwangms/AllenTiTaiWang/titaiwang created TODOs:

1. Resolved TODOs
2. Turned TODOs into NOTEs if they are not actionable
3. Merged duplicated TODOs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119060
Approved by: https://github.com/kit1980, https://github.com/thiagocrepaldi
2024-02-02 23:37:52 +00:00
454e6b380c [export] Prevent specialization on backends (#118683)
Summary: https://github.com/pytorch/pytorch/issues/118289 shows that sometimes we will decompose into backend-specific operators, causing some specializations. We should probably avoid this by disabling these by default?

Test Plan: CI

Differential Revision: D53241300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118683
Approved by: https://github.com/zhxchen17
2024-02-02 23:33:59 +00:00
db2225da37 [export] fix forward test_lift_unlift (#119090)
Test Plan: fixes test

Reviewed By: zhxchen17

Differential Revision: D53367522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119090
Approved by: https://github.com/kit1980
2024-02-02 23:07:36 +00:00
9fe3693bbb [dynamo] bypass graph break due to masking if inference mode (#119056)
Relax the constraints in https://github.com/pytorch/pytorch/issues/114123 when we're in inference mode.

Test Plan:
See added tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119056
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-02-02 22:53:23 +00:00
4d45c68ca6 [fx] fix for subgraph rewriter (#119052)
The semantics of `try_get_attr` are to default to None if the attribute doesn't exist, but we were throwing an exception from `get_submodule`. Catch that exception and return None.
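
A hypothetical sketch of the described behavior (names simplified; not the exact helper in torch.fx.subgraph_rewriter):

```python
def try_get_attr(gm, target):
    # Default to None when the attribute does not exist, instead of letting
    # get_submodule's AttributeError propagate to the rewriter.
    try:
        return gm.get_submodule(target)
    except AttributeError:
        return None
```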

Differential Revision: [D53358747](https://our.internmc.facebook.com/intern/diff/D53358747/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119052
Approved by: https://github.com/angelayi
2024-02-02 22:47:53 +00:00
c908caf92b [DeviceMesh] Allow 1d slice from 1d mesh (#118895)
Fixes [#118851](https://github.com/pytorch/pytorch/issues/118851)

i.e. with `mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("dp",))`, slicing `dp_mesh = mesh["dp"]` should still work, just returning the mesh without recording a parent mesh.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118895
Approved by: https://github.com/wanchaol
2024-02-02 22:00:24 +00:00
6379010ebd [dynamo][higher order ops] Remove restore side effects logic (#118420)
The problem was exposed in https://github.com/pytorch/pytorch/pull/118071 where the control flow tests were always recompiling. The issue turned out to be that the same nonlocal variable used in `true_fn` and `false_fn` was getting lifted twice, thus creating two inputs in the main Fx graph. Dynamo's tensor guards do not like this because they want all input tensors to be non-aliased.

We already have logic, using the side-effects infra, to check if two different sources (the closure of true_fn and the closure of false_fn) point to the same tensor. But we were restoring side_effects after subtracing the true and false branches. This is not needed anymore: side_effects tracks both read-only accesses and actual writes to the variables. For higher order ops, any mutation which is not read-only leads to a graph break and safely exits the tracing. For read-only side effects, it doesn't matter.

This PR removes the restoring of side_effects, which turns on the logic for checking if two different sources point to the same tensor, and thus lifts the common non local tensor to just once in the main graph.

Related discussion at https://github.com/pytorch/pytorch/issues/113235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118420
Approved by: https://github.com/ydwu4, https://github.com/mlazos, https://github.com/zou3519
ghstack dependencies: #118975
2024-02-02 21:54:22 +00:00
113138aa55 add test cases for GradScaler on CPU (#109994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109994
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-02-02 21:49:07 +00:00
426339e4de Add FakeTensor support to torch._utils._rebuild_tensor (#108186)
Partially fixes https://github.com/pytorch/pytorch/issues/105077

Repro:

```python
import tempfile
import torch
from torch._subclasses import fake_tensor

class TheModelClass(torch.nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.fc1 = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.fc1(x)

with tempfile.NamedTemporaryFile() as state_dict_file:
    # Create state_dict to be loaded later
    model = TheModelClass()
    torch.save(model.state_dict(), state_dict_file.name)

    fake_mode = fake_tensor.FakeTensorMode()
    with fake_mode:
        # This is where the bug is triggered
        state_dict = torch.load(state_dict_file.name)
```

Error:

```bash
Traceback (most recent call last):
  File "issue_gh_torch_105077.py", line 22, in <module>
    state_dict = torch.load(state_dict_file.name)
  File "/opt/pytorch/torch/serialization.py", line 1014, in load
    return _load(opened_zipfile,
  File "/opt/pytorch/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/opt/pytorch/torch/_utils.py", line 205, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/_utils.py", line 184, in _rebuild_tensor
    return t.set_(storage._untyped_storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1288, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1468, in dispatch
    self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1733, in invalidate_written_to_constants
    _, new_kwargs = normalize_function(
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 297, in normalize_function
    torch_op_schemas = get_signature_for_torch_op(target)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in get_signature_for_torch_op
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in <listcomp>
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 70, in _torchscript_schema_to_signature
    arg_type = _torchscript_type_to_python_type(arg.type)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 64, in _torchscript_type_to_python_type
    return eval(ts_type.annotation_str, _type_eval_globals)
  File "<string>", line 1, in <module>
NameError: name 'Storage' is not defined
```

This PR adds the ability to create fake tensors during `torch.load` by wrapping the `torch.Tensor.set_` call in a `torch.utils._mode_utils.no_dispatch()` context to skip the fake-mode dispatcher for it and thus create a real tensor. It later calls `fake_mode.from_tensor(t)` to finally create the fake tensor.
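
A minimal standalone sketch of the two ingredients the fix combines (illustrative only, not the actual `_rebuild_tensor` code):

```python
import torch
from torch._subclasses import fake_tensor
from torch.utils._mode_utils import no_dispatch

fake_mode = fake_tensor.FakeTensorMode()
storage = torch.randn(6).untyped_storage()

with fake_mode:
    # Skip the fake-mode dispatcher so set_() builds a real tensor ...
    with no_dispatch():
        t = torch.empty(0)
        t.set_(storage, 0, (2, 3), (3, 1))

# ... and then wrap the real tensor as a FakeTensor.
fake_t = fake_mode.from_tensor(t)
print(type(fake_t))  # <class 'torch._subclasses.fake_tensor.FakeTensor'>
```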

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108186
Approved by: https://github.com/ezyang
2024-02-02 20:35:38 +00:00
3b41793412 Purge redundant module init tests (#119028)
Fixes #118784

This test file is old and redundant; coverage is maintained in `test_modules.py` via the `test_factory_kwargs` set of tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119028
Approved by: https://github.com/zou3519
2024-02-02 20:17:00 +00:00
a69016a741 Add lowering to special.bessel_j1 (#118992)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118992
Approved by: https://github.com/peterbell10
2024-02-02 20:16:08 +00:00
c7ba5f6c6f [AOTI] Fix a cpp kernel missing arg type issue (#119021)
Summary: The current way of fetching the kernel arg types only works for tensors, not symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119021
Approved by: https://github.com/aakhundov, https://github.com/hl475, https://github.com/khabinov
2024-02-02 20:11:58 +00:00
debc3b3254 Download reports only if they're necessary (#119027)
Previously we were downloading all of (eager311, dynamo38, dynamo311).
Now we just download what's necessary. This is useful for
update_failures.py because the dynamo tests finish much faster than the
eager tests and it only needs the result from the dynamo tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119027
Approved by: https://github.com/jamesjwu
ghstack dependencies: #118874, #118882, #118931
2024-02-02 20:11:01 +00:00
a68cf3ef7d update_failures.py: add option to also remove "skipped" tests (#118931)
Previously, you could run update_failures.py (with a commit hash) and it
would add new expected failures and skips for newly failing tests and
remove expected failures for newly passing tests.

This PR teaches update_failures.py to also remove skips for tests that
are now passing without them.

The way we do this is:
- dynamo_test_failures.py doesn't actually skip tests -- it runs the
  test and then suppresses the signal.
- if the test actually passed, then the test gets skipped with a special
  skip message
- we teach update_failures.py to look for the presence of that skip
  message (a sketch of this mechanism is shown below).
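
A hypothetical sketch of that mechanism (marker string and helper name invented for illustration):

```python
import functools
import unittest

UNEXPECTED_SUCCESS_MSG = "expected failure, but the test passed"  # hypothetical marker

def suppress_expected_failure(test_fn):
    # Run the test and suppress an expected failure; if the test unexpectedly
    # passes, skip with a recognizable message so a tool like
    # update_failures.py can spot the entry and remove the stale skip.
    @functools.wraps(test_fn)
    def wrapper(*args, **kwargs):
        try:
            test_fn(*args, **kwargs)
        except Exception:
            return  # failure was expected; suppress the signal
        raise unittest.SkipTest(UNEXPECTED_SUCCESS_MSG)
    return wrapper
```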

Test Plan:
- Used this to generate https://github.com/pytorch/pytorch/pull/118928
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118931
Approved by: https://github.com/yanboliang
ghstack dependencies: #118874, #118882
2024-02-02 20:11:01 +00:00
1de50f8654 [HigherOrderOp] fix stack trace to report user stack (#118826)
Fixes https://github.com/pytorch/pytorch/issues/111020

For the following code:
```python
import torch
import torch._higher_order_ops.wrap

glob = []

def f(x):
    glob.append(x)
    return x.clone()

@torch.compile(backend='eager', fullgraph=True)
def g(x):
    return torch.ops.higher_order.wrap(f, x)

x = torch.randn(3)
g(x)
```

The stacktrace now becomes:
```
[2024-02-01 15:23:34,691] [0/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting wrap, we were unable to trace function `f` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] Traceback (most recent call last):
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 381, in speculate_subgraph
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     output = f.call_function(tx, args, sub_kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 278, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return super().call_function(tx, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 86, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return tx.inline_user_function_return(
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2261, in inline_call
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return cls.inline_call_(parent, func, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2370, in inline_call_
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     tracer.run()
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     and self.step()
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     getattr(self, inst.opname)(inst)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return inner_fn(self, inst)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     self.call_function(fn, args, {})
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     self.push(fn.call_function(self, args, kwargs))
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/misc.py", line 583, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return self.obj.call_method(tx, self.name, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 330, in call_method
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return super().call_method(tx, name, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 241, in call_method
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     tx.output.side_effects.mutation(self)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 325, in mutation
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     self.check_allowed_side_effect(var)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 157, in check_allowed_side_effect
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     unimplemented(
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 190, in unimplemented
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     raise Unsupported(msg)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] torch._dynamo.exc.Unsupported: HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)
Traceback (most recent call last):
  File "/home/yidi/local/pytorch/test.py", line 219, in <module>
    g(x)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 615, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
    return _compile(
  File "/home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 650, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 155, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 496, in transform
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2125, in run
    super().run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1227, in call_function
    p_args, p_kwargs, example_value, body_r, treespec, _ = self.create_wrapped_node(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1190, in create_wrapped_node
    ) = speculate_subgraph(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 453, in speculate_subgraph
    raise ex
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 381, in speculate_subgraph
    output = f.call_function(tx, args, sub_kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 278, in call_function
    return super().call_function(tx, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 86, in call_function
    return tx.inline_user_function_return(
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2261, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2370, in inline_call_
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/misc.py", line 583, in call_function
    return self.obj.call_method(tx, self.name, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 330, in call_method
    return super().call_method(tx, name, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 241, in call_method
    tx.output.side_effects.mutation(self)
  File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 325, in mutation
    self.check_allowed_side_effect(var)
  File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 157, in check_allowed_side_effect
    unimplemented(
  File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 190, in unimplemented
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)

from user code:
   File "/home/yidi/local/pytorch/test.py", line 216, in g
    return torch.ops.higher_order.wrap(f, x)
  File "/home/yidi/local/pytorch/test.py", line 211, in f
    glob.append(x)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118826
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2024-02-02 20:08:01 +00:00
3c0c387429 Support symbolic min/max on unbacked SymInt (#118953)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118953
Approved by: https://github.com/ColinPeppler, https://github.com/aakhundov
2024-02-02 20:01:46 +00:00
f641c55c9b Make torch._dynamo.mark_static work inside graph (#118962)
I livecoded the entire PR authoring process, you can watch it at https://youtu.be/06HuwNR9-uI

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118962
Approved by: https://github.com/yanboliang
2024-02-02 20:01:27 +00:00
29f99a3365 Update XLA commit pin (#118871)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118871
Approved by: https://github.com/albanD
2024-02-02 19:55:04 +00:00
bd8c91efc0 Remove some now-succeeding tests from dynamo_test_failures.py (#118928)
Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118928
Approved by: https://github.com/aorenste, https://github.com/anijain2305, https://github.com/yanboliang
2024-02-02 19:49:26 +00:00
bf4e171539 [export] support non-persistent buffers (#118969)
Summary:
X-link: https://github.com/pytorch/executorch/pull/1817

Basic support for non-persistent buffers, which are buffers that do not show up in the state dict.
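
For reference, a quick illustration of the distinction using standard `register_buffer` behavior:

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("persistent_buf", torch.zeros(3))
        self.register_buffer("non_persistent_buf", torch.zeros(3), persistent=False)

m = M()
# Only the persistent buffer appears in the state dict ...
print(list(m.state_dict().keys()))        # ['persistent_buf']
# ... but both are still buffers on the module itself.
print([n for n, _ in m.named_buffers()])  # ['persistent_buf', 'non_persistent_buf']
```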

One weird twist is that most of our other systems (FX, aot_export, dynamo) have completely buggy handling of non-persistent buffers. I tried to go on a wild goose chase to fix them all, but it got to be too much. So I introduced some sad rewrite passes in `_export` to make the final state dict correctly align with the original module's state dict.

This exposed some bugs/ambiguous handling of parameters/buffers in existing test code. For example, `TestSaveLoad.test_save_buffer` traced over a module that was not in the root module hierarchy and caused some weird behavior. I think we should error explicitly on use cases like this: https://github.com/pytorch/pytorch/issues/118410. For now I just rewrote the tests or skipped them.

As a side effect, this diff tightened up quite a few sloppy  behaviors around state dict handling:
- Tensor attributes were getting promoted to be buffers—bad!
- Tracing through a module not in the children of the root module would add its parameters/buffers to the state dict—bad!

This behavior is unlikely to show up in user code since the model would be totally broken, but did show up in a bunch of tests.

#buildmore

Test Plan:
unit tests
sandcastle

Differential Revision: D53340041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118969
Approved by: https://github.com/guangy10, https://github.com/huydhn, https://github.com/titaiwangms
2024-02-02 19:16:08 +00:00
b5ba80828f [optim] Rectify capturable testing and fix bugs! (#118326)
This PR fixes several bugs, listed in priority:
1. `load_state_dict` with a nontensor step was incorrect for capturable and fused implementations since we don't create the tensors on the right device in `__setstate__`. This has been fixed.
2. The most recently added capturable implementations forgot the check that all tensors should be on CUDA for eager. We've now added those checks
3. The most recent change in Adamax only adds capturable for foreach but will silently be incorrect for forloop/single-tensor. I've added erroring and modified testing with many many many skips for that. Honestly my preference after this PR has only been further cemented  that we should just do the single tensor and multi tensor capturable implementations together in the future. @mlazos
4. The conditional for adding cuda-supported configs for the optimizer infos was incorrect! So we hadn't been testing capturable! This also stands rectified and was the trigger for this PR in the first place.
5. In a similar way, the conditional for `_get_optim_inputs_including_global_cliquey_kwargs` was incorrect sometimes as well. This has also been corrected.

The following is not a bug, but is just something to make life simpler by not needing to handle Nones: `optim_input_funcs` must now mandatorily take in a `device`, which could be a string or a torch.device.

Details for posterity:
4. Running the test_foreach_matches_forloop test and printing the configs that get run shows that the capturable configs are now included, which is correct.
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5d50138f)]$ python test/test_optim.py -k test_foreach_matches_forloop_AdamW_cuda
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={}, desc=default
params=None, kwargs={'lr': 0.01}, desc=non-default lr
params=None, kwargs={'weight_decay': 0.1}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'maximize': True}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True}, desc=amsgrad
params=None, kwargs={'capturable': True}, desc=capturable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True}, desc=capturable, amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True}, desc=Tensor lr with capturable and amsgrad
.
----------------------------------------------------------------------
Ran 1 test in 19.229s

OK
```
5. Running the test_optimizer_can_be_printed test (which calls `_get_optim_inputs_including_global_cliquey_kwargs`) and printing what gets run is also now correct.
```
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={'differentiable': False}, desc=default
params=None, kwargs={'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.1, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': True}, desc=amsgrad & differentiable
.params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable
params=None, kwargs={'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable & foreach
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable & differentiable
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable, amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable, amsgrad & fused
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad & foreach
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=Tensor lr with capturable and amsgrad & differentiable
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=Tensor lr with capturable and amsgrad & fused
.
----------------------------------------------------------------------
Ran 2 tests in 11.112s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118326
Approved by: https://github.com/mlazos
2024-02-02 19:13:00 +00:00
8b00e5aa12 [FSDP2] Added pre/post-backward (#118004)
This PR adds the pre- and post-backward logic:
- **Pre-backward hook:** `FSDPState` and `FSDPParamGroup` define this, and `FSDPState` is responsible for registering it, since its pre-backward should run even if the `FSDPState` does not manage any parameters (in case it is the root).
- **Post-backward hook:** Only `FSDPParamGroup` defines this, since the post-backward hook reshards parameters and reduce-scatters gradients (functionality only needed with managed parameters). The `FSDPParamGroup` is responsible for registering it.
- **Post-backward final callback:** `FSDPState` defines this, and each `FSDPParamGroup` defines a `finalize_backward()` to call in the final callback.

### Pre-Backward

The pre-backward hook is registered on the module outputs (that require gradient), and it should run when the first such output has its gradient computed. The hook may run multiple times per backward, once per module forward. Specifically, there will be one `(pre-backward, post-backward)` interval for each of the module's `forward()` calls. This is in contrast with the existing FSDP semantics, which only define a single `(pre-backward, post-backward)` interval that is equivalent to the union of this FSDP's `(pre-backward, post-backward)` intervals. This avoids spiking memory from having multiple modules not resharding and avoids some autograd edge cases.

We implement the pre-backward hook by having a flag that is set upon the first call to disable subsequent calls. This flag could be maintained by FSDP, but for a cleaner design, we augment `register_multi_grad_hook` with a `mode="any"` option and use that instead.

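A rough sketch of how the `mode="any"` hook behaves (the kwarg and its semantics are taken from this stack's description, so treat the snippet as illustrative rather than as the final API):

```python
import torch
from torch.autograd.graph import register_multi_grad_hook

x = torch.randn(3, requires_grad=True)
outs = (x.sin(), x.cos())  # stand-ins for a module's forward outputs

def pre_backward(grad):
    # fires once, on the first output gradient of this forward; this is where
    # FSDP2 would unshard the parameters for the upcoming backward computation
    print("pre-backward fired")

handle = register_multi_grad_hook(outs, pre_backward, mode="any")
(outs[0] + outs[1]).sum().backward()  # prints "pre-backward fired" exactly once
handle.remove()
```
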
### Post-Backward

The post-backward hook is equivalent to a module full backward hook (`nn.Module.register_full_backward_hook`) except it adds pytree logic to work with data structures other than just flat `Tensor` args passed to `nn.Module.forward`. If we were to use `register_full_backward_hook`, then the hook could fire early (before all gradients for the module have been computed). Most internal models use custom data structures as `forward` inputs, and they find that unifying under pytree is an acceptable solution.

Unlike existing FSDP, we are able to reshard the parameters in the post-backward hook _before_ 'concatenating' the autograd-computed gradients, achieving a lower peak memory usage. (Existing FSDP has `SplitWithSizesBackward` that calls a `CatArrayBatched`, and here we have the reduce-scatter copy-in.)

### Final Callback
The final callback runs as a queued callback to the autograd engine, meaning that it runs at the end of backward.

In the future, if we do not want to wait for the reduce-scatter (or similar for CPU offloading), we can augment the final callback. The code is written such that each reduce-scatter can be waited on separately (via CUDA event).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118004
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #117950, #117955, #117973, #117975
2024-02-02 19:10:11 +00:00
a688b4b397 Update pointwise concat heuristics (#118453)
This PR updates the heuristics for lowering to pointwise cat so that it triggers either when we have a small number (up to 8) of arbitrary pointwise inputs, or with up to 128 pointwise inputs when they correspond to simple pointwise kernels or plain data movement.

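For context, the kind of user-level pattern the heuristic targets looks roughly like this (shapes, names, and the input count are made up for illustration):

```python
import torch

def f(xs):
    # a concat whose inputs are all simple pointwise ops / plain data movement
    return torch.cat([x.relu() for x in xs], dim=0)

xs = [torch.randn(64, 64) for _ in range(32)]  # 32 inputs, within the new 128 limit
out = torch.compile(f)(xs)
```
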
This originally came from an internal use case which noticed poor codegen: https://fb.workplace.com/groups/1075192433118967/posts/1365770660727808.

In our initial heuristics for lowering to a masked loads pointwise concat kernel we were conservative with the number of inputs we would allow by setting a maximum of 4.

However, I've noticed that we can fuse to pointwise_concat codegen much more aggressively while remaining performant.

In the following benchmark I compare foreach and pointwise_cat codegen : https://gist.github.com/eellison/2bf83231f2940d9b9b33eb4721d35e15.

Here is the [csv output](https://gist.github.com/eellison/529da68b326e1d832c26c1dcdb42c313). When gelu is applied on neither the prologue nor the epilogue, pointwise concat is faster (this is just the data movement case). Applying gelu on the epilogue does not affect this result. When you apply gelu on the prologue, then as the number of inputs increases you end up getting register spills with pointwise concat and it gets slower.

![image](https://github.com/pytorch/pytorch/assets/11477974/0d6612b8-d60f-4984-99eb-9b518cd4af74)

![image](https://github.com/pytorch/pytorch/assets/11477974/4dda3341-68f9-4d1d-8334-67d7196371fb)

When I benchmarked with relu instead of gelu, the pointwise and foreach variants only evened out once the number of inputs reached 256.

![image](https://github.com/pytorch/pytorch/assets/11477974/985418f8-ddb8-47c1-baea-ccd9de72cd7f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118453
Approved by: https://github.com/jansel, https://github.com/shunting314, https://github.com/mlazos
ghstack dependencies: #118452
2024-02-02 18:31:37 +00:00
3a1ae86a93 Fix internal failure D53291154 (#118907)
Fix internal failure D53291154

From Alban: the change is breaking because the `alpha` argument is now keyword-only (via the `*` marker), while it was previously accepted positionally for the `rsub.Scalar` overload.

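A minimal illustration of the signature change (hedged; the exact overload resolution inside the compiled frame is per the traceback below):

```python
import torch

x = torch.randn(3)

# keyword form: computes 1.0 - 2.0 * x, and keeps working after the change
torch.rsub(x, 1.0, alpha=2.0)

# the internal call site passed alpha positionally for the rsub.Scalar overload,
# which the new `*` marker in the Python reference implementation rejects
```
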
```
 _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "torch/_dynamo/eval_frame.py", line 453, in _fn
    return fn(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "torch/_dynamo/eval_frame.py", line 615, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
    return _compile(
  File "python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "torch/_dynamo/convert_frame.py", line 650, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "torch/_dynamo/utils.py", line 248, in time_wrapper
    r = func(*args, **kwargs)
  File "torch/_dynamo/convert_frame.py", line 531, in compile_inner
    out_code = transform_code_object(code, transform)
  File "torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "torch/_dynamo/convert_frame.py", line 155, in _fn
    return fn(*args, **kwargs)
  File "torch/_dynamo/convert_frame.py", line 496, in transform
    tracer.run()
  File "torch/_dynamo/symbolic_convert.py", line 2125, in run
    super().run()
  File "torch/_dynamo/symbolic_convert.py", line 787, in run
    and self.step()
  File "torch/_dynamo/symbolic_convert.py", line 750, in step
    getattr(self, inst.opname)(inst)
  File "torch/_dynamo/symbolic_convert.py", line 469, in wrapper
    return inner_fn(self, inst)
  File "torch/_dynamo/symbolic_convert.py", line 1249, in CALL_FUNCTION_KW
    self.call_function(fn, args, kwargs)
  File "torch/_dynamo/symbolic_convert.py", line 651, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "torch/_dynamo/variables/torch.py", line 614, in call_function
    tensor_variable = wrap_fx_proxy(
  File "torch/_dynamo/variables/builder.py", line 1285, in wrap_fx_proxy
    return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
  File "torch/_dynamo/variables/builder.py", line 1370, in wrap_fx_proxy_cls
    example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
  File "torch/_dynamo/utils.py", line 1653, in get_fake_value
    raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
  File "torch/_dynamo/utils.py", line 1599, in get_fake_value
    ret_val = wrap_fake_exception(
  File "torch/_dynamo/utils.py", line 1140, in wrap_fake_exception
    return fn()
  File "torch/_dynamo/utils.py", line 1600, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "torch/_dynamo/utils.py", line 1720, in run_node
    raise RuntimeError(fn_str + str(e)).with_traceback(e.__traceback__) from e
  File "torch/_dynamo/utils.py", line 1699, in run_node
    return node.target(*args, **kwargs)
  File "torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1637, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1975, in dispatch
    return self._dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 2190, in _dispatch_impl
    r = func(*args, **kwargs)
  File "torch/_ops.py", line 571, in __call__
    return self_._op(*args, **kwargs)
  File "torch/_prims_common/wrappers.py", line 252, in _fn
    result = fn(*args, **kwargs)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118907
Approved by: https://github.com/lezcano
2024-02-02 18:17:34 +00:00
fd000340fd ProcessGroupGloo::allgather_into_tensor_coalesced (#118910)
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage gaps around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues, but I'd rather just fix the coverage.

**I think it's reasonable to think of this as a fix rather than adding new features. This is orthogonal to the potential reduction of gloo usage**.

### This PR

This PR adds `ProcessGroupGloo::allgather_into_tensor_coalesced`.  This is very straightforward - `ProcessGroupGloo` already supports `allgather_coalesced`, to which we can funnel `allgather_into_tensor_coalesced`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118910
Approved by: https://github.com/shuqiangzhang
2024-02-02 17:53:28 +00:00
70605d150b [quant][pt2] Add move_exported_model_to_train (#113492)
Summary: This is the equivalent API to `model.train()` for
exported models, analogous to `move_exported_model_to_eval`.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout_inplace
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout_bn

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113492
Approved by: https://github.com/jerryzh168, https://github.com/tugsbayasgalan
2024-02-02 17:39:47 +00:00
52b679d415 [BE] Cleanup CircleCI README (#118927)
All of the information there is out-of-date as CI/CD has long migrated to the GitHub Actions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118927
Approved by: https://github.com/kit1980
2024-02-02 17:08:20 +00:00
0e5fe4b3ae [AOTI] Fix a RAIIAtenTensorHandle premature deallocation bug (#118963)
Summary: generate_index_put_fallback currently generates something like the following,

```
AtenTensorHandle tensor_handle_array_1[] = {nullptr, nullptr, arg1_1, wrap_with_raii_handle_if_needed(tmp_tensor_handle_0)};
```

The problem is wrap_with_raii_handle_if_needed creates a RAIIAtenTensorHandle which only lives during this tmp array initialization. After the initialization is done, the RAIIAtenTensorHandle dies and releases the underlying Tensor, so when tensor_handle_array_1 is later passed to aoti_torch_index_put_out, some of its AtenTensorHandle elements have become invalid, causing a segfault.

Differential Revision: [D53339348](https://our.internmc.facebook.com/intern/diff/D53339348)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118963
Approved by: https://github.com/aakhundov
2024-02-02 16:49:45 +00:00
53da422582 [export] Move _create_graph_module_for_export to torch/export (#118893)
Summary: I have to keep the torch/_export one to not break executorch...

Test Plan: CI

Reviewed By: avikchaudhuri

Differential Revision: D52842750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118893
Approved by: https://github.com/zhxchen17
2024-02-02 16:40:01 +00:00
b374f8987d [ROCm] Hipify trie re-engineering and adding unit tests (#118433)
Fixes #[117504](https://github.com/pytorch/pytorch/issues/117504)

Re-engineering Hipify Trie:
(1) Re-engineering Trie.
(2) More documentation or comments for easier understanding
(3) Created a set of unit tests (class `TestHipifyTrie`) to test the Trie data structure and APIs.

Test:
```
root@xxx:/development/pytorch# pytest test/test_utils.py -k TestHipifyTrie
==================================================================================================== test session starts ====================================================================================================
platform linux -- Python 3.9.18, pytest-7.3.2, pluggy-1.3.0
rootdir: /dockerx/development/pytorch
configfile: pytest.ini
plugins: flakefinder-1.1.0, rerunfailures-13.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, shard-0.1.2, hypothesis-5.35.1
collected 11453 items / 11445 deselected / 8 selected
Running 8 items in this shard

test/test_utils.py ........                                                                                                                                                                                           [100%]

============================================================================================ 8 passed, 11445 deselected in 3.84s ============================================================================================
root@xxx:/development/pytorch#
```
Also diffed the contents modified and generated by this tool between the original and the new hipify_python.py script, and verified there is no difference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118433
Approved by: https://github.com/malfet, https://github.com/jeffdaily
2024-02-02 16:04:59 +00:00
65efbf078c Optimize dict keys guard when all the keys are constant (#118855)
We also rename ODICT_KEYS and make it use a list rather than a string.

Split from https://github.com/pytorch/pytorch/pull/118630.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118855
Approved by: https://github.com/peterbell10
ghstack dependencies: #117982, #118098, #117983, #117625, #118194, #118003, #118208, #118199, #118535
2024-02-02 14:42:56 +00:00
cdbc29e91a [dynamo,optim] Use the actual sources from the parameters when tracing "params" in an optimizer (#118535)
Fixes the unnecessary guards described at https://github.com/pytorch/pytorch/pull/117983#discussion_r1467622149

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118535
Approved by: https://github.com/mlazos
ghstack dependencies: #117982, #118098, #117983, #117625, #118194, #118003, #118208, #118199
2024-02-02 14:42:56 +00:00
a3770bcf10 Add functools.partial and UserDefinedFunction to dict keys (#118199)
This is tested by `fullgraph=True` in the `test_getattr_dict` test.
I can write a one-off test for both if that's needed.

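A hypothetical illustration of the kind of key this enables (the function and dict below are made up; the real coverage is the `test_getattr_dict` test mentioned above):

```python
import functools
import torch

def scale(x, factor):
    return x * factor

# functools.partial objects (and user-defined functions) can now be dict keys
table = {functools.partial(scale, factor=2.0): "double",
         functools.partial(scale, factor=0.5): "halve"}

@torch.compile(fullgraph=True)
def f(x):
    out = x
    for fn in table:          # iterate keys and call them inside the compiled region
        out = fn(out)
    return out

print(f(torch.ones(3)))       # tensor([1., 1., 1.]) since 2.0 * 0.5 == 1.0
```
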
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118199
Approved by: https://github.com/peterbell10, https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #117982, #118098, #117983, #117625, #118194, #118003, #118208
2024-02-02 14:42:35 +00:00
9d592c14eb Don't assume all subclasses of BaseUserFunctionVariable have a fn attribute (#118208)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118208
Approved by: https://github.com/anijain2305
ghstack dependencies: #117982, #118098, #117983, #117625, #118194, #118003
2024-02-02 14:42:06 +00:00
188628d99e [dynamo,easy] Add Typing variable to possible dict keys (#118003)
With this one, the only keys we are not tracing properly in the
(non-skipped) test suite are `OutDtypeHigherOrderVariable()`, and a
couple `UserDefinedObjectVariables`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118003
Approved by: https://github.com/anijain2305, https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #117982, #118098, #117983, #117625, #118194
2024-02-02 14:40:21 +00:00
ecf7d0e8ac Make dict guards amenable to the CSE pass (#118194)
Supersedes https://github.com/pytorch/pytorch/pull/118096 as a much cleaner and simpler solution.

It is difficult to write a test for this one without exposing too much
of the internals. You can see empirically that it works by running
```
TORCHDYNAMO_PRINT_GUARDS=1 TORCH_LOGS=+guards  python test/test_optim.py -k test_can_load_older_state_dict_ASGD_cpu_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118194
Approved by: https://github.com/jansel, https://github.com/peterbell10
ghstack dependencies: #117982, #118098, #117983, #117625
2024-02-02 14:38:48 +00:00
eb2bdfae88 Make variables in dict LazyTrackers (not lazily guarded yet) and avoid using DICT_KEYS guard (#117625)
Make variables in dict lazy and remove DICT_KEYS guard.

We build the keys of a dict depth-first and we rely on the guards of
each element in the dict to create the correct guards. This allows us to
remove the rather buggy DICT_KEYS guard and make the guard lazy.
The guards are not completely lazy yet, as we instantiate them in
`_HashableTracker._eq_impl` but it should be possible to make them
truly lazy.

Also, adding new types to the supported types within keys should be less
error prone.

This is marginally less efficient when we graph break, but in turn we
should graph break much less. It also makes the dicts code easier to maintain
(removes `is_hashable_python_var`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117625
Approved by: https://github.com/jansel, https://github.com/peterbell10, https://github.com/anijain2305
ghstack dependencies: #117982, #118098, #117983
2024-02-02 14:38:08 +00:00
75a5c41921 [dynamo,optim] Place guards on the args before assuming they exist (#117983)
This enables the new way of writing guards for dicts. Before we were
doing things like
```
  L['self'].param_groups[0][___dict_keys_getitem(L['self'].param_groups[0], 0)][3] is L['self'].param_groups[0]['params'][3]
```
without knowing whether `L['self'].param_groups[0][___dict_keys_getitem(L['self'].param_groups[0], 0)]` was a list.

On a different note, I'll probably write a pass to recover the previous
way to place guards on dicts via something like `DICT_KEYS` as an
optimisation, as it seems relevant for optimisers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117983
Approved by: https://github.com/mlazos
ghstack dependencies: #117982, #118098
2024-02-02 14:37:46 +00:00
b1da929df9 Use SourcelesBuilder in BuiltinVariable (#118098)
This was failing when fetching a dictionary from a module

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118098
Approved by: https://github.com/peterbell10, https://github.com/anijain2305
ghstack dependencies: #117982
2024-02-02 14:37:23 +00:00
0f3e20a1b6 Print the malformed guard when there's a guard error. (#117982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117982
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-02-02 14:37:05 +00:00
292243d1aa Automatically pull test reports from CI (#118882)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118882
Approved by: https://github.com/jamesjwu, https://github.com/yanboliang
ghstack dependencies: #118874
2024-02-02 14:18:56 +00:00
0f7954107a Add ability to print histogram as a github issue (#118874)
Adds the ability to print the failures histogram into lines that can be
copy-pasted into a github issue.

I used this to generate https://github.com/orgs/pytorch/projects/43

Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118874
Approved by: https://github.com/jamesjwu
2024-02-02 14:18:56 +00:00
520771d7b3 refactor lazy init to device-agnostic (#118846)
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and to change `maybe_initialize_cuda` to `maybe_initialize_device` so that lazy initialization still works for CUDA while remaining extensible.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.

# Additional Context
No new unit tests are needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118846
Approved by: https://github.com/malfet
2024-02-02 12:10:39 +00:00
2de327cedc Fixed an illegal memory access in cross entropy loss when using an index that is not a valid class (#117561)

Fixes #117532.

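A minimal repro sketch of the failure mode (shapes are made up; requires a CUDA device, and the out-of-range index is expected to trip a device-side assert rather than corrupt memory after this fix):

```python
import torch
import torch.nn.functional as F

if torch.cuda.is_available():
    logits = torch.randn(4, 10, device="cuda")
    # class index 42 is out of range for 10 classes; previously this could cause
    # an illegal memory access instead of a clean error
    target = torch.tensor([0, 3, 9, 42], device="cuda")
    F.cross_entropy(logits, target)
```
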
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117561
Approved by: https://github.com/mikaylagawarecki
2024-02-02 11:03:16 +00:00
05ac295177 [export] Fix bug with user input mutations (#118942)
We hit an edge case where the exported graph contains placeholder nodes whose names conflict with names from aot_export; in that case we did not update `user_inputs_to_mutate` in the graph signature correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118942
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17
2024-02-02 09:02:04 +00:00
cc46829f96 [Inductor] GEMM shape padding improvements (#118522)
Improvements to shape padding logic in torch/_inductor/pad_mm.py

These changes could lead up to 14% perf improvement for certain Meta internal models in experiments.

Most notably:
  * 1.) Use the aten.const_pad_nd operation to pad tensors in a single op instead of using multiple steps involving intermediate buffers (see the sketch after this list). This appears to be more performant than the previous logic, confirmed by profiling & benchmarking results (Meta internal).
  * 2.) Make many paddings unnecessary by using an explicitly transposed GEMM when either the M or N dimension is properly aligned but the other is not, configurable via config.shape_pad_use_transpose (default: True).
  * 3.) Enable shape padding for the Inductor CUDA / Cutlass backend for all GEMM ops where Cutlass would be enabled, without benchmarking in that case.
  * Add a config flag to always pad shapes (without benchmarking first), configurable via config.force_shape_pad (default: False).
  * Added several new unit tests to ensure tensors are padded such that they meet all alignment requirements after padding.

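A rough sketch of the single-op padding idea from item 1 (the alignment value and shapes are assumptions for illustration; `torch.nn.functional.pad` with the default constant mode lowers to `aten.constant_pad_nd`):

```python
import torch

M, K, align = 1000, 1018, 8          # K is not a multiple of the assumed alignment
a = torch.randn(M, K)

pad_k = (-K) % align                 # columns needed to reach the next aligned size
a_padded = torch.nn.functional.pad(a, (0, pad_k))   # one constant-pad call, no extra copies
print(a_padded.shape)                # torch.Size([1000, 1024])
```
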
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118522
Approved by: https://github.com/jansel, https://github.com/eellison
2024-02-02 08:50:06 +00:00
855d5f144e Relax MKL_INT assumption to int64_t (#118946)
When I built PyTorch on Windows with the latest MKL, it reported:
```
sources\pytorch\aten\src\ATen/cpu/vml.h(106): error C2338: static_assert failed: 'MKL_INT is assumed to be int32_t'
```
It should be safe to relax the restriction to int64_t.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118946
Approved by: https://github.com/ezyang
2024-02-02 07:11:47 +00:00
2964170f3a Revert "[optim] Rectify capturable testing and fix bugs! (#118326)"
This reverts commit d947b9d50011ebd75db2e90d86644a19c4fe6234.

Reverted https://github.com/pytorch/pytorch/pull/118326 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it looks like there are some relevant failures in trunk d947b9d500, may be a land race ([comment](https://github.com/pytorch/pytorch/pull/118326#issuecomment-1923125676))
2024-02-02 07:08:14 +00:00
4a5a2c6571 Update auto_functionalize schema (#118809)
- Moved the dictionary arguments to the node's kwargs as dicts are not
  valid inputs.
- Inlined the mutated arguments into the output. Originally, the output of
  auto_functionalize was the operator output and a list of mutated arguments
  (e.g. `[op_out1, op_out2, [mutated_arg1, mutated_arg2]]`). However, this is
  not easily exportable. Now, it will just be `[op_out1, op_out2, mutated_arg1,
  mutated_arg2]`.

Differential Revision: [D53331040](https://our.internmc.facebook.com/intern/diff/D53331040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118809
Approved by: https://github.com/zou3519
2024-02-02 06:21:43 +00:00
89b7ab671e Protect against modules without __file__ (#117445)
The __file__ special variable is optional so should be treated as such.

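A small sketch of the defensive pattern (names are illustrative):

```python
import types

mod = types.ModuleType("dynamically_created")   # e.g. a module with no backing file
path = getattr(mod, "__file__", None)           # __file__ is optional per the data model
if path is not None:
    print(path)
else:
    print("module has no __file__")
```
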
Fixes #117109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117445
Approved by: https://github.com/oulgen, https://github.com/yanboliang
2024-02-02 06:06:50 +00:00
3d8c36786b Add device for distributed examples (#118867)
## 🐛 Describe the bug

The following example (`all_reduce`) is missing the `device` allocation:
a205e7bf56/torch/distributed/distributed_c10d.py (L2080-L2087)

## Solution

A better example should be like this
a205e7bf56/torch/distributed/distributed_c10d.py (L3212-L3222)

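For reference, the improved example looks roughly like this (a sketch mirroring the linked docstring; run under a launcher such as torchrun so rank and world size are set):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

tensor = torch.arange(2, dtype=torch.int64, device=device) + 1 + 2 * rank
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)   # tensor now holds the sum across ranks
```
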
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118867
Approved by: https://github.com/soulitzer
2024-02-02 05:51:59 +00:00
da5cbb1269 [export] fix for duplicate constant lifting (#118776)
Summary:
Whenever we access a constant, we emit a `get_attr` node for it.

The `lift_constants_pass` was lifting every `get_attr` node unconditionally, even if the same target was already lifted. This diff fixes that.

I also took the liberty of adding some infra to make it easier to unit test passes. GraphBuilder lets you declaratively construct graphs with the right metadata, it's pretty useful for directly inducing the pattern you want to test against.

Test Plan: added unit test

Differential Revision: D53278161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118776
Approved by: https://github.com/angelayi, https://github.com/titaiwangms
2024-02-02 05:51:31 +00:00
32f48e917d [minimizer] Defined traverse (#118889)
Summary:
Add a defined traverse mode for the minimizer.
It takes user-provided start_idx and end_idx, forms a subgraph, and compares results from accelerators vs. CPU.

Differential Revision: D53318292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118889
Approved by: https://github.com/jfix71
2024-02-02 05:50:17 +00:00
3f1f057adf Remove parent device mesh check (#118620)
Removes the error raised when a device_mesh has a parent.

The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are:
- this check
- https://github.com/pytorch/pytorch/pull/118618
- a series of PRs related to checkpointing with 3D meshes that I will open
We currently monkeypatch around the above, which I am slowly upstreaming.

I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118620
Approved by: https://github.com/wz337, https://github.com/wanchaol
2024-02-02 05:29:49 +00:00
9cc6422ab6 Revert "[executorch hash update] update the pinned executorch hash (#118936)"
This reverts commit 8cc8cf75f31f7e430ab2918db4a2fb9c7b951024.

Reverted https://github.com/pytorch/pytorch/pull/118936 on behalf of https://github.com/suo due to conflicts with human change ([comment](https://github.com/pytorch/pytorch/pull/118936#issuecomment-1922824471))
2024-02-02 05:05:44 +00:00
8cc8cf75f3 [executorch hash update] update the pinned executorch hash (#118936)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118936
Approved by: https://github.com/pytorchbot
2024-02-02 04:10:53 +00:00
497ea17684 Limit reductions into pointwise cat fusion (#118452)
@Chillee observed a regression when fusing the following:
```
        def f(a, b):
            return torch.cat([torch.softmax(a, dim=-1), torch.softmax(b, dim=-1)])
```

This PR limits pointwise concat/masked fusion in this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118452
Approved by: https://github.com/jansel
2024-02-02 03:34:50 +00:00
babd6c776d [inductor] skip launching kernels with zero grid in AOTInductor when using backed symints (#118654)
Like #110312 but we also run this check when backed symints are in the grid (e.g. s1 / 512)

### Why?

Let's say we lower a model and generate a GPU kernel grid with symbolic shapes, e.g. `s1 / 512`. If we later run the lowered model with inputs such that `s1 = 0`, then we'll launch the kernel with a `0` sized grid. This surfaces as `CUDA driver error: invalid argument`.

To avoid this, we check for a `0` sized grid whenever there's symbolic shapes which includes backed and unbacked symints.

This adds non-zero overhead to the CPU. However, in return, we get better reliability when encountering this scenario. This scenario happened when serving an internal model.

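A toy sketch of the guard being added (this is not the generated wrapper code; it only shows the check on a symbolic-shaped grid dimension):

```python
def maybe_launch(kernel, grid, *args):
    # launching a 0-sized grid is a CUDA driver error, so skip the launch instead
    if any(dim == 0 for dim in grid):
        return
    kernel(*args)

s1 = 0                                                       # dynamic size that happens to be 0
maybe_launch(print, ((s1 + 511) // 512, 1, 1), "launched")   # prints nothing
```
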
### Test

```
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_unbacked_symbols
OK (skipped=3)

$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols

# Before
Error: CUDA driver error: invalid argument
FAILED (errors=2, skipped=3)

# Now
OK (skipped=3)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118654
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-02-02 03:19:52 +00:00
946ea47a4f [inductor] Fix an internal test issue (#118903)
Summary: test_add_complex4, introduced in https://github.com/pytorch/pytorch/pull/117929, fails internally because of a cpp compilation issue on CPU. Specify the right device in the test instead.

Differential Revision: [D53333919](https://our.internmc.facebook.com/intern/diff/D53333919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118903
Approved by: https://github.com/clee2000
2024-02-02 03:18:12 +00:00
8b729fb826 [ez] Fix CI log file piping error (#118807)
Fixes https://github.com/pytorch/pytorch/issues/118764

Example log https://github.com/pytorch/pytorch/actions/runs/7737363970/job/21097159160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118807
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere
2024-02-02 03:07:56 +00:00
d947b9d500 [optim] Rectify capturable testing and fix bugs! (#118326)
This PR fixes several bugs, listed in priority:
1. `load_state_dict` with a nontensor step was incorrect for capturable and fused implementations since we don't create the tensors on the right device in `__setstate__`. This has been fixed.
2. The most recently added capturable implementations forgot the check that all tensors should be on CUDA for eager. We've now added those checks
3. The most recent change in Adamax only adds capturable for foreach but will silently be incorrect for forloop/single-tensor. I've added erroring and modified testing with many many many skips for that. Honestly my preference after this PR has only been further cemented  that we should just do the single tensor and multi tensor capturable implementations together in the future. @mlazos
4. The conditional for adding cuda-supported configs for the optimizer infos was incorrect! So we hadn't been testing capturable! This also stands rectified and was the trigger for this PR in the first place.
5. In a similar way, the conditional for `_get_optim_inputs_including_global_cliquey_kwargs` was incorrect sometimes as well. This has also been corrected.

The following is not a bug, but is just something to make life simpler by not needing to handle Nones: `optim_input_funcs` must now mandatorily take in a `device`, which could be a string or a torch.device.

Details for posterity:
4. Running the test_foreach_matches_forloop test and printing the configs shows that capturable is now included, which is correct.
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5d50138f)]$ python test/test_optim.py -k test_foreach_matches_forloop_AdamW_cuda
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={}, desc=default
params=None, kwargs={'lr': 0.01}, desc=non-default lr
params=None, kwargs={'weight_decay': 0.1}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'maximize': True}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True}, desc=amsgrad
params=None, kwargs={'capturable': True}, desc=capturable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True}, desc=capturable, amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True}, desc=Tensor lr with capturable and amsgrad
.
----------------------------------------------------------------------
Ran 1 test in 19.229s

OK
```
5. Running the test_optimizer_can_be_printed test (which calls `_get_optim_inputs_including_global_cliquey_kwargs`) and printing what gets run shows it is also now correct.
```
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={'differentiable': False}, desc=default
params=None, kwargs={'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.1, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': True}, desc=amsgrad & differentiable
.params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable
params=None, kwargs={'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable & foreach
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable & differentiable
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable, amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable, amsgrad & fused
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad & foreach
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=Tensor lr with capturable and amsgrad & differentiable
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=Tensor lr with capturable and amsgrad & fused
.
----------------------------------------------------------------------
Ran 2 tests in 11.112s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118326
Approved by: https://github.com/mlazos
2024-02-02 02:02:58 +00:00
08472a4fd5 [dtensor] add op support for aten.gather.default (#118513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118513
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-02-02 01:48:21 +00:00
8ca8729321 [PT-Vulkan][EZ] Adjust string-report width (#118914)
## Before: P1148506541

Some of the shader names are now too long.
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4322188
vulkan.nchw_to_image     {500, 500, 1}                    4322240
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1189240
vulkan.zero              {1, 1, 1}                           3744
vulkan.convert_channels_to_width_packed{125, 500, 1}                    1265680
```

## After: P1148506671

Now it's just right; `convert_channels_to_height_packed` is the longest shader name in the codebase.
```
Kernel Name                             Workgroup Size             Duration (ns)
===========                             ==============               ===========
vulkan.nchw_to_image                    {500, 500, 1}                    4327232
vulkan.nchw_to_image                    {500, 500, 1}                    4327960
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1190540
vulkan.zero                             {1, 1, 1}                           3744
vulkan.convert_channels_to_width_packed {125, 500, 1}                    1287468
```

Differential Revision: [D53293924](https://our.internmc.facebook.com/intern/diff/D53293924/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118914
Approved by: https://github.com/liuk22
2024-02-02 01:43:48 +00:00
7e1ac59016 [pytorch][vulkan] add 1d tensor support for linear (#118690)
Summary: Vulkan Linear op doesn't support 1d tensors. We can unsqueeze 1d tensors to 2d to unblock the functionality.

Test Plan:
`LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*linear_*"`
```
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *linear_*
[==========] Running 11 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 11 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.linear_1d_small
[       OK ] VulkanAPITest.linear_1d_small (319 ms)
[ RUN      ] VulkanAPITest.linear_1d_large
[       OK ] VulkanAPITest.linear_1d_large (64 ms)
[ RUN      ] VulkanAPITest.linear_2d_flat
[       OK ] VulkanAPITest.linear_2d_flat (0 ms)
[ RUN      ] VulkanAPITest.linear_2d_small
[       OK ] VulkanAPITest.linear_2d_small (0 ms)
[ RUN      ] VulkanAPITest.linear_2d_large
[       OK ] VulkanAPITest.linear_2d_large (129 ms)
[ RUN      ] VulkanAPITest.linear_3d_flat
[       OK ] VulkanAPITest.linear_3d_flat (0 ms)
[ RUN      ] VulkanAPITest.linear_3d_small
[       OK ] VulkanAPITest.linear_3d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_3d_large
[       OK ] VulkanAPITest.linear_3d_large (51 ms)
[ RUN      ] VulkanAPITest.linear_4d_flat
[       OK ] VulkanAPITest.linear_4d_flat (0 ms)
[ RUN      ] VulkanAPITest.linear_4d_small
[       OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_large
[       OK ] VulkanAPITest.linear_4d_large (6 ms)
[----------] 11 tests from VulkanAPITest (578 ms total)

[----------] Global test environment tear-down
[==========] 11 tests from 1 test suite ran. (578 ms total)
[  PASSED  ] 11 tests.
```

Differential Revision: D53243201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118690
Approved by: https://github.com/jorgep31415, https://github.com/liuk22
2024-02-02 01:35:45 +00:00
796278b57e Revert "[inductor] make multi-kernel work with cpp-wrapper (#117813)"
This reverts commit 20484a193626ef72e0b3f35914f17deb2a89b8fc.

Reverted https://github.com/pytorch/pytorch/pull/117813 on behalf of https://github.com/atalman due to broke linux-focal-rocm5.7-py3.8 tests ([comment](https://github.com/pytorch/pytorch/pull/117813#issuecomment-1922613135))
2024-02-02 01:19:19 +00:00
9153174cd1 [pt-vulkan] Introduce SharedObject class to ComputeGraph (#118756)
## Context

This changeset is part of a stack that enables memory planning (i.e. sharing memory between intermediate tensors) in the PyTorch Vulkan Compute API. Note that Memory Planning can only be used via the ExecuTorch delegate (currently a WIP) and not Lite Interpreter (which does not collect metadata regarding tensor lifetimes).

This changeset builds upon the [previous PR enabling resource aliasing](https://github.com/pytorch/pytorch/pull/118436) and introduces the `SharedObject` class to `ComputeGraph`, which manages resource aliasing in graph execution mode. `SharedObject` tracks which `vTensor` values in a `ComputeGraph` share the same backing memory, and provides functionality to aggregate memory requirements and bind users to same memory allocation.

## Notes for Reviewers

The `SharedObject` class is introduced in `Graph.h`. It's fairly simple and provides three functions:

* `add_user()` which adds a `ValueRef` to the list of users of the `SharedObject`, and updates the aggregate memory requirements with the memory requirements of the new user
* `allocate_memory()` creates a `VmaAllocation` with the aggregated memory requirements
* `bind_users()` iterates over the `users` of the `SharedObject` and binds each `vTensor`'s underlying resource to the memory associated with the `SharedObject`.

As for how `SharedObject` is used in `ComputeGraph`:

* `add_tensor()` now has an additional argument `shared_object_idx` which, if `>0`, will construct a `vTensor` without any backing memory and add the new `vTensor` to the `SharedObject` at `shared_object_idx`
* `encode_execute()` will first iterate through the `SharedObject`s of the graph and allocate + bind users before recording the command buffer.

Differential Revision: [D53271486](https://our.internmc.facebook.com/intern/diff/D53271486/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118756
Approved by: https://github.com/jorgep31415, https://github.com/yipjustin
2024-02-02 01:19:00 +00:00
a5a63db3bf add Half support for flash attention on CPU (#118368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118368
Approved by: https://github.com/jgong5, https://github.com/Valentine233, https://github.com/drisspg
ghstack dependencies: #118367
2024-02-02 01:08:39 +00:00
838c1c553e Add back recompile test (#118905)
Adds back a test that was skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118905
Approved by: https://github.com/janeyx99
2024-02-02 00:51:01 +00:00
4b59bfe8e5 [CI] Filter should not fail if pr_body is empty (#118934)
Otherwise it will fail with `TypeError: argument of type 'NoneType' is not iterable` (see https://github.com/pytorch/pytorch/actions/runs/7748725174/job/21131915226 for example)

```
% gh api /repos/pytorch/pytorch/issues/118927|
{
  "url": "https://api.github.com/repos/pytorch/pytorch/issues/118927",
  ...
  "body": null,
  ...
  "state_reason": null
}
```

TODO: Can we add a test for it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118934
Approved by: https://github.com/clee2000, https://github.com/seemethere, https://github.com/huydhn
2024-02-02 00:49:20 +00:00
08d90a1ea9 Workaround for super() calls in test_torchinductor_dynamic_shapes (#118586)
Info about super in dynamic classes:
https://stackoverflow.com/questions/71879642/how-to-pass-function-with-super-when-creating-class-dynamically
https://stackoverflow.com/questions/43782944/super-does-not-work-together-with-type-supertype-obj-obj-must-be-an-i

Calling super(TestCase) actually calls TestCase's parent's functions, bypassing TestCase's own functions.

Mainly doing this because it's causing the disable bot to spam.

Test: checked locally and check that https://github.com/pytorch/pytorch/issues/117954 actually got skipped

Logs for `inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda`
https://ossci-raw-job-status.s3.amazonaws.com/log/21083466405
Afaik this PR doesn't actually cause the test to fail; it just surfaces the error, since the mem leak check wasn't running previously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118586
Approved by: https://github.com/huydhn
2024-02-02 00:40:37 +00:00
7c609f01ff [PT-Vulkan] aten::conv1d - support any batch size (#118834)
Completes `aten::conv1d` implementation.

See D53204673 for full context.

Differential Revision: [D53253625](https://our.internmc.facebook.com/intern/diff/D53253625/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118834
Approved by: https://github.com/yipjustin
ghstack dependencies: #118833
2024-02-01 23:53:00 +00:00
dc4779b010 Split out fake_impls from fake_tensor (#118878)
The motivation is that fake_tensor is marked as an uninteresting file for the purposes of backtraces, but operator implementations in fake tensor are interesting and I do want them reported.

How did I decide whether or not to move helper functions? It was somewhat arbitrary, but if they weren't generally used in fake tensor I moved them over.

There are no functional code changes, so you only need to review the import changes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118878
Approved by: https://github.com/eellison
2024-02-01 23:50:56 +00:00
844a76ebe8 [MPS][BE] Remove stale TODO (#118902)
And use convenient methods

TODO was added by an accidental copy-n-paste of code from https://github.com/pytorch/pytorch/pull/82315 into  https://github.com/pytorch/pytorch/pull/88532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118902
Approved by: https://github.com/kit1980
2024-02-01 23:43:23 +00:00
a16df1d85f [Dynamo] graph break on isinstance calls if we don't know the type (#118778)
If we can't figure out the python type of a VariableTracker, then the
isinstance call should graph break (instead of raising an error).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118778
Approved by: https://github.com/ydwu4
ghstack dependencies: #118768
2024-02-01 23:18:10 +00:00
39aab55c1c Add myself to CODEOWNERS for serialization-related files (#118892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118892
Approved by: https://github.com/albanD
2024-02-01 23:14:04 +00:00
46ef73505d Clarify how to get extra link flags when building CUDA/C++ extension (#118743)
Make it a bit more explicit how one passes linker arguments to the build, and point to the superclass documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118743
Approved by: https://github.com/ezyang
2024-02-01 22:35:25 +00:00
dbba1d4bf5 Revert "Some minor type stub improvements (#118529)"
This reverts commit c978f38bd4aedeff4ee9ae693349217daea01412.

Reverted https://github.com/pytorch/pytorch/pull/118529 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/118529#issuecomment-1922362331))
2024-02-01 22:18:36 +00:00
d4a94ad041 [ONNX] Fix upsample_bilinear2d decomp skip with output shape (#118823)
The previous output size missed the first two dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118823
Approved by: https://github.com/titaiwangms
2024-02-01 22:04:35 +00:00
6692f2c91e [no ci] Add myself to MPS codeowners (#118904)
I got pinged on every other PR anyway, so just a means to automate the process

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118904
Approved by: https://github.com/albanD
2024-02-01 21:52:15 +00:00
6929322a28 [PT-Vulkan] aten::conv1d - support any channel-group combo (#118833)
## Main

Part of completing `aten::conv1d`'s implementation. See D53204673 for full context.

This diff relaxes the constraint
```
c_in = c_out = groups
```
to support any legal combination of c_in, c_out, groups.

From the [PyTorch docs](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html), both c_in and c_out must be divisible by groups. Apart from that, any combo is now fair game.

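For reference, a legal combination now accepted looks like this (the snippet runs the regular CPU path and is only meant to illustrate the channel/group constraint):

```python
import torch

# c_in=6 and c_out=4 are both divisible by groups=2, so this combo is now supported
conv = torch.nn.Conv1d(in_channels=6, out_channels=4, kernel_size=3, groups=2)
x = torch.randn(1, 6, 10)
print(conv(x).shape)   # torch.Size([1, 4, 8])
```
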
## Additional

Improved GLSL comments and variable names, since more indices yield more headaches.

Differential Revision: [D53248767](https://our.internmc.facebook.com/intern/diff/D53248767/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118833
Approved by: https://github.com/yipjustin
2024-02-01 21:46:01 +00:00
61b572ed56 [inductor] more accurate throughput calculations for kernel benchmarks (#118858)
Our current throughput calculations for kernel benchmarks have some issues,
particularly when we slice inputs in the kernel. In such cases, we count
the original inputs as part of the memory traffic passed across the kernel.
This is incorrect because it may result in a much larger throughput
calculation, which can even exceed the theoretical bandwidth.

Instead, we should only count the size of the "slices" that contribute to
the actual memory traffic.

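A toy illustration of the inflation (not Inductor's benchmarking code; numbers are arbitrary):

```python
import torch

x = torch.randn(1_000_000)
y = x[:1_000] * 2.0                                   # the kernel only reads this slice

actual = (1_000 + y.numel()) * x.element_size()       # read the slice + write the output
naive = (x.numel() + y.numel()) * x.element_size()    # counting the full input as traffic
print(naive / actual)                                 # ~500x overstatement of the traffic
```
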
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118858
Approved by: https://github.com/jansel
2024-02-01 21:42:14 +00:00
20484a1936 [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work together initially.

Thanks Jason for suggesting a neat way to integrate these two. cpp-wrapper does two codegen passes right now. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. Multi-kernel Python code is not generated for the second pass since it should not be needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-01 21:29:02 +00:00
54668ad6dc Cleanup max cuda device (#118779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118779
Approved by: https://github.com/ezyang
2024-02-01 21:11:28 +00:00
f63dc9a21d s/DIRECLTY/DIRECTLY/ (#118877)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118877
Approved by: https://github.com/albanD
2024-02-01 20:25:58 +00:00
923a7c7572 add test elipsis to dynamo test functions (#118754)
Add tests to ensure the bug reported in #117563 does not regress.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118754
Approved by: https://github.com/anijain2305
2024-02-01 19:05:01 +00:00
318e6ff40e Fix __name__ on a reconstructed NestedUserFunctionVariable (#118768)
```
def f():
    def g():
        return ()

    print(g.__name__)

f()
```

The script above should print `g` (with or without torch.compile),
but prints `f.<locals>.g` with torch.compile.

The problem looks like we use the co_qualname when reconstructing the
NestedUserFunctionVariable. I switched this over to use the co_name.
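
For completeness, a sketch of how torch.compile is applied to the repro above:

```
import torch

@torch.compile
def f():
    def g():
        return ()

    print(g.__name__)

f()  # expected: "g"; before this fix the compiled version printed "f.<locals>.g"
```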

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118768
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-02-01 18:59:01 +00:00
b0e65dd1b4 Fix TCP Store Windows (#118860)
https://github.com/pytorch/pytorch/pull/107607 added a new Validate flow; however, on Windows it did not call addMiscellaneousSocket.
This change adds the missing call to addMiscellaneousSocket on Windows.

Fixes #118737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118860
Approved by: https://github.com/awgu, https://github.com/malfet
2024-02-01 18:46:18 +00:00
df048f4da4 Revert "[RELAND] Remove deprecated fbgemm operators (#112153)"
This reverts commit 19e8ba95e535cd73d3eb37849f383ca8bab58603.

Reverted https://github.com/pytorch/pytorch/pull/112153 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/112153#issuecomment-1921965780))
2024-02-01 18:35:19 +00:00
0f7e63620f CUDA fast path for split_with_sizes_copy.out (#117203)
### Motivation
In per-parameter sharding FSDP, each rank holds one shard of every parameter. Before a bucket of parameters is used, FSDP performs all-gather to reconstruct the full parameters. The following example demonstrates the process for `world_size=2`, `num_params=3` (`A`, `B`, `C` stand for values in params `A`, `B`, `C`):

All-gather output:
```
AAAABBBCCAAAABBBCC
```

After all-gather-copy-out:
```
AAAAAAAA  BBBBBB  CCCC
```

The performance of all-gather-copy-out is crucial for the viability of per-parameter sharding FSDP. After thorough experiments, we believe that acceptable performance for this op is not achievable via composing existing ATen ops today.

We have proven that ideal performance is achievable with a [custom kernel](https://github.com/pytorch/pytorch/pull/115515). This PR aims to incorporate the optimizations into appropriate ATen ops (as suggested by @albanD).

### all-gather-copy-out via Composing ATen Ops

Carrying out the op via composing ATen ops involves a combination of view ops and copy ops. After thorough experiments, we found that the most natural/performant way to express the op is via `split_with_sizes` + `_foreach_copy_`, which works as follows:

Reshape all-gather output as (world_size, -1):
```
AAAABBBCC
AAAABBBCC
```

`split_with_sizes` + `_foreach_copy_`:
```
AAAA BBB CC
AAAA BBB CC
```
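
A rough functional sketch of that baseline (hypothetical sizes; the real FSDP code tracks per-parameter numels and padding explicitly, and `torch._foreach_copy_` is a private op):

```
import torch

world_size = 2
shard_numels = [6, 2]          # padded per-rank numels for params A and B

all_gather_output = torch.randn(world_size * sum(shard_numels))

# Reshape as (world_size, -1) and split into O(num_params) per-parameter views
mat = all_gather_output.view(world_size, -1)
slices = list(mat.split(shard_numels, dim=1))

# Copy each slice into the corresponding unsharded parameter buffer
outs = [torch.empty(world_size, n) for n in shard_numels]
torch._foreach_copy_(outs, slices)
```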

However, the performance of this approach is still far below that of the custom kernel. We've identified the following reasons:
- The approach requires materializing `O(num_params)` intermediate views, which induces a large amount of CPU overhead when `num_params` is high.
- `_foreach_copy_` uses the same block size for all tensors, leading to waste for small tensors and insufficient thread count for large tensors. This means low effective occupancy.
- `_foreach_copy_` dispatches multiple kernels for typical problem sizes for all-gather-copy-out. This further lowers the effective occupancy.
- Due to the nature of the workload, the underlying copies are unaligned. `_foreach_copy_` isn't aggressive enough in exploiting vectorization opportunities in such workloads.

### PR
Introduces a CUDA backend for `split_with_sizes_copy.out` that addresses the above inefficiencies. See code for details.
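
Functionally the op is just the copying variant of `split_with_sizes`; a minimal example of the non-out form (I'm assuming the `.out` overload targeted by the fast path takes a preallocated list of destination tensors):

```
import torch

x = torch.arange(18.0).view(2, 9)                     # (world_size, padded shard numel)
chunks = torch.split_with_sizes_copy(x, [6, 3], dim=1)
print([c.shape for c in chunks])                      # [torch.Size([2, 6]), torch.Size([2, 3])]
```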

### Benchmarks
The benchmarks are conducted on a set of representative problem sizes on an A100. CPU overhead and GPU execution time are measured separately, as reasonable CPU overhead doesn't directly affect e2e throughput. The reported copy bandwidth is calculated with GPU execution time.

Compared to the baseline, we observe 3x-10x higher throughput depending on the problem size, as well as lower CPU overhead across the board.

Baseline:
```
num_params=150   world_size=8     mixed=True    Param size: 0.059 GB    Copy bandwidth: 67.564 GB/s (gpu ms/iter: 0.869, cpu ms/iter 10.460)
num_params=54    world_size=8     mixed=True    Param size: 1.453 GB    Copy bandwidth: 260.373 GB/s (gpu ms/iter: 5.582, cpu ms/iter 0.572)
num_params=54    world_size=8     mixed=True    Param size: 0.512 GB    Copy bandwidth: 239.585 GB/s (gpu ms/iter: 2.135, cpu ms/iter 0.587)
num_params=50    world_size=8     mixed=True    Param size: 0.200 GB    Copy bandwidth: 205.361 GB/s (gpu ms/iter: 0.976, cpu ms/iter 0.534)
num_params=3     world_size=8     mixed=True    Param size: 0.983 GB    Copy bandwidth: 268.397 GB/s (gpu ms/iter: 3.663, cpu ms/iter 0.084)
num_params=9     world_size=8     mixed=True    Param size: 0.802 GB    Copy bandwidth: 265.240 GB/s (gpu ms/iter: 3.024, cpu ms/iter 0.154)
num_params=3     world_size=8     mixed=True    Param size: 1.573 GB    Copy bandwidth: 268.918 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.087)
num_params=9     world_size=8     mixed=True    Param size: 2.248 GB    Copy bandwidth: 268.141 GB/s (gpu ms/iter: 8.384, cpu ms/iter 0.151)
num_params=150   world_size=128   mixed=True    Param size: 0.064 GB    Copy bandwidth: 73.237 GB/s (gpu ms/iter: 0.874, cpu ms/iter 10.664)
num_params=54    world_size=128   mixed=True    Param size: 1.458 GB    Copy bandwidth: 259.902 GB/s (gpu ms/iter: 5.609, cpu ms/iter 0.584)
num_params=54    world_size=128   mixed=True    Param size: 0.515 GB    Copy bandwidth: 238.703 GB/s (gpu ms/iter: 2.158, cpu ms/iter 0.612)
num_params=50    world_size=128   mixed=True    Param size: 0.203 GB    Copy bandwidth: 205.144 GB/s (gpu ms/iter: 0.987, cpu ms/iter 0.559)
num_params=3     world_size=128   mixed=True    Param size: 0.983 GB    Copy bandwidth: 270.467 GB/s (gpu ms/iter: 3.635, cpu ms/iter 0.073)
num_params=9     world_size=128   mixed=True    Param size: 0.802 GB    Copy bandwidth: 267.700 GB/s (gpu ms/iter: 2.997, cpu ms/iter 0.133)
num_params=3     world_size=128   mixed=True    Param size: 1.573 GB    Copy bandwidth: 268.913 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.093)
num_params=9     world_size=128   mixed=True    Param size: 2.248 GB    Copy bandwidth: 266.589 GB/s (gpu ms/iter: 8.433, cpu ms/iter 0.207)
num_params=150   world_size=1024  mixed=True    Param size: 0.202 GB    Copy bandwidth: 135.107 GB/s (gpu ms/iter: 1.495, cpu ms/iter 10.904)
num_params=54    world_size=1024  mixed=True    Param size: 1.524 GB    Copy bandwidth: 258.675 GB/s (gpu ms/iter: 5.890, cpu ms/iter 0.996)
num_params=54    world_size=1024  mixed=True    Param size: 0.575 GB    Copy bandwidth: 238.919 GB/s (gpu ms/iter: 2.408, cpu ms/iter 0.765)
num_params=50    world_size=1024  mixed=True    Param size: 0.246 GB    Copy bandwidth: 209.836 GB/s (gpu ms/iter: 1.172, cpu ms/iter 0.611)
num_params=3     world_size=1024  mixed=True    Param size: 1.007 GB    Copy bandwidth: 270.607 GB/s (gpu ms/iter: 3.720, cpu ms/iter 0.100)
num_params=9     world_size=1024  mixed=True    Param size: 0.818 GB    Copy bandwidth: 266.375 GB/s (gpu ms/iter: 3.071, cpu ms/iter 0.176)
num_params=3     world_size=1024  mixed=True    Param size: 1.611 GB    Copy bandwidth: 270.601 GB/s (gpu ms/iter: 5.952, cpu ms/iter 0.099)
num_params=9     world_size=1024  mixed=True    Param size: 2.248 GB    Copy bandwidth: 268.558 GB/s (gpu ms/iter: 8.371, cpu ms/iter 0.207)
num_params=150   world_size=8     mixed=False   Param size: 0.035 GB    Copy bandwidth: 43.749 GB/s (gpu ms/iter: 0.797, cpu ms/iter 10.531)
num_params=54    world_size=8     mixed=False   Param size: 0.961 GB    Copy bandwidth: 254.084 GB/s (gpu ms/iter: 3.781, cpu ms/iter 0.752)
num_params=54    world_size=8     mixed=False   Param size: 0.282 GB    Copy bandwidth: 216.792 GB/s (gpu ms/iter: 1.299, cpu ms/iter 0.717)
num_params=50    world_size=8     mixed=False   Param size: 0.149 GB    Copy bandwidth: 188.025 GB/s (gpu ms/iter: 0.793, cpu ms/iter 0.633)
num_params=3     world_size=8     mixed=False   Param size: 0.655 GB    Copy bandwidth: 267.793 GB/s (gpu ms/iter: 2.447, cpu ms/iter 0.107)
num_params=9     world_size=8     mixed=False   Param size: 0.634 GB    Copy bandwidth: 264.232 GB/s (gpu ms/iter: 2.401, cpu ms/iter 0.182)
num_params=3     world_size=8     mixed=False   Param size: 1.049 GB    Copy bandwidth: 268.455 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.089)
num_params=9     world_size=8     mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.633 GB/s (gpu ms/iter: 6.394, cpu ms/iter 0.177)
num_params=150   world_size=128   mixed=False   Param size: 0.038 GB    Copy bandwidth: 46.698 GB/s (gpu ms/iter: 0.807, cpu ms/iter 10.488)
num_params=54    world_size=128   mixed=False   Param size: 0.963 GB    Copy bandwidth: 253.450 GB/s (gpu ms/iter: 3.799, cpu ms/iter 0.655)
num_params=54    world_size=128   mixed=False   Param size: 0.283 GB    Copy bandwidth: 216.857 GB/s (gpu ms/iter: 1.307, cpu ms/iter 0.671)
num_params=50    world_size=128   mixed=False   Param size: 0.151 GB    Copy bandwidth: 189.059 GB/s (gpu ms/iter: 0.799, cpu ms/iter 0.572)
num_params=3     world_size=128   mixed=False   Param size: 0.655 GB    Copy bandwidth: 269.849 GB/s (gpu ms/iter: 2.429, cpu ms/iter 0.078)
num_params=9     world_size=128   mixed=False   Param size: 0.634 GB    Copy bandwidth: 264.501 GB/s (gpu ms/iter: 2.399, cpu ms/iter 0.149)
num_params=3     world_size=128   mixed=False   Param size: 1.049 GB    Copy bandwidth: 268.426 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.086)
num_params=9     world_size=128   mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.495 GB/s (gpu ms/iter: 6.398, cpu ms/iter 0.170)
num_params=150   world_size=1024  mixed=False   Param size: 0.122 GB    Copy bandwidth: 101.151 GB/s (gpu ms/iter: 1.211, cpu ms/iter 10.476)
num_params=54    world_size=1024  mixed=False   Param size: 1.000 GB    Copy bandwidth: 252.323 GB/s (gpu ms/iter: 3.963, cpu ms/iter 0.633)
num_params=54    world_size=1024  mixed=False   Param size: 0.318 GB    Copy bandwidth: 218.322 GB/s (gpu ms/iter: 1.455, cpu ms/iter 0.622)
num_params=50    world_size=1024  mixed=False   Param size: 0.185 GB    Copy bandwidth: 196.369 GB/s (gpu ms/iter: 0.944, cpu ms/iter 0.576)
num_params=3     world_size=1024  mixed=False   Param size: 0.671 GB    Copy bandwidth: 269.369 GB/s (gpu ms/iter: 2.491, cpu ms/iter 0.076)
num_params=9     world_size=1024  mixed=False   Param size: 0.645 GB    Copy bandwidth: 264.441 GB/s (gpu ms/iter: 2.439, cpu ms/iter 0.140)
num_params=3     world_size=1024  mixed=False   Param size: 1.074 GB    Copy bandwidth: 269.955 GB/s (gpu ms/iter: 3.978, cpu ms/iter 0.073)
num_params=9     world_size=1024  mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.168 GB/s (gpu ms/iter: 6.405, cpu ms/iter 0.147)
```
New kernel:
```
num_params=150   world_size=8     mixed=True    Param size: 0.059 GB    Copy bandwidth: 560.946 GB/s (gpu ms/iter: 0.105, cpu ms/iter 1.066)
num_params=54    world_size=8     mixed=True    Param size: 1.453 GB    Copy bandwidth: 732.657 GB/s (gpu ms/iter: 1.984, cpu ms/iter 0.417)
num_params=54    world_size=8     mixed=True    Param size: 0.512 GB    Copy bandwidth: 753.514 GB/s (gpu ms/iter: 0.679, cpu ms/iter 0.419)
num_params=50    world_size=8     mixed=True    Param size: 0.200 GB    Copy bandwidth: 719.400 GB/s (gpu ms/iter: 0.279, cpu ms/iter 0.410)
num_params=3     world_size=8     mixed=True    Param size: 0.983 GB    Copy bandwidth: 782.121 GB/s (gpu ms/iter: 1.257, cpu ms/iter 0.098)
num_params=9     world_size=8     mixed=True    Param size: 0.802 GB    Copy bandwidth: 766.458 GB/s (gpu ms/iter: 1.047, cpu ms/iter 0.134)
num_params=3     world_size=8     mixed=True    Param size: 1.573 GB    Copy bandwidth: 790.611 GB/s (gpu ms/iter: 1.989, cpu ms/iter 0.099)
num_params=9     world_size=8     mixed=True    Param size: 2.248 GB    Copy bandwidth: 789.754 GB/s (gpu ms/iter: 2.847, cpu ms/iter 0.138)
num_params=150   world_size=128   mixed=True    Param size: 0.064 GB    Copy bandwidth: 565.667 GB/s (gpu ms/iter: 0.113, cpu ms/iter 0.996)
num_params=54    world_size=128   mixed=True    Param size: 1.458 GB    Copy bandwidth: 670.681 GB/s (gpu ms/iter: 2.174, cpu ms/iter 0.289)
num_params=54    world_size=128   mixed=True    Param size: 0.515 GB    Copy bandwidth: 676.135 GB/s (gpu ms/iter: 0.762, cpu ms/iter 0.264)
num_params=50    world_size=128   mixed=True    Param size: 0.203 GB    Copy bandwidth: 662.603 GB/s (gpu ms/iter: 0.306, cpu ms/iter 0.249)
num_params=3     world_size=128   mixed=True    Param size: 0.983 GB    Copy bandwidth: 769.283 GB/s (gpu ms/iter: 1.278, cpu ms/iter 0.078)
num_params=9     world_size=128   mixed=True    Param size: 0.802 GB    Copy bandwidth: 761.057 GB/s (gpu ms/iter: 1.054, cpu ms/iter 0.104)
num_params=3     world_size=128   mixed=True    Param size: 1.573 GB    Copy bandwidth: 774.325 GB/s (gpu ms/iter: 2.031, cpu ms/iter 0.075)
num_params=9     world_size=128   mixed=True    Param size: 2.248 GB    Copy bandwidth: 773.048 GB/s (gpu ms/iter: 2.908, cpu ms/iter 0.099)
num_params=150   world_size=1024  mixed=True    Param size: 0.202 GB    Copy bandwidth: 641.405 GB/s (gpu ms/iter: 0.315, cpu ms/iter 0.616)
num_params=54    world_size=1024  mixed=True    Param size: 1.524 GB    Copy bandwidth: 646.772 GB/s (gpu ms/iter: 2.356, cpu ms/iter 0.276)
num_params=54    world_size=1024  mixed=True    Param size: 0.575 GB    Copy bandwidth: 658.157 GB/s (gpu ms/iter: 0.874, cpu ms/iter 0.278)
num_params=50    world_size=1024  mixed=True    Param size: 0.246 GB    Copy bandwidth: 642.032 GB/s (gpu ms/iter: 0.383, cpu ms/iter 0.245)
num_params=3     world_size=1024  mixed=True    Param size: 1.007 GB    Copy bandwidth: 728.990 GB/s (gpu ms/iter: 1.381, cpu ms/iter 0.080)
num_params=9     world_size=1024  mixed=True    Param size: 0.818 GB    Copy bandwidth: 689.763 GB/s (gpu ms/iter: 1.186, cpu ms/iter 0.102)
num_params=3     world_size=1024  mixed=True    Param size: 1.611 GB    Copy bandwidth: 765.507 GB/s (gpu ms/iter: 2.104, cpu ms/iter 0.078)
num_params=9     world_size=1024  mixed=True    Param size: 2.248 GB    Copy bandwidth: 757.626 GB/s (gpu ms/iter: 2.967, cpu ms/iter 0.106)
num_params=150   world_size=8     mixed=False   Param size: 0.035 GB    Copy bandwidth: 584.272 GB/s (gpu ms/iter: 0.060, cpu ms/iter 0.656)
num_params=54    world_size=8     mixed=False   Param size: 0.961 GB    Copy bandwidth: 728.234 GB/s (gpu ms/iter: 1.319, cpu ms/iter 0.264)
num_params=54    world_size=8     mixed=False   Param size: 0.282 GB    Copy bandwidth: 730.059 GB/s (gpu ms/iter: 0.386, cpu ms/iter 0.279)
num_params=50    world_size=8     mixed=False   Param size: 0.149 GB    Copy bandwidth: 670.899 GB/s (gpu ms/iter: 0.222, cpu ms/iter 0.274)
num_params=3     world_size=8     mixed=False   Param size: 0.655 GB    Copy bandwidth: 775.699 GB/s (gpu ms/iter: 0.845, cpu ms/iter 0.077)
num_params=9     world_size=8     mixed=False   Param size: 0.634 GB    Copy bandwidth: 773.612 GB/s (gpu ms/iter: 0.820, cpu ms/iter 0.112)
num_params=3     world_size=8     mixed=False   Param size: 1.049 GB    Copy bandwidth: 781.395 GB/s (gpu ms/iter: 1.342, cpu ms/iter 0.081)
num_params=9     world_size=8     mixed=False   Param size: 1.711 GB    Copy bandwidth: 789.156 GB/s (gpu ms/iter: 2.169, cpu ms/iter 0.116)
num_params=150   world_size=128   mixed=False   Param size: 0.038 GB    Copy bandwidth: 517.056 GB/s (gpu ms/iter: 0.073, cpu ms/iter 0.632)
num_params=54    world_size=128   mixed=False   Param size: 0.963 GB    Copy bandwidth: 684.246 GB/s (gpu ms/iter: 1.407, cpu ms/iter 0.294)
num_params=54    world_size=128   mixed=False   Param size: 0.283 GB    Copy bandwidth: 680.593 GB/s (gpu ms/iter: 0.416, cpu ms/iter 0.286)
num_params=50    world_size=128   mixed=False   Param size: 0.151 GB    Copy bandwidth: 682.197 GB/s (gpu ms/iter: 0.221, cpu ms/iter 0.255)
num_params=3     world_size=128   mixed=False   Param size: 0.655 GB    Copy bandwidth: 759.470 GB/s (gpu ms/iter: 0.863, cpu ms/iter 0.074)
num_params=9     world_size=128   mixed=False   Param size: 0.634 GB    Copy bandwidth: 765.694 GB/s (gpu ms/iter: 0.829, cpu ms/iter 0.094)
num_params=3     world_size=128   mixed=False   Param size: 1.049 GB    Copy bandwidth: 766.535 GB/s (gpu ms/iter: 1.368, cpu ms/iter 0.075)
num_params=9     world_size=128   mixed=False   Param size: 1.711 GB    Copy bandwidth: 787.608 GB/s (gpu ms/iter: 2.173, cpu ms/iter 0.105)
num_params=150   world_size=1024  mixed=False   Param size: 0.122 GB    Copy bandwidth: 640.203 GB/s (gpu ms/iter: 0.191, cpu ms/iter 0.668)
num_params=54    world_size=1024  mixed=False   Param size: 1.000 GB    Copy bandwidth: 713.947 GB/s (gpu ms/iter: 1.401, cpu ms/iter 0.274)
num_params=54    world_size=1024  mixed=False   Param size: 0.318 GB    Copy bandwidth: 642.855 GB/s (gpu ms/iter: 0.494, cpu ms/iter 0.276)
num_params=50    world_size=1024  mixed=False   Param size: 0.185 GB    Copy bandwidth: 643.297 GB/s (gpu ms/iter: 0.288, cpu ms/iter 0.262)
num_params=3     world_size=1024  mixed=False   Param size: 0.671 GB    Copy bandwidth: 690.626 GB/s (gpu ms/iter: 0.972, cpu ms/iter 0.078)
num_params=9     world_size=1024  mixed=False   Param size: 0.645 GB    Copy bandwidth: 754.431 GB/s (gpu ms/iter: 0.855, cpu ms/iter 0.109)
num_params=3     world_size=1024  mixed=False   Param size: 1.074 GB    Copy bandwidth: 769.985 GB/s (gpu ms/iter: 1.395, cpu ms/iter 0.080)
num_params=9     world_size=1024  mixed=False   Param size: 1.711 GB    Copy bandwidth: 766.337 GB/s (gpu ms/iter: 2.233, cpu ms/iter 0.103)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117203
Approved by: https://github.com/albanD, https://github.com/awgu
ghstack dependencies: #118512
2024-02-01 18:23:01 +00:00
68f9c28e00 Don't make default arguments dynamic (#118772)
Noticed this while working on
https://github.com/pytorch/pytorch/issues/114590

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118772
Approved by: https://github.com/anijain2305
2024-02-01 18:11:57 +00:00
24dd9f42ce [MPS] Fix use_metal_mm condition (#118830)
One should not only look at the strides, but also at the dimensions, since the strides of `torch.rand(65536, 1)` are `(1, 1)`
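
For reference:

```
import torch

x = torch.rand(65536, 1)
print(x.stride())   # (1, 1) -- indistinguishable from a 1-D vector by strides alone
```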

Extend test to account for this situation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118830
Approved by: https://github.com/huydhn
2024-02-01 17:53:42 +00:00
3e79ef6db8 Complete decomposition for aten.round (#118635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118635
Approved by: https://github.com/peterbell10
2024-02-01 17:14:44 +00:00
0010b6145e Reduce register usage of fused adam(w) (#118361)
Part of #117872

| branch | cpu time avg (ms) | cuda time avg (ms) |
|--------|--------------|---------------|
| [main](eebe7e1d37f1baa995c694d540cc2fc98884fa18) | 13.430 | 144.117 |
| pr                                               | 13.371 | 49.655  |

Used torch profiler to measure the avg perf of 20 iterations.
Model is openlm-research/open_llama_7b_v2 (script is [here](https://gist.github.com/crcrpar/ca951d4e7f3e1c771d502135b798f0d1)).
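
A sketch of the measurement setup (with a stand-in `Linear` model here; the actual script is linked above):

```
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(1024, 1024, device="cuda", dtype=torch.bfloat16)
batch = torch.randn(64, 1024, device="cuda", dtype=torch.bfloat16)
opt = torch.optim.AdamW(model.parameters(), fused=True)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        model(batch).float().mean().backward()
        opt.step()
        opt.zero_grad()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```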

---

PR
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          ProfilerStep*         0.00%       0.000us         0.00%       0.000us       0.000us        5.789s        46.42%        5.789s     289.456ms           0 b           0 b           0 b           0 b            20
                                          ProfilerStep*        36.02%        3.119s        67.19%        5.819s     290.958ms       0.000us         0.00%        2.586s     129.276ms      48.00 Kb      -1.47 Mb           0 b    -504.23 Gb            20
                                               aten::mm         2.57%     222.681ms         8.80%     762.415ms      56.475us        2.501s        20.05%        2.501s     185.255us           0 b           0 b     441.39 Gb     441.39 Gb         13500
       autograd::engine::evaluate_function: MmBackward0         0.10%       8.600ms         8.17%     707.935ms     157.319us       0.000us         0.00%        1.625s     361.098us           0 b           0 b     198.65 Gb    -135.03 Gb          4500
                                            MmBackward0         0.39%      33.896ms         7.99%     692.035ms     153.786us       0.000us         0.00%        1.601s     355.710us           0 b           0 b     330.84 Gb    -248.00 Mb          4500
                              Optimizer.step#AdamW.step         0.00%       0.000us         0.00%       0.000us       0.000us        1.007s         8.07%        1.007s      50.329ms           0 b           0 b           0 b           0 b            20
                                             AdamW.step         0.01%     837.000us         3.36%     290.610ms      14.530ms       0.000us         0.00%     993.235ms      49.662ms           0 b           0 b           0 b           0 b            20
                              Optimizer.step#AdamW.step         0.22%      18.825ms         3.35%     289.773ms      14.489ms       0.000us         0.00%     993.235ms      49.662ms           0 b           0 b           0 b           0 b            20
                                    aten::_fused_adamw_         0.12%      10.823ms         3.09%     267.428ms      13.371ms     993.095ms         7.96%     993.095ms      49.655ms           0 b           0 b           0 b           0 b            20
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us     993.095ms         7.96%     993.095ms     154.207us           0 b           0 b           0 b           0 b          6440
                                           aten::matmul         0.19%      16.140ms         1.73%     149.869ms      33.304us       0.000us         0.00%     876.000ms     194.667us           0 b           0 b     107.46 Gb           0 b          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nt_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     835.374ms         6.70%     835.374ms     185.639us           0 b           0 b           0 b           0 b          4500
                                           aten::linear         0.27%      23.268ms         1.97%     170.227ms      37.828us       0.000us         0.00%     776.278ms     172.506us           0 b           0 b     107.46 Gb      12.17 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     707.074ms         5.67%     707.074ms     183.180us           0 b           0 b           0 b           0 b          3860
                                              aten::mul         1.31%     113.614ms         5.14%     445.405ms      22.125us     552.421ms         4.43%     552.780ms      27.459us     256.32 Kb     256.21 Kb     420.38 Gb     419.88 Gb         20131
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     442.209ms         3.55%     442.209ms     138.190us           0 b           0 b           0 b           0 b          3200
      autograd::engine::evaluate_function: MulBackward0         0.25%      21.336ms         5.00%     432.976ms      74.651us       0.000us         0.00%     398.627ms      68.729us           0 b           0 b     -45.71 Gb    -252.76 Gb          5800
                                             aten::add_         0.37%      31.975ms         7.19%     622.433ms      53.658us     391.957ms         3.14%     391.957ms      33.789us           0 b           0 b      -4.35 Gb      -4.35 Gb         11600
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize256...         0.00%       0.000us         0.00%       0.000us       0.000us     345.037ms         2.77%     345.037ms     265.413us           0 b           0 b           0 b           0 b          1300
                                            aten::copy_         0.41%      35.727ms        20.62%        1.786s     146.503us     342.386ms         2.75%     342.386ms      28.092us      48.00 Kb      48.00 Kb     -56.00 Mb     -56.00 Mb         12188
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 8.661s
Self CUDA time total: 12.472s
```

main
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          ProfilerStep*         0.00%       0.000us         0.00%       0.000us       0.000us        7.671s        42.31%        7.671s     383.529ms           0 b           0 b           0 b           0 b            20
                                          ProfilerStep*        28.85%        3.050s        72.83%        7.700s     385.009ms       0.000us         0.00%        4.474s     223.678ms      48.00 Kb      -1.48 Mb           0 b    -504.45 Gb            20
                              Optimizer.step#AdamW.step         0.00%       0.000us         0.00%       0.000us       0.000us        2.896s        15.97%        2.896s     144.787ms           0 b           0 b           0 b           0 b            20
                                             AdamW.step         0.01%     819.000us         2.75%     291.024ms      14.551ms       0.000us         0.00%        2.882s     144.125ms           0 b           0 b           0 b           0 b            20
                              Optimizer.step#AdamW.step         0.17%      18.291ms         2.74%     290.205ms      14.510ms       0.000us         0.00%        2.882s     144.125ms           0 b           0 b           0 b           0 b            20
                                    aten::_fused_adamw_         0.10%      10.893ms         2.54%     268.602ms      13.430ms        2.882s        15.90%        2.882s     144.117ms           0 b           0 b           0 b           0 b            20
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us        2.882s        15.90%        2.882s     447.570us           0 b           0 b           0 b           0 b          6440
                                               aten::mm         2.05%     217.136ms         7.21%     762.211ms      56.460us        2.499s        13.78%        2.499s     185.075us           0 b           0 b     441.37 Gb     441.37 Gb         13500
       autograd::engine::evaluate_function: MmBackward0         0.07%       7.179ms         6.77%     715.673ms     159.038us       0.000us         0.00%        1.624s     360.812us           0 b           0 b     198.65 Gb    -134.64 Gb          4500
                                            MmBackward0         0.32%      34.257ms         6.62%     700.088ms     155.575us       0.000us         0.00%        1.600s     355.460us           0 b           0 b     330.59 Gb    -628.00 Mb          4500
                                           aten::matmul         0.15%      15.892ms         1.32%     139.597ms      31.022us       0.000us         0.00%     874.861ms     194.414us           0 b           0 b     107.46 Gb           0 b          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nt_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     834.631ms         4.60%     834.631ms     185.474us           0 b           0 b           0 b           0 b          4500
                                           aten::linear         0.21%      22.460ms         1.51%     159.620ms      35.471us       0.000us         0.00%     774.772ms     172.172us           0 b           0 b     107.46 Gb      11.88 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     705.996ms         3.89%     705.996ms     182.901us           0 b           0 b           0 b           0 b          3860
                                              aten::mul         1.06%     112.529ms         4.28%     452.473ms      22.488us     552.242ms         3.05%     552.266ms      27.447us     255.90 Kb     255.88 Kb     413.93 Gb     413.90 Gb         20121
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     441.514ms         2.44%     441.514ms     137.973us           0 b           0 b           0 b           0 b          3200
      autograd::engine::evaluate_function: MulBackward0         0.19%      20.517ms         4.18%     442.189ms      76.239us       0.000us         0.00%     398.552ms      68.716us           0 b           0 b     -45.57 Gb    -251.17 Gb          5800
                                             aten::add_         0.30%      31.703ms         6.01%     635.030ms      54.744us     391.897ms         2.16%     391.897ms      33.784us           0 b           0 b      -5.71 Gb      -5.71 Gb         11600
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize256...         0.00%       0.000us         0.00%       0.000us       0.000us     344.972ms         1.90%     344.972ms     265.363us           0 b           0 b           0 b           0 b          1300
                                            aten::copy_         0.33%      34.415ms        34.75%        3.674s     301.437us     342.661ms         1.89%     342.661ms      28.115us      80.00 Kb      80.00 Kb    -240.00 Mb    -240.00 Mb         12188
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 10.574s
Self CUDA time total: 18.129s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118361
Approved by: https://github.com/janeyx99
2024-02-01 17:04:10 +00:00
b73a2b7795 [ait] inspect get_attr nodes for _decline_if_input_dtype (#118760)
Summary:
Previously, get_attr nodes were skipped, but consider for example:

%mul_240 : [num_users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.mul](args = (), kwargs = {input: %_fx_const_folded_attrs_13, other: %add_143})

where %_fx_const_folded_attrs_13 is int64 but %add_143 is float, which causes issues if the node is skipped, e.g. "unsupported dtype='int64' for alignments"

Differential Revision: D53273467

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118760
Approved by: https://github.com/khabinov
2024-02-01 15:56:15 +00:00
ff9ce94489 Create empty host tensor for privateuseone (#118854)
For the H2D copy of local_used_map_ on the privateuseone device, reuse the CUDA logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118854
Approved by: https://github.com/ezyang
2024-02-01 15:32:55 +00:00
d790c1dca6 [CUDA][cuDNN][TF32] Misc TF32 updates (#118781)
Twiddle some thresholds that don't seem to play nice with sm90.

CC @tinglvv @nWEIdia @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118781
Approved by: https://github.com/ezyang
2024-02-01 15:32:50 +00:00
687946eea1 [FSDP2] Added reduce-scatter (#117975)
This PR adds the FSDP reduce-scatter (the copy-in/reduce-scatter collective/view-out).
- We use gradient pre- and post-divide factors like existing FSDP (mainly for fp16 reduction).
- We use a separate CUDA stream for the reduce-scatter to conveniently handle additional kernels surrounding the collective as a separate 'thread of execution' (e.g. pre/post-divide and later the D2H gradient offload).
- ~~The implementation in this PR is more complicated to _try_ to reduce CPU overhead by using `torch.split` instead of a Python for-loop. The challenge comes from the fact that the autograd-computed unsharded gradients do not have padding. We prefer to not do an intermediate padding step and instead directly copying to the big reduce-scatter input.~~ For simplicity, I changed the implementation to include intermediate padding steps, as it can still achieve ~250 GB/s, and it avoids any `O(NP)` tensor materialization for world size `N` and `P` `nn.Parameter`s.

<details>
<summary> Recall: Copy-in/All-Gather/Copy-Out Example </summary>

Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB

Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB

Each rank allocate group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB

Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB

Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>

<details>
<summary> Copy-in/Reduce-Scatter/View-Out Example </summary>

Suppose we have 2 gradients with shapes `(3, 3)` (denoted with `a`s when not-yet-reduced and `A`s after reduced) and `(2, 2)` (denoted with `b`s and `B`s similarly) and 2 ranks, where `E` represents empty:
```
Given from autograd:
(3, 3): aaaaaaaaa
(2, 2): bbbb

Unsharded gradients/reduce-scatter inputs (no padding!):
Rank 0: aaaaaaaaa, bbbb
Rank 1: aaaaaaaaa, bbbb

Each rank allocate group's reduce-scatter input:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: aaaaaabbaaaEEEbb
Rank 1: aaaaaabbaaaEEEbb

Each rank reduce-scatter:
Rank 0: AAAAAABBAAAEEEBB
Rank 1: AAAAAABBAAAEEEBB

Each rank view-out:
Rank 0: AAAAAA, BB
Rank 1: AAA, BB
```
</details>
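
A very rough sketch of the copy-in/reduce-scatter/view-out flow (hypothetical helper; assumes an initialized NCCL process group, and omits the separate stream and per-parameter reshaping the real implementation does):

```
import torch
import torch.distributed as dist

def reduce_scatter_grads(padded_unsharded_grads, group, predivide=1.0, postdivide=1.0):
    """Each grad must have numel divisible by the world size (i.e. already padded)."""
    world_size = dist.get_world_size(group)
    # Copy-in: lay out rank r's slice of every gradient contiguously in row r
    rs_input = torch.cat(
        [g.reshape(world_size, -1) for g in padded_unsharded_grads], dim=1
    ).reshape(-1)
    if predivide != 1.0:
        rs_input.div_(predivide)
    rs_output = rs_input.new_empty(rs_input.numel() // world_size)
    dist.reduce_scatter_tensor(rs_output, rs_input, op=dist.ReduceOp.SUM, group=group)
    if postdivide != 1.0:
        rs_output.div_(postdivide)
    # View-out: slice the sharded result back into per-parameter gradient shards
    shard_numels = [g.numel() // world_size for g in padded_unsharded_grads]
    return list(rs_output.split(shard_numels))
```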

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117975
Approved by: https://github.com/weifengpy, https://github.com/yifuwang
ghstack dependencies: #117950, #117955, #117973
2024-02-01 15:21:37 +00:00
9c2b43cc50 [inductor] Handle special values correctly in ir.Scan codegen (#118788)
Special values (`NaN`/`+/-Inf`) are not handled correctly during codegen for `ir.Scan` nodes. This
is a fairly minor bugfix that has not come up before, since the only two scan
ops with lowerings use "normal" values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118788
Approved by: https://github.com/peterbell10
2024-02-01 14:54:20 +00:00
221747507d Revert "[export] support non-persistent buffers (#118612) (#118722)"
This reverts commit a43c28368c184ba1bf964f4fb99bec300917e2f4.

Reverted https://github.com/pytorch/pytorch/pull/118722 on behalf of https://github.com/atalman due to broke linux-jammy-py3-clang12-executorch ([comment](https://github.com/pytorch/pytorch/pull/118722#issuecomment-1921484565))
2024-02-01 14:39:29 +00:00
4a5a3bcc89 Revert "fused adam(w): Reduce register usage (#117872)"
This reverts commit b8e71cf3022e701604ea1f0c381c0b9ccf8743be.

Reverted https://github.com/pytorch/pytorch/pull/117872 on behalf of https://github.com/janeyx99 due to This was not intended to be merged ([comment](https://github.com/pytorch/pytorch/pull/117872#issuecomment-1921425677))
2024-02-01 14:15:00 +00:00
a1dd367716 Fixed error in bicubic upsampling aa=false for uint8 input (#118389)
Description:
- Fixed error in bicubic upsampling aa=false for uint8 input. This is seen in the test suite:
```diff
- self.assertLess(diff.max(), 15)
+ self.assertLess(diff.max(), 5)
```
While reducing the input range, we do not fully remove the clipping effect; that's why the threshold is 5 and not around 1.

- Renamed methods
- The error is mostly visible for upsampling (smaller -> larger) mode on the boundary values

More details on the bug:
For uint8 input and antialiasing=False we use a separable algorithm (using temp buffers and interpolating dimensions one by one) where interpolation weights and input indices are computed and stored using index ranges: `index_min` and `index_size`; weights outside of `index_size` are zeros. For example, for an output point we can have index_min=10, index_size=4 and 4 non-zero weights, so the output value is computed as
```
out_value = sum([src[i + index_min] * w for i, w in zip(range(4), weights) ])
```
When computing index ranges and weights for output points near the boundaries, we clamp `index_min` between 0 and input_size, and `index_size` becomes smaller than 4. This approach is OK for antialiasing=True but not for antialiasing=False, where the weights end up computed incorrectly:
```
-- output index i= 0
regular float32 approach:
source indices: [-2, -1, 0, 1] -> outbounded values are clamped to boundaries -> [0, 0, 0, 1]
interp weights: [-0.07200000000000006, 0.4600000000000001, 0.72, -0.1080000000000001]

separable uint8 approach:
source indices coming from index ranges (min, size): [0, 1]
incorrect interp weights computed with current implementation : [1.1764705882352944, -0.17647058823529432, 0.0, 0.0]
fixed interp weights in the PR: [1.108, -0.1080000000000001, 0.0, 0.0]
Note: weight value corresponding to source index 0 is 1.108 = -0.07200000000000006 + 0.4600000000000001 + 0.72 and weight value corresponding to source index 1 is -0.1080000000000001 is the same as in f32 approach.
```
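
A small way to reproduce the kind of check the test suite does (sketch; the uint8 result is compared against a float32 reference):

```
import torch
import torch.nn.functional as F

x = torch.randint(0, 256, (1, 3, 400, 400), dtype=torch.uint8)
out_u8 = F.interpolate(x, size=(700, 700), mode="bicubic", antialias=False)
ref = F.interpolate(x.float(), size=(700, 700), mode="bicubic", antialias=False)
ref = ref.round().clamp(0, 255).to(torch.uint8)

diff = (out_u8.int() - ref.int()).abs()
print(diff.max())   # expected < 5 with this fix; the residual comes from clipping
```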

Quick benchmark to ensure there is no perf regression:

```
[------------------------------------------------------------------------------------ Resize ------------------------------------------------------------------------------------]
                                                                               |  torch (2.3.0a0+gitfda85a6) PR  |  torch (2.3.0a0+git0d1e705) Nightly  |  Speed-up: PR vs Nightly
1 threads: -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_first bilinear (400, 400) -> (224, 224) aa=False  |        440.996 (+-2.044)        |          470.824 (+-5.927)           |      1.068 (+-0.000)
      3 torch.uint8 channels_first bicubic (400, 400) -> (224, 224) aa=False   |        463.565 (+-1.519)        |          497.231 (+-10.825)          |      1.073 (+-0.000)
      3 torch.uint8 channels_first bilinear (400, 400) -> (700, 700) aa=False  |       1717.000 (+-28.589)       |         1915.570 (+-43.397)          |      1.116 (+-0.000)
      3 torch.uint8 channels_first bicubic (400, 400) -> (700, 700) aa=False   |       1801.954 (+-22.391)       |         1981.501 (+-37.034)          |      1.100 (+-0.000)
      3 torch.uint8 channels_last bilinear (400, 400) -> (224, 224) aa=False   |        199.599 (+-0.851)        |          196.535 (+-3.788)           |      0.985 (+-0.000)
      3 torch.uint8 channels_last bicubic (400, 400) -> (224, 224) aa=False    |        243.126 (+-0.681)        |          240.695 (+-2.306)           |      0.990 (+-0.000)
      3 torch.uint8 channels_last bilinear (400, 400) -> (700, 700) aa=False   |        686.270 (+-2.870)        |          687.769 (+-17.863)          |      1.002 (+-0.000)
      3 torch.uint8 channels_last bicubic (400, 400) -> (700, 700) aa=False    |        899.509 (+-5.377)        |          899.063 (+-9.001)           |      1.000 (+-0.000)

Times are in microseconds (us).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118389
Approved by: https://github.com/NicolasHug
ghstack dependencies: #118388
2024-02-01 14:14:32 +00:00
8b140da804 Use MKL_INT in MKL wrapper interfaces (#118734)
I encountered this error when building PyTorch with MKL on Windows:

```
pytorch\aten\src\ATen\native\mkl\LinearAlgebra.cpp(74): error C2664: "void cblas_sgemm_batch(const CBLAS_LAYOUT,const CBLAS_TRANSPOSE *,const CBLAS_TRANSPOSE *,const __int64 *,const __int64 *,const __int64 *,const float *,const float **,const __int64 *,const float **,const __int64 *,const float *,float **,const __int64 *,const __int64,const __int64 *) noexcept": cannot convert argument 4 from "const int *" to "const __int64 *"
pytorch\aten\src\ATen\native\mkl\LinearAlgebra.cpp(74): note: types pointed to are unrelated; conversion requires reinterpret_cast, C-style cast or parenthesized function-style cast
C:\Program Files (x86)\Intel\oneAPI\2024.0\include\mkl_cblas.h(550): note: see declaration of 'cblas_sgemm_batch'
```
This is because MKL_INT is defined as int64_t in this configuration, while the wrapper interfaces used plain int. This PR changes the wrapper interfaces to use MKL_INT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118734
Approved by: https://github.com/ezyang
2024-02-01 13:32:28 +00:00
a205e7bf56 [3/4] Intel GPU Runtime Upstreaming for Device (#116850)
# Motivation
Following [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), and as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR covers the changes under `libtorch_python`.

# Design
This PR primarily offers device-related APIs in the Python frontend (see the usage sketch after the list), including
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`
- ====================
- `torch.xpu._DeviceGuard`
- `torch.xpu._is_compiled`
- `torch.xpu._get_device`
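
A short usage sketch of the public APIs above (requires a build with XPU support):

```
import torch

if torch.xpu.is_available():
    print(torch.xpu.device_count())
    print(torch.xpu.get_device_name(0))
    print(torch.xpu.get_device_capability(0))
    torch.xpu.set_device(0)
    with torch.xpu.device(0):
        print(torch.xpu.current_device())
```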

# Additional Context
We will implement the support of lazy initialization in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-02-01 12:31:26 +00:00
eaa45f47f8 [sigmoid] fix for torchbind serialization (#118791)
Summary:
There is an annoying inconsistency in how we pickle custom objs.
`torch.save` will invoke regular pickle, for which we have bound `__setstate__`/`__getstate__` methods on `torch.ScriptObject`: https://fburl.com/code/4howyl4u.

This serializes in a different format than TorchScript does, which uses the TS C++ pickler.

The issue we were facing was using the Python pickler to save, and the C++ pickler to load. If we use the C++ pickler to both save and load (plus some plumbing to get type/object resolution to work correctly), then things should work.

Test Plan:
ran SherlockNoMad's repro
```
buck2 run 'fbcode//mode/dev-nosan' scripts/bahuang:export_torchbind -- --logging DBG
```

Got to a new error, which has to do with how we're initializing the graph, but will leave that for future diffs.

Reviewed By: SherlockNoMad

Differential Revision: D53248454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118791
Approved by: https://github.com/qxy11, https://github.com/SherlockNoMad, https://github.com/khabinov
2024-02-01 10:09:07 +00:00
0dc15ff674 [reland][export] Fix graph signature for primitive outputs (#118818)
Summary: Reland of D53233649/https://github.com/pytorch/pytorch/pull/118655. Previously I didn't realize there was a use-case of a torchbind object as an input to the graph, so I didn't mark `CustomObjArgument` as a valid input, which broke [this test](a43c28368c/test/export/test_torchbind.py (L81)). Somehow the initial CI did not catch it, but hud was sad so that PR was reverted. So now I added `CustomObjArgument` as valid input [here](https://github.com/pytorch/pytorch/pull/118818/files#diff-92420f977c3a02b2deadf6752ce4a9ee601c20612a1a13cc365252eb09410edbR298).

Test Plan: CI

Reviewed By: tarun292

Differential Revision: D53288445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118818
Approved by: https://github.com/ydwu4
2024-02-01 09:59:05 +00:00
b8e71cf302 fused adam(w): Reduce register usage (#117872)
As per title, reducing register usage for better occupancy.

Changes are:
- use 32bit indexing if possible
- convert some arguments of fused adam(w) functor to its template parameters
- give `const` to some arguments

The tables below show the per-thread register counts of adamw for sm90 before/after this PR, with and without amsgrad enabled.

### without amsgrad
| dtype | main | this PR |
|-------|------|---------|
| bf16  | 79   | 64      |
| fp16  | 82   | 64      |
| fp32  | 126  | 64      |
| fp64  | 128  | 109     |

### with amsgrad
| dtype | main | this PR |
|-------|------|---------|
| bf16  | 124  | 74      |
| fp16  | 124  | 74      |
| fp32  | 123  | 76      |
| fp64  | 128  | 121     |

---

`AdamW(..., fused=True)` with llama-2 bf16 on H100 improved from 126.648ms to 49.935ms of CUDA avg time according to torch profiler.
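
For reference, the fused optimizer (with or without amsgrad, matching the two tables above) is enabled like this (sketch):

```
import torch

params = [torch.nn.Parameter(torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16))]
opt = torch.optim.AdamW(params, lr=1e-4, fused=True, amsgrad=True)  # amsgrad toggles the second table

params[0].grad = torch.randn_like(params[0])
opt.step()
```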

This PR:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          ProfilerStep*         0.00%       0.000us         0.00%       0.000us       0.000us        5.878s        46.47%        5.878s     293.918ms           0 b           0 b           0 b           0 b            20
                                               aten::mm         2.57%     224.777ms         8.50%     741.993ms      54.962us        2.591s        20.48%        2.591s     191.910us           0 b           0 b     441.39 Gb     441.39 Gb         13500
                                          ProfilerStep*        31.64%        2.763s        67.67%        5.910s     295.485ms       0.000us         0.00%        2.551s     127.547ms      48.00 Kb      -1.44 Mb           0 b    -506.38 Gb            20
       autograd::engine::evaluate_function: MmBackward0         0.13%      11.349ms         7.90%     690.160ms     153.369us       0.000us         0.00%        1.726s     383.544us           0 b           0 b     198.65 Gb    -137.53 Gb          4500
                                            MmBackward0         0.45%      38.959ms         7.68%     670.399ms     148.978us       0.000us         0.00%        1.693s     376.326us           0 b           0 b     332.81 Gb       2.26 Gb          4500
                              Optimizer.step#AdamW.step         0.00%       0.000us         0.00%       0.000us       0.000us        1.012s         8.00%        1.012s      50.617ms           0 b           0 b           0 b           0 b            20
                                             AdamW.step         0.01%     846.000us         3.39%     296.240ms      14.812ms       0.000us         0.00%     998.876ms      49.944ms           0 b           0 b           0 b           0 b            20
                              Optimizer.step#AdamW.step         0.26%      23.113ms         3.38%     295.394ms      14.770ms       0.000us         0.00%     998.876ms      49.944ms           0 b           0 b           0 b           0 b            20
                                    aten::_fused_adamw_         0.13%      11.000ms         3.08%     268.545ms      13.427ms     998.705ms         7.89%     998.705ms      49.935ms           0 b           0 b           0 b           0 b            20
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us     998.705ms         7.89%     998.705ms     155.078us           0 b           0 b           0 b           0 b          6440
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nt_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     872.287ms         6.90%     872.287ms     193.842us           0 b           0 b           0 b           0 b          4500
                                           aten::matmul         0.19%      16.721ms         1.82%     159.130ms      35.362us       0.000us         0.00%     864.840ms     192.187us           0 b           0 b     107.46 Gb           0 b          4500
                                           aten::linear         0.28%      24.641ms         2.09%     182.129ms      40.473us       0.000us         0.00%     765.554ms     170.123us           0 b           0 b     107.46 Gb      12.46 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     690.729ms         5.46%     690.729ms     178.945us           0 b           0 b           0 b           0 b          3860
                                              aten::mul         1.36%     118.465ms         4.89%     427.071ms      21.225us     549.580ms         4.34%     549.697ms      27.320us     224.03 Kb     223.96 Kb     413.51 Gb     413.36 Gb         20121
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     484.455ms         3.83%     484.455ms     151.392us           0 b           0 b           0 b           0 b          3200
      autograd::engine::evaluate_function: MulBackward0         0.27%      23.176ms         4.63%     404.534ms      69.747us       0.000us         0.00%     406.155ms      70.027us           0 b           0 b     -46.01 Gb    -257.12 Gb          5800
                                             aten::add_         0.39%      34.186ms         7.22%     630.849ms      54.384us     394.402ms         3.12%     394.402ms      34.000us           0 b           0 b      -6.68 Gb      -6.68 Gb         11600
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize256...         0.00%       0.000us         0.00%       0.000us       0.000us     366.653ms         2.90%     366.653ms     282.041us           0 b           0 b           0 b           0 b          1300
                                            aten::copy_         0.41%      35.934ms        20.61%        1.800s     147.691us     341.572ms         2.70%     341.572ms      28.025us      48.00 Kb      48.00 Kb     -40.00 Mb     -40.00 Mb         12188
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 8.733s
Self CUDA time total: 12.651s

AdamW.step <FunctionEventAvg key=AdamW.step self_cpu_time=846.000us cpu_time=14.812ms  self_cuda_time=0.000us cuda_time=49.944ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
Optimizer.step#AdamW.step <FunctionEventAvg key=Optimizer.step#AdamW.step self_cpu_time=23.113ms cpu_time=14.770ms  self_cuda_time=0.000us cuda_time=49.944ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
Optimizer.step#AdamW.step <FunctionEventAvg key=Optimizer.step#AdamW.step self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=1.012s cuda_time=50.617ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>

```

Main
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          ProfilerStep*         0.00%       0.000us         0.00%       0.000us       0.000us        7.354s        42.89%        7.354s     367.698ms           0 b           0 b           0 b           0 b            20
                                          ProfilerStep*        28.22%        2.875s        72.48%        7.384s     369.184ms       0.000us         0.00%        4.067s     203.325ms      48.00 Kb      -1.48 Mb           0 b    -508.04 Gb            20
                                               aten::mm         2.24%     228.499ms         7.13%     726.223ms      53.794us        2.563s        14.95%        2.563s     189.873us           0 b           0 b     441.39 Gb     441.39 Gb         13500
                              Optimizer.step#AdamW.step         0.00%       0.000us         0.00%       0.000us       0.000us        2.546s        14.85%        2.546s     127.304ms           0 b           0 b           0 b           0 b            20
                                             AdamW.step         0.01%     821.000us         2.87%     292.871ms      14.644ms       0.000us         0.00%        2.533s     126.654ms           0 b           0 b           0 b           0 b            20
                              Optimizer.step#AdamW.step         0.22%      22.801ms         2.87%     292.050ms      14.602ms       0.000us         0.00%        2.533s     126.654ms           0 b           0 b           0 b           0 b            20
                                    aten::_fused_adamw_         0.11%      11.332ms         2.61%     265.853ms      13.293ms        2.533s        14.77%        2.533s     126.648ms           0 b           0 b           0 b           0 b            20
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us        2.533s        14.77%        2.533s     393.315us           0 b           0 b           0 b           0 b          6440
       autograd::engine::evaluate_function: MmBackward0         0.13%      13.342ms         6.73%     685.250ms     152.278us       0.000us         0.00%        1.706s     379.209us           0 b           0 b     198.65 Gb    -138.02 Gb          4500
                                            MmBackward0         0.38%      38.974ms         6.52%     664.652ms     147.700us       0.000us         0.00%        1.675s     372.113us           0 b           0 b     333.59 Gb       2.75 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nt_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     859.515ms         5.01%     859.515ms     191.003us           0 b           0 b           0 b           0 b          4500
                                           aten::matmul         0.16%      16.431ms         1.49%     152.052ms      33.789us       0.000us         0.00%     856.839ms     190.409us           0 b           0 b     107.46 Gb           0 b          4500
                                           aten::linear         0.23%      23.703ms         1.72%     174.862ms      38.858us       0.000us         0.00%     758.995ms     168.666us           0 b           0 b     107.46 Gb      12.21 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     682.302ms         3.98%     682.302ms     176.762us           0 b           0 b           0 b           0 b          3860
                                              aten::mul         1.16%     117.854ms         4.12%     420.100ms      20.892us     544.045ms         3.17%     544.157ms      27.062us     240.38 Kb     240.34 Kb     419.45 Gb     419.29 Gb         20108
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     479.767ms         2.80%     479.767ms     149.927us           0 b           0 b           0 b           0 b          3200
      autograd::engine::evaluate_function: MulBackward0         0.27%      27.303ms         3.95%     402.627ms      69.418us       0.000us         0.00%     403.020ms      69.486us           0 b           0 b     -45.56 Gb    -257.26 Gb          5800
                                             aten::add_         0.32%      32.543ms         6.08%     619.248ms      53.383us     393.242ms         2.29%     393.242ms      33.900us           0 b           0 b      -6.21 Gb      -6.21 Gb         11600
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize256...         0.00%       0.000us         0.00%       0.000us       0.000us     363.245ms         2.12%     363.245ms     279.419us           0 b           0 b           0 b           0 b          1300
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us     338.460ms         1.97%     338.460ms      29.228us           0 b           0 b           0 b           0 b         11580
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 10.187s
Self CUDA time total: 17.145s

AdamW.step <FunctionEventAvg key=AdamW.step self_cpu_time=821.000us cpu_time=14.644ms  self_cuda_time=0.000us cuda_time=126.654ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
Optimizer.step#AdamW.step <FunctionEventAvg key=Optimizer.step#AdamW.step self_cpu_time=22.801ms cpu_time=14.602ms  self_cuda_time=0.000us cuda_time=126.654ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
Optimizer.step#AdamW.step <FunctionEventAvg key=Optimizer.step#AdamW.step self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=2.546s cuda_time=127.304ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
```

Script I used: https://gist.github.com/crcrpar/ca951d4e7f3e1c771d502135b798f0d1

<!--

## adamw

### This PR

```console
$ cuobjdump ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/fused_adamw_impl.cu.o -xelf all
Extracting ELF file    1: fused_adamw_impl.sm_70.cubin
Extracting ELF file    2: fused_adamw_impl.sm_80.cubin
Extracting ELF file    3: fused_adamw_impl.sm_90.cubin
$ cuobjdump -res-usage fused_adamw_impl.sm_90.cubin | cu++filt

Resource usage:
 Common:
  GLOBAL:3
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<c10::BFloat16, (int)4, (at::native::ADAM_MODE)1, (bool)0>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:64 STACK:8 SHARED:0 LOCAL:0 CONSTANT[0]:3952 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<c10::Half, (int)4, (at::native::ADAM_MODE)1, (bool)0>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:64 STACK:8 SHARED:0 LOCAL:0 CONSTANT[0]:3952 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<float, (int)4, (at::native::ADAM_MODE)1, (bool)0>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:64 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3952 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<double, (int)4, (at::native::ADAM_MODE)1, (bool)0>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:109 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3952 TEXTURE:0 SURFACE:0 SAMPLER:0
```

### Main

```console
$ cuobjdump ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/fused_adamw_impl.cu.o -xelf all
Extracting ELF file    1: fused_adamw_impl.cu.1.sm_70.cubin
Extracting ELF file    2: fused_adamw_impl.cu.2.sm_80.cubin
Extracting ELF file    3: fused_adamw_impl.cu.3.sm_90.cubin
$ cuobjdump -res-usage fused_adamw_impl.cu.3.sm_90.cubin | cu++filt

Resource usage:
 Common:
  GLOBAL:3
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<c10::BFloat16, (int)4>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:79 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3945 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<c10::Half, (int)4>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:82 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3945 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<float, (int)4>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:126 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3945 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<double, (int)4>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:128 STACK:40 SHARED:0 LOCAL:0 CONSTANT[0]:3945 TEXTURE:0 SURFACE:0 SAMPLER:0
```

## adamw & amsgrad
### This PR
```console
root@1a5180b041f7:/opt/pytorch/pytorch# cuobjdump ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/fused_adamw_amsgrad_impl.cu.o -xelf all
Extracting ELF file    1: fused_adamw_amsgrad_impl.sm_70.cubin
Extracting ELF file    2: fused_adamw_amsgrad_impl.sm_80.cubin
Extracting ELF file    3: fused_adamw_amsgrad_impl.sm_90.cubin
root@1a5180b041f7:/opt/pytorch/pytorch# cuobjdump -res-usage fused_adamw_amsgrad_impl.sm_90.cubin

Resource usage:
 Common:
  GLOBAL:3
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<c10::BFloat16, (int)5, (at::native::ADAM_MODE)1, (bool)1>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:74 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3904 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<c10::Half, (int)5, (at::native::ADAM_MODE)1, (bool)1>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:74 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3904 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<float, (int)5, (at::native::ADAM_MODE)1, (bool)1>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:76 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3904 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<double, (int)5, (at::native::ADAM_MODE)1, (bool)1>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:121 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3904 TEXTURE:0 SURFACE:0 SAMPLER:0
```

### Main
```console
root@7c40321796bc:/opt/pytorch/pytorch# cuobjdump ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/fused_adamw_amsgrad_impl.cu.o -xelf all
Extracting ELF file    1: fused_adamw_amsgrad_impl.cu.1.sm_70.cubin
Extracting ELF file    2: fused_adamw_amsgrad_impl.cu.2.sm_80.cubin
Extracting ELF file    3: fused_adamw_amsgrad_impl.cu.3.sm_90.cubin
root@7c40321796bc:/opt/pytorch/pytorch# cuobjdump -res-usave fused_adamw_amsgrad_impl.cu.3.sm_90.cubin
cuobjdump fatal   : Unknown option 'res-usave'
root@7c40321796bc:/opt/pytorch/pytorch# cuobjdump -res-usage fused_adamw_amsgrad_impl.cu.3.sm_90.cubin

Resource usage:
 Common:
  GLOBAL:3
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<c10::BFloat16, (int)5>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:124 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3897 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<c10::Half, (int)5>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:124 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3897 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<float, (int)5>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:123 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3897 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<double, (int)5>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:128 STACK:40 SHARED:0 LOCAL:0 CONSTANT[0]:3897 TEXTURE:0 SURFACE:0 SAMPLER:0
```

-->
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117872
Approved by: https://github.com/janeyx99
2024-02-01 09:34:50 +00:00
eba4bd6b86 Updated test_upsamplingBiMode2d_consistency (#118388)
Description:
- Lowered error thresholds and added an input range for bicubic to expose the inconsistency error in the implementation of upsampling (smaller -> larger) with bicubic aa=False mode for uint8 input dtype
- Updated out-dated comments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118388
Approved by: https://github.com/NicolasHug
2024-02-01 09:22:23 +00:00
7e0ea0d5df [export] Only deepcopy graph in unlift (#118821)
Summary: We only need to deepcopy the graph because we're modifying the graph by unlifting its parameter/buffer inputs. We don't need to deepcopy the graph module state/contents. Deepcopying the full graph module causes an error when it contains an ExecuTorch LoweredModule which stores tensors.

Test Plan: Fixes the following diff

Differential Revision: D53290077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118821
Approved by: https://github.com/tugsbayasgalan
2024-02-01 09:00:22 +00:00
4fc4f5eb06 [Dynamo] Support tensor is not tensor (#118840)
Fixes Meta internal use case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118840
Approved by: https://github.com/yf225
2024-02-01 07:32:43 +00:00
a1280f0cc6 Add an OpInfo test for split_with_sizes_copy (#118512)
Adding an `OpInfo` test for `split_with_sizes_copy` so we can use it to test [CUDA fast path for split_with_sizes_copy.out](https://github.com/pytorch/pytorch/pull/117203). Since the `OpInfo` test doesn't exist yet and introducing it requires modifications to the `CompositeExplicitAutograd` impl, adding the `OpInfo` test in a separate PR to establish a healthy baseline.

Changes made:
- Registered a batching rule for `split_with_sizes_copy`.
- Registered a decomposition for `split_with_sizes_copy`.
- Registered a DTensor prop rule for `split_with_sizes_copy`.
- Added required dtype and device checks to the composite impl.
- Added output resize to the composite impl.
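
For readers unfamiliar with the op, here is a minimal sketch of what `split_with_sizes_copy` does (assuming the public `torch.split_with_sizes_copy` binding; this is illustrative and not part of this PR's test code):
```
import torch

x = torch.arange(10)
# Like split_with_sizes, but returns copies instead of views of the input.
chunks = torch.split_with_sizes_copy(x, [3, 3, 4])
print([c.tolist() for c in chunks])  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```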

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118512
Approved by: https://github.com/albanD
2024-02-01 07:09:27 +00:00
2b48891e62 [AOTInductor] Add Runtime Constant-folding for AOTInductor (#118765)
Summary:
Add Runtime Constant-folding for AOTInductor.
This also include the invocation of constant folding at load time.

The constant folding lowering is a 2-step process.
First, we split the graph into 2 modules; one of them is the constant module, which doesn't depend on any input, so the whole module can be inferred (constant-folded) once and reused. The constant module is lowered, codegen-ed as usual, and cached (let's call this the constant code). The constant code reuses the whole lowering/profiling/etc. process; the only difference is that we do not generate any headers or initialization for it.
Second, after handling the constant module, we take care of the main module (the part that depends on the user input). For the main module, we take in one additional component, the constant code, compared with a normal lowering. The additional step here is that we inject the constant code into the codegen-ed main module and create the caller for the main module to consume the result of the constant module.
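
As a hedged illustration of the split described above (a toy module with made-up names, not the actual AOTInductor pass), the "constant module" corresponds to the subgraph that depends only on weights, while the "main module" consumes its result:
```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w1 = torch.nn.Parameter(torch.randn(4, 4))
        self.w2 = torch.nn.Parameter(torch.randn(4, 4))

    def forward(self, x):
        folded = self.w1 @ self.w2   # depends only on constants -> constant module
        return x @ folded            # depends on user input     -> main module
```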

Test Plan: Unit tests included in commit.

Differential Revision: D53274382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78
2024-02-01 04:54:25 +00:00
b97ab47619 [pytorch][ao] Update PerChannelMinMaxObserver default _load_from_state_dict (#118659)
Summary:
When `version` is missing in the metadata, use `min_val/max_val` as keys instead of `max_vals/min_vals`
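
A minimal sketch of the fallback described above (an illustrative helper with a placeholder version cutoff, not the actual `PerChannelMinMaxObserver._load_from_state_dict` code):
```
def _minmax_state_dict_keys(version, legacy_cutoff=3):
    # legacy_cutoff is a placeholder; the real observer tracks its own version numbers.
    if version is None:  # no version info in metadata -> assume modern keys
        return "min_val", "max_val"
    return ("min_vals", "max_vals") if version < legacy_cutoff else ("min_val", "max_val")
```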

## Reasons
1. It's been almost 2 years since the change in D30003700, which means most checkpoints now use the `max_val/min_val` keys

2. Most checkpoint dumps produced via `model.state_dict()` don't have version info, which leads to a spurious `missing keys` error when loading the state_dict

Test Plan: CI

Differential Revision: D53233012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118659
Approved by: https://github.com/jerryzh168
2024-02-01 04:39:31 +00:00
526701cfb7 [executorch hash update] update the pinned executorch hash (#118698)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118698
Approved by: https://github.com/pytorchbot
2024-02-01 03:39:50 +00:00
45d2dff844 [easy] Enable test_neg_view for 5D SampleInput for torch.nn.functional.linear (#118815)
Fixes #117854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118815
Approved by: https://github.com/malfet
2024-02-01 03:26:45 +00:00
adff335095 [vision hash update] update the pinned vision hash (#118825)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118825
Approved by: https://github.com/pytorchbot
2024-02-01 03:14:16 +00:00
9b28621369 [FSDP2] Added forward unshard/wait for unshard/reshard (#117973)
This PR adds the all-gather and free logic required for forward.
- We define the logical all-gather as two ops: (1) unshard and (2) wait for unshard. This abstraction allows capturing both implicit forward prefetching (using multiple streams and `async_op=False`) and explicit forward prefetching (using `async_op=True`).
- Symmetrically, we define the reshard op to free the unsharded parameters.

Some other notes:
- The `FSDPParamGroup` and its `FSDPParam`s transition their sharded states together. This invariant allows us to reason about the parameters by group rather than individually with respect to whether they are sharded or unsharded.

---

### How Does the Overlap Work for All-Gather?

For context, the all-gather consists of three steps: (1) copy-in, (2) all-gather collective, and (3) copy-out.

<details>
<summary> Example </summary>

Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB

Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB

Each rank allocates the group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB

Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB

Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>
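
A tiny sketch of the copy-in and all-gather steps from the example above, written with plain tensors and `torch.distributed` (it assumes an already-initialized process group and is not the FSDP2 implementation):
```
import torch
import torch.distributed as dist

def copy_in_and_all_gather(sharded_params, group):
    world_size = dist.get_world_size(group)
    # Copy-in: pack this rank's (padded) shards into one contiguous all-gather input.
    inp = torch.cat([p.flatten() for p in sharded_params])
    out = torch.empty(world_size * inp.numel(), dtype=inp.dtype, device=inp.device)
    # All-gather collective: one large call instead of one call per parameter.
    dist.all_gather_into_tensor(out, inp, group=group)
    return out  # copy-out (slicing back into per-parameter tensors) follows
```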

`dist.all_gather_into_tensor()` always has the PG's NCCL stream wait for the current stream before running the collective. `async_op=False` means that the function waits on the work, having the current stream wait for the NCCL stream before returning. `async_op=True` means it returns the `Work` object, which the user can wait on later.

#### Implicit Prefetching
Implicit prefetching achieves communication/computation overlap without changing the CPU issue order:
- We use separate streams for copy-in and for issuing the `dist.all_gather_into_tensor()`. The copy-in stream allows us to overlap the copy-in with all-gather/reduce-scatter in backward, and the all-gather stream allows us to overlap the all-gather with forward compute (issued before it).
     - Because `dist.all_gather_into_tensor()` always has the PG's NCCL stream wait for the current stream, we need this "dummy" all-gather stream to prevent the all-gather from waiting on the forward compute with which it should overlap.
     - Without the separate copy-in stream, we cannot overlap all-gather copy-in with all-gather in forward.
- We copy-out in the default stream after having the default stream wait for the all-gather. This means that the autograd leaves are allocated in the default stream and autograd will not call `recordStream`.

Implicit prefetching does not require knowing the execution order ahead of time. However, when overlapping the next all-gather with the current compute, there may be a gap from the CPU thread issuing the current compute. If the CPU thread can run ahead, then this is not an issue.

#### Explicit Prefetching
Explicit prefetching achieves communication/computation overlap by changing the CPU issue order, namely by reordering the all-gather to be before the compute with which it should overlap.
- Because we reorder, we do not need any separate streams, and we can use `async_op=False` for overlap.
- We can expose this explicit prefetching as a module-level `unshard()` op (e.g. `module.unshard(async_op: bool)`), and we can use it as a primitive for implementing the explicit forward prefetching in existing FSDP.

Explicit prefetching requires knowing the execution order.

---

Disclaimer: The testing is relatively lighter in this PR. I did not want to spend too much time writing new forward-only tests. The stream usage will be exercised thoroughly once we have backward too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117973
Approved by: https://github.com/weifengpy, https://github.com/yifuwang
ghstack dependencies: #117950, #117955
2024-02-01 03:08:13 +00:00
8d6e34b21b Add verbose option to failures histogram (#118757)
Sample output: https://gist.github.com/jamesjwu/cc80d7da305add0a69c5e39aae09a077
Using directories from https://hud.pytorch.org/pr/118597:
eager_tests: [linux-focal-py3.11-clang10 / test (default, 1, 3, linux.2xlarge)](https://github.com/pytorch/pytorch/actions/runs/7716582714/job/21034340833)
dynamo_tests: [linux-focal-py3.11-clang10 / test (dynamo, 1, 3, linux.2xlarge)](https://github.com/pytorch/pytorch/actions/runs/7716582714/job/21034342747)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118757
Approved by: https://github.com/zou3519
2024-02-01 02:46:36 +00:00
499f31d40b [dynamo] use par_style = "xar" in minifier targets file (#118603)
For internal usage, par_style="xar" is needed in order for certain build
modes to work with triton.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118603
Approved by: https://github.com/williamwen42
2024-02-01 02:42:26 +00:00
a43c28368c [export] support non-persistent buffers (#118612) (#118722)
Summary:
X-link: https://github.com/pytorch/executorch/pull/1769

Basic support for non-persistent buffers, which are buffers that do not show up in the state dict.
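
For context, a quick illustration of the buffer kind being supported here (standard `nn.Module` API, not export-specific code):
```
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("persistent_buf", torch.zeros(2))
        self.register_buffer("non_persistent_buf", torch.zeros(2), persistent=False)

m = M()
print(list(m.state_dict().keys()))  # only ['persistent_buf']
```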

One weird twist is that most of our other systems (FX, aot_export, dynamo) have completely buggy handling of non-persistent buffers. I tried to go on a wild goose chase to fix them all, but it got to be too much. So I introduced some sad rewrite passes in `_export` to make the final state dict correctly align with the original module's state dict.

This exposed some bugs/ambiguous handling of parameters/buffers in existing test code. For example, `TestSaveLoad.test_save_buffer` traced over a module that was not in the root module hierarchy and caused some weird behavior. I think we should error explicitly on use cases like this: https://github.com/pytorch/pytorch/issues/118410. For now I just rewrote the tests or skipped them.

Test Plan: added a unit test

Differential Revision: D53253905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118722
Approved by: https://github.com/SherlockNoMad, https://github.com/angelayi
2024-02-01 00:36:09 +00:00
4cba1dd0c3 [submodule] Update cudnn_frontend to v1.0.3 (#118782)
# Summary
Updates cudnn frontend to tagged 1.0.3 tagged version

submodule

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118782
Approved by: https://github.com/malfet
2024-02-01 00:35:03 +00:00
suo
2f79a7bf9e [export] make spec comparison indifferent to fx collections (#118718)
Treat immutable_dict as dict and immutable_list as list. This behavior was tripped up by some executorch tests

Differential Revision: [D53252679](https://our.internmc.facebook.com/intern/diff/D53252679/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118718
Approved by: https://github.com/zhxchen17
2024-02-01 00:10:49 +00:00
6c67f3333a [Inductor] Skip triton templates for mixedmm on SM70- (#118591)
As it results in numerical errors, see https://github.com/pytorch/pytorch/issues/117144

Fixes https://github.com/pytorch/pytorch/issues/117144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118591
Approved by: https://github.com/jansel
2024-01-31 23:30:45 +00:00
da4b4d961e Support printing storage while FakeTensorMode is enabled (#118780)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118780
Approved by: https://github.com/thiagocrepaldi, https://github.com/eellison
2024-01-31 23:10:47 +00:00
30f43e3d89 [ONNX][bench] Deepcopy model to another device before export to avoid OOM (#118710)
Prior to onnx export, the model is deepcopied to avoid modifications that may affect later performance profiling. However this increases the memory requirement on the device.
This PR modifies the script to deepcopy and export the model on another device when possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118710
Approved by: https://github.com/thiagocrepaldi
2024-01-31 23:03:39 +00:00
21ce53b9c5 Add inf norm support for _foreach_norm (#118441)
Fixes #117803
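
A quick illustration of the added behavior (`_foreach_norm` is a private API, so the exact signature used here is an assumption):
```
import torch

ts = [torch.tensor([1.0, -3.0]), torch.tensor([2.0, 0.5])]
# Per-tensor infinity norms, computed in a single foreach call.
print(torch._foreach_norm(ts, float('inf')))  # [tensor(3.), tensor(2.)]
```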

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118441
Approved by: https://github.com/mlazos
2024-01-31 22:58:51 +00:00
e87ac82c98 Fix missing default dim param in weight norm interface decomp (#118762)
Fix for https://github.com/pytorch/pytorch/issues/118742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118762
Approved by: https://github.com/ezyang, https://github.com/shunting314
2024-01-31 22:10:10 +00:00
e426924c19 Change classification to beta for TORCH_LOGS (#118682)
Changes classification of TORCH_LOGS to beta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118682
Approved by: https://github.com/svekars
2024-01-31 21:50:55 +00:00
fb391a016d Test that optimizers are running cudagraphs (#118716)
Updates compiled optimizer tests to ensure that cudagraphs is running when on cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118716
Approved by: https://github.com/eellison
2024-01-31 21:34:23 +00:00
8dee7b7a16 Add TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED (#118750)
This allows us to request extended (including C++ backtrace) information
whenever a specific guard occurs.
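
A hypothetical usage sketch (the guard string below is purely illustrative, and it assumes the variable is read when dynamo's config is imported):
```
import os
# Ask dynamo for extended debug info (including a C++ backtrace) whenever a guard
# matching this string is added; set before importing torch.
os.environ["TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED"] = "L['x'].requires_grad == True"
import torch
```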

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118750
Approved by: https://github.com/aakhundov
2024-01-31 21:16:27 +00:00
c978f38bd4 Some minor type stub improvements (#118529)
I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete", but in particular I wanted to get feedback on whether or not people liked making ValueRanges Generic; it seems that distinguishing whether you have an Expr ValueRange or a SympyBoolean ValueRange is a lot of trouble for downstream. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRange[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707
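
A hedged, simplified sketch of the Generic + TypeGuard idea (illustrative names only, not the actual symbolic_shapes code; `typing.TypeGuard` needs Python 3.10+):
```
from typing import Generic, TypeVar, TypeGuard
import sympy
from sympy.logic.boolalg import Boolean

T = TypeVar("T", sympy.Expr, Boolean)

class ValueRanges(Generic[T]):
    def __init__(self, lower: T, upper: T) -> None:
        self.lower, self.upper = lower, upper

def is_bool_range(vr: "ValueRanges") -> TypeGuard["ValueRanges[Boolean]"]:
    # Runtime check that lets the type checker refine the generic parameter.
    return isinstance(vr.lower, Boolean)
```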

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529
Approved by: https://github.com/Skylion007
2024-01-31 20:56:56 +00:00
5ced432a0d Revert "[export] Fix graph signature for primitive outputs (#118655)"
This reverts commit 680cc6b17ab3f318c0da6177646afe6700152327.

Reverted https://github.com/pytorch/pytorch/pull/118655 on behalf of https://github.com/atalman due to broke TestExportTorchbind.test_input test ([comment](https://github.com/pytorch/pytorch/pull/118655#issuecomment-1919940598))
2024-01-31 20:55:46 +00:00
a768a50a55 Re-enable test_nan_to_num (#118711)
Resolve TODO and re-enable as https://github.com/pytorch/pytorch/issues/82763 is resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118711
Approved by: https://github.com/peterbell10
2024-01-31 20:01:10 +00:00
9391af9796 Merging heuristics (#118029)
Every day I move closer and closer to just using numbers

* number of heuristics that marked it as high, probable, low, none etc
* order of heuristics in the `__init__` file as well as how the heuristic ordered the tests
* put heuristics historical edited files and profiling as not trial mode
* briefly sanity checked that all shards of the larger test files (ex test_ops) exist and there are no dups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118029
Approved by: https://github.com/huydhn
2024-01-31 20:00:10 +00:00
3280fdb883 [FSDP2] Added _to_kwargs root forward input cast (#117955)
This PR adds a `_to_kwargs()` call on the FSDP root module's forward inputs to move them to `device` similar to DDP.
39df084001/torch/nn/parallel/distributed.py (L1426-L1427)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117955
Approved by: https://github.com/weifengpy
ghstack dependencies: #117950
2024-01-31 19:51:32 +00:00
d33f9dcefe [FSDP2] Added all-gather and unsharded parameter (#117950)
This PR adds the FSDP all-gather (the copy-in/all-gather collective and the copy-out) and the unsharded parameter concept to `FSDPParam`. This is to prepare for being able to run the forward pass.
- We implement all-gather as two functions: `foreach_all_gather` (copy-in/all-gather collective) and `foreach_all_gather_copy_out` (copy-out).
    - In the future, there will be two paths: `async_op=True` in the default stream for explicit prefetching and `async_op=False` in separate streams for implicit prefetching.
    - In the future, we will use `torch.split_with_sizes_copy` in the copy-out when it has the CUDA fast path.
    - We have the functions operate on `List[FSDPParam]` instead of passing the `torch.Tensor` and metadata mainly so that the `all_gather_input` can be computed under the `all_gather_copy_in_stream`. Since the two functions are specific to FSDP, I did not see motivation for avoiding this at the cost of entering/exiting the `all_gather_copy_in_stream` context twice (which incurs some CPU overhead).
- The `init_all_gather_output()` and `init_unsharded_parameter()` functions may seem unintuitive. The reason we initialize them once and write to them in-place thereafter is for autograd. See the note `[Note: FSDP and autograd]` in the code.
- We expand our 'FSDP tensors' definition to include the all-gather input and all-gather output in addition to the sharded and unsharded parameters. This distinction might seem unnecessary or pedantic, but it enables a language for describing pre- and post-all-gather transformations.
- We use the `_unsafe_preserve_version_counters` context when copying out because otherwise autograd will complain of a version mismatch in backward due to writing to the leaf tensors. (An alternative would be to use `.data`, but we are avoiding that 😄 .)

---

<details>
<summary> Copy-in/All-Gather/Copy-Out Example </summary>

Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB

Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB

Each rank allocates the group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB

Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB

Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>
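
A hedged sketch of the copy-out step from the example above, using plain tensor ops rather than the actual `foreach_all_gather_copy_out` (shard sizes and world size are assumed inputs):
```
import torch

def copy_out(all_gather_output, shard_numels, world_size):
    per_rank = all_gather_output.chunk(world_size)  # one flat shard per rank
    params = []
    offset = 0
    for numel in shard_numels:
        # Concatenate each rank's slice of this parameter into the unsharded flat param.
        pieces = [r[offset:offset + numel] for r in per_rank]
        params.append(torch.cat(pieces))
        offset += numel
    return params
```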

---

For context, we use the copy-in/all-gather/copy-out strategy instead of NCCL group coalescing for two reasons:
1. One large NCCL all-gather is still noticeably faster than several NCCL all-gathers using group coalescing of the same total bytes (even after NCCL 2.18.3). We prefer to tradeoff extra device-to-device copies (using GPU high-bandwidth memory) to save communication time, which does not improve as fast from hardware generation to generation.
2. Copying out of the all-gather buffer tensor simplifies multi-stream memory handling because there is a constant number of such all-gather tensors alive at once. (The copy-out is done in the default/compute stream.) If we directly used the all-gather tensor memory for computation, then the number of such alive tensors is linear in the module depth and hence dependent on the particular model.

---

Disclaimer: This PR has some extraneous code, but I did not want to simplify too much since that code will be added back soon anyway (e.g. for overlapping, mixed precision, and ZeRO++). Hopefully it does not hinder code review too much.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117950
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-01-31 19:51:32 +00:00
483001e846 Revert "Workaround for super() calls in test_torchinductor_dynamic_shapes (#118586)"
This reverts commit f2682e75e6fd735c4a84afe59eafd541f7643f4a.

Reverted https://github.com/pytorch/pytorch/pull/118586 on behalf of https://github.com/atalman due to Broke slow tests ([comment](https://github.com/pytorch/pytorch/pull/118586#issuecomment-1919810802))
2024-01-31 19:44:29 +00:00
649f2e3000 Fix for out of bounds registers_ access in mobile TorchScript interpreter (#110300)
Summary:
The TorchScript interpreter had multiple opcodes whose logic had the potential to access the registers_ array out of bounds.

This change ensures that all registers_ accesses are in bounds or an exception will be thrown.

Test Plan: contbuild + OSS signals

Differential Revision: D49748737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110300
Approved by: https://github.com/malfet, https://github.com/kimishpatel
2024-01-31 19:40:02 +00:00
8026534a2f Add torch.complex128 and torch.complex32 to DTYPE_TO_ATEN dictionary. (#117929)
Fixes #117370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117929
Approved by: https://github.com/Skylion007, https://github.com/desertfire
2024-01-31 19:34:58 +00:00
82b6ee5a2a Fix build error in ppc64le (#118516)
...
from /home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/test/vec_test_all_types.cpp:1:
/home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h: In member function 'bool at::vec::DEFAULT::Vectorized::has_inf_nan() const':
/home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h:244:36: error: no matching function for call to 'at::vec::DEFAULT::Vectorized::_isinf(float&) const'
  244 |       if(_isnan(_vec0[i]) || _isinf(_vec0[i])) {
      |                              ~~~~~~^~~~~~~~~~
/home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h:237:21: note: candidate: 'at::vec::DEFAULT::Vectorized at::vec::DEFAULT::Vectorized::_isinf() const'
...
Started breaking from
29516bd2a0.


Pull Request resolved: https://github.com/pytorch/pytorch/pull/118516
Approved by: https://github.com/ezyang
2024-01-31 19:33:57 +00:00
aca41a3a74 [optim] lbfgs: handle complex params as independent real params (#118184)
Ref: #86340

Fixes #118148

This fixes LBFGS for complex parameters. Complex parameters are handled as R^2.
I also added a test, unfortunately, due to the closure required, I could not use the existing `_test_complex_optimizer` used for all other optimizers.
Lbfgs is special, as it will call the objective function multiple times internally. So I felt making a one-off test for lbfgs might be justifiable.
We will test if each step taken internally by the optimizer is the same for R^2 and complex parameters.

Let me know if the approach is ok, thanks
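
A small sketch of the "complex as R^2" idea (illustrative only; the PR implements this inside LBFGS rather than in user code):
```
import torch

p = torch.randn(3, dtype=torch.complex64, requires_grad=True)
loss = (p.real ** 2 + p.imag ** 2).sum()
loss.backward()
# The real and imaginary parts behave like two independent real parameters.
print(torch.view_as_real(p).shape)       # torch.Size([3, 2])
print(torch.view_as_real(p.grad).shape)  # gradients line up the same way
```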

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118184
Approved by: https://github.com/janeyx99
2024-01-31 19:24:16 +00:00
82b0341af3 s/verison/version/ (#118749)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118749
Approved by: https://github.com/malfet, https://github.com/albanD
2024-01-31 19:23:55 +00:00
41dfd0e063 Update Dynamo passrate/histogram scripts (#118752)
Changelog:
- Don't count running PYTORCH_TEST_WITH_DYNAMO=1 on dynamo/ tests in the pass
rate. This was a bug (we were counting all of these as failing, but in
reality, most of these pass). The net effect is that the passrate is (artificially)
6% higher.
- Have the histogram script filter out skips based on the passrate metric.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118752
Approved by: https://github.com/jamesjwu
2024-01-31 19:15:17 +00:00
99b69e1ffb add PrivateUse1 device support in function options_from_string. (#118627)
add PrivateUse1 device support in function options_from_string.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118627
Approved by: https://github.com/soulitzer
2024-01-31 18:52:58 +00:00
7aff92c838 [torch] Expose dynamic_shapes api at multiple levels (#118695)
Summary: Exposes `dynamic_shapes` api at multiple levels so it's easier to replace the old API `dynamic_dim()` with the new API `Dim()`.
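
A hedged sketch of the new-style API being exposed (`Dim()` replacing `dynamic_dim()`); the exact argument spelling may vary across export versions:
```
import torch
from torch.export import export, Dim

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

batch = Dim("batch")
# Mark dim 0 of the input "x" as dynamic using the new Dim-based API.
ep = export(M(), (torch.randn(4, 3),), dynamic_shapes={"x": {0: batch}})
```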

Test Plan: CI

Differential Revision: D53246409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118695
Approved by: https://github.com/ydwu4
2024-01-31 18:50:01 +00:00
6bd1807ae9 enable mkl_gemm_f16f16f32 in cpublas::gemm (#118367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118367
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-31 18:37:42 +00:00
81d12846dc Add decomp for pixel_shuffle/unshuffle (#118239)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118239
Approved by: https://github.com/peterbell10
2024-01-31 18:34:21 +00:00
81b55f58ce Matmul decide should_fold using has_out instead of grad_mode (#118617)
Fixes https://github.com/pytorch/pytorch/issues/118548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118617
Approved by: https://github.com/lezcano
2024-01-31 18:34:16 +00:00
a5a0fdcae9 Remove some unnecessary skipIfTorchDynamo (#118725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118725
Approved by: https://github.com/bdhirsh
2024-01-31 18:18:17 +00:00
680cc6b17a [export] Fix graph signature for primitive outputs (#118655)
Summary:
Now that we allow primitive outputs, we need to fix how the graph
signature outputs user_outputs

Test Plan: CI

Differential Revision: D53233649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118655
Approved by: https://github.com/tarun292
2024-01-31 18:00:02 +00:00
8455447972 Support builtin callable with object arguments in dynamo (#118678)
Fix issue #117556
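
A small repro-style sketch of the pattern now supported (the exact original use case is internal, so this shape is an assumption):
```
import torch

class Foo:
    def __call__(self, x):
        return x + 1

@torch.compile
def f(obj, x):
    # builtin callable() applied to a user-defined object inside a compiled region
    return obj(x) if callable(obj) else x

print(f(Foo(), torch.ones(2)))
```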

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118678
Approved by: https://github.com/anijain2305
2024-01-31 17:54:08 +00:00
68c3cb7594 s/fialure/failure/ (#118744)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118744
Approved by: https://github.com/peterbell10
2024-01-31 17:42:44 +00:00
suo
5586d7797e fix up batchnorm folding in pt2 quant (#118720)
Changes to how attributes are structured messed this pass up, fix it

Differential Revision: [D53253601](https://our.internmc.facebook.com/intern/diff/D53253601/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118720
Approved by: https://github.com/SherlockNoMad
2024-01-31 17:40:47 +00:00
4a677da36b Add more triton kernel mutation tracking tests (#118691)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118691
Approved by: https://github.com/aakhundov
ghstack dependencies: #118676, #118595
2024-01-31 17:38:17 +00:00
b4f4fd0c28 Parse and handle functions in TTIR (#118595)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118595
Approved by: https://github.com/aakhundov
ghstack dependencies: #118676
2024-01-31 17:38:17 +00:00
1bf9ddf130 add test_truth (#118597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118597
Approved by: https://github.com/anijain2305
2024-01-31 15:10:58 +00:00
1128cf96f0 [AOTI] Support _embedding_bag in C shim (#118706)
Summary: At some point I will stop manually adding ops to the C shim and instead use torchgen to generate that code. For the near term, I need to add a few more in order to switch the AOTInductor dashboard run.

Differential Revision: [D53249074](https://our.internmc.facebook.com/intern/diff/D53249074)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118706
Approved by: https://github.com/frank-wei, https://github.com/aakhundov
ghstack dependencies: #118704, #118705
2024-01-31 15:02:40 +00:00
8db8ff652c [AOTI] Add aoti_torch_view_dtype in C shim (#118705)
Summary: Support ir.ComplexView in the ABI-compatible codegen

Differential Revision: [D53249039](https://our.internmc.facebook.com/intern/diff/D53249039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118705
Approved by: https://github.com/frank-wei
ghstack dependencies: #118704
2024-01-31 14:42:29 +00:00
dd52939438 [inductor] Refactor ir.ComplexView (#118704)
Summary: Make ir.ComplexView a subclass of ir.FallbackKernel, to unify the codegen logic

Differential Revision: [D53248972](https://our.internmc.facebook.com/intern/diff/D53248972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118704
Approved by: https://github.com/frank-wei
2024-01-31 14:42:29 +00:00
35f3ccffd4 [Cutlass 3.3.0 submodule upgrade] (#118629)
Cutlass 3.3 offers the following improvements:

- Adds support for mixed precision GEMMs on Hopper and Ampere
- Adds support for < 16B aligned GEMMs on Hopper
- Enhancements to EVT
- Enhancements to Python interface
- Enhancements to Sub-byte type handling in CuTe
- Several other bug-fixes and performance improvements
- Minor doc update
Test Plan:

CI ( ciflow/trunk, ciflow/inductor )
pytest test/inductor/test_max_autotune.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118629
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/khabinov
2024-01-31 13:53:58 +00:00
c3a3e61bcb Resolve TODO in test_slice_mutation2 (#118712)
As https://github.com/pytorch/pytorch/issues/94693 has been resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118712
Approved by: https://github.com/peterbell10
2024-01-31 08:26:22 +00:00
9afd539075 [sigmoid] update serialization to include custom objs (#118684)
Summary: Update the serialization code to handle custom objs.

Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//sigmoid/frontend/test_gpu:serializer_test

Differential Revision: D53139356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118684
Approved by: https://github.com/angelayi, https://github.com/suo
2024-01-31 08:23:34 +00:00
56718cab8d Unskip test_complex_type_conversions (#118694)
Resolve TODO and unskip test_complex_type_conversions as real and imag have been implemented for complex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118694
Approved by: https://github.com/huydhn
2024-01-31 08:04:15 +00:00
73229b4f93 Add --filter-rank to torchrun: allow logs filtering by rank (#118562)
Addresses issue https://github.com/pytorch/pytorch/issues/117383

The implementation exposes `--filter-ranks`, which filters, by rank, which files we pass to `TailLog` (used in torchrun to determine which logs to output to stdout/stderr).

## Behavior
### with --tee
Currently --tee is implemented as --redirect to file, and streams file to console using `tail`. When --tee is specified, file logs will be unaffected and we will only filter the output to console.

### with --redirect
When --redirect is specified without --tee, nothing is logged to console, so we no-op.

### with neither
When neither --tee or --redirect are specified, torchrun uses empty string "" to indicate logging to console. We intercept this empty string, and redirect it to "/dev/null" to not print to console.

The api also allows a per-rank configuration for --tee and --redirect, and is also supported by this filter implementation.

## Usage
### without --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --filter_ranks=0 t.py
hello from rank 0 python
DEBUG: TRACED GRAPH
 __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
-------------  ------  -----------------------  ---------  --------
placeholder    l_x_    L_x_                     ()         {}
call_function  mul     <built-in function mul>  (l_x_, 5)  {}
output         output  output                   ((mul,),)  {}
...
```
### with --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --tee 3 --filter_ranks=0 t.py
[rank0]:hello from rank 0 python
[rank0]:DEBUG: TRACED GRAPH
[rank0]: __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
[rank0]:-------------  ------  -----------------------  ---------  --------
[rank0]:placeholder    l_x_    L_x_                     ()         {}
[rank0]:call_function  mul     <built-in function mul>  (l_x_, 5)  {}
[rank0]:output         output  output                   ((mul,),)  {}
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118562
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-01-31 07:40:01 +00:00
995f69623d Add Silu to Dtensor Pointwise ops (#118702)
# Summary
Adds silu to the supported list, needed for llama2 mlp support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118702
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
2024-01-31 06:17:36 +00:00
74f4947caf Fix admm over empty tensors and broadcastable input (#118619)
Fixes https://github.com/pytorch/pytorch/issues/118131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118619
Approved by: https://github.com/albanD
2024-01-31 05:40:25 +00:00
2d37a046e7 [export] Enforce serialization BC/FC with updater script. (#118424)
Summary:
This diff implements a mechanism for safely updating the torch.export serialization schema, aka schema.py, which is the API surface with the strongest compatibility guarantee.

The diff consists of 3 changes:
- Added a script to "build" or "materialize" schema.py into a platform-neutral format (yaml), which serves as the committed form of the serialization schema.
- Added a unittest to compare schema.py against schema.yaml, so that developers are forced to execute the updater script when there is a mismatch between the two files.
- Added a checker inside the updater script, so that all compatible changes result in a minor version bump, and all incompatible changes result in a major version bump.

torch.export's serialization BC/FC policy is (tentatively) documented here: https://docs.google.com/document/d/1EN7JrHbOPDhbpLDtiYG4_BPUs7PttpXlbZ27FuwKhxg/edit#heading=h.pup7ir8rqjhx , we will update the

As noted in the code doc, people should be able to run the following command to update schema properly from now on:

```
    python scripts/export/update_schema.py --prefix <path_to_torch_development_diretory>
or
    buck run caffe2:export_update_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/
```

Test Plan:
buck test mode/opt caffe2/test:test_export -- -r test_schema
buck run caffe2:update_export_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/

Differential Revision: D52971020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118424
Approved by: https://github.com/angelayi
2024-01-31 05:37:58 +00:00
697ca4f292 Preliminary DeviceMesh + native c10d functional integration (#118423)
### Summary
- Added `group_name` as the third field in `dim_group_infos`.
- `DeviceMeshTest` now runs both w/ and w/0 `_USE_NATIVE_C10D_FUNCTIONAL=1` in CI.

### Other fixes
- Convert `reduceOp` to lower case before passing it into c10d_functional ops.
- Added a finalizer to handle unwaited collectives (this mirrors the treatment for Python functional collective ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118423
Approved by: https://github.com/wanchaol, https://github.com/LucasLLC, https://github.com/wconstab
2024-01-31 04:36:12 +00:00
e3cde68534 [FSDP2] Added initial _lazy_init and FQNs for debugging (#117881)
This PR adds the initial `_lazy_init`. Lazy initialization marks the point when the FSDP structure is finalized and is typically the beginning of the first forward. This would be after any meta-device initialization.
- Lazy initialization is distinct from construction time because when processing `fully_shard(module)`, there is no way to know whether a parent of `module` will have `fully_shard` applied as well. This is a consequence of `fully_shard` having to be applied bottom-up.
- At lazy initialization, we now have the concept of a _root_. The root FSDP module is the one whose `forward` runs first and ends last (and hence similarly for its backward). Having a single root simplifies handling logic that should only run "once per forward/backward/iteration". We may consider relaxing this in the future, but it will add more complexity to the design.
- Once we have a root, we can define _fully-qualified names_ (FQNs) for both parameters and modules. To aid debugging, we store `_param_fqn` and `_module_fqn` on `FSDPParam` and `FSDPParamGroup`, respectively. Note that we can have a unique `_module_fqn` for `FSDPParamGroup` since we currently assume a 1:1 relationship.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117881
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118525, #117814, #117867, #117877
2024-01-31 03:38:53 +00:00
f7ae454003 [vision hash update] update the pinned vision hash (#118700)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118700
Approved by: https://github.com/pytorchbot
2024-01-31 03:10:52 +00:00
6d7cfb5c3f [audio hash update] update the pinned audio hash (#118699)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118699
Approved by: https://github.com/pytorchbot
2024-01-31 03:10:48 +00:00
0a7e2ce0e1 [PT-Vulkan] aten::conv1d - support any stride, padding, dilation (#118660)
Summary:
This diff stack builds on yipjustin's initial special-case implementation: D50914117.

That special-case only covers
```
strides = 1
padding = 0
dilation = 1
in_channels = out_channels = groups
n = 1
```

Test Plan:
```
[jorgep31415@161342.od /data/sandcastle/boxes/fbsource (a0b8b9b7f)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/conv1d.glsl
File changed: fbcode//caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/glsl/conv1d.glsl
3 additional file change events
Buck UI: https://www.internalfb.com/buck2/ebb61796-c71d-4e0c-8148-de1eb67b5d4c
Network: Up: 10KiB  Down: 53MiB  (reSessionID-5f852cf6-9bf1-4c73-a471-4c121b53ed62)
Jobs completed: 16. Time elapsed: 21.6s.
Cache hits: 43%. Commands: 7 (cached: 3, remote: 0, local: 4)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.conv1d_simple
[       OK ] VulkanAPITest.conv1d_simple (136 ms)
[ RUN      ] VulkanAPITest.conv1d
[       OK ] VulkanAPITest.conv1d (35 ms)
[----------] 2 tests from VulkanAPITest (172 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (172 ms total)
[  PASSED  ] 2 tests.
```

Reviewed By: yipjustin

Differential Revision: D53204673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118660
Approved by: https://github.com/yipjustin
2024-01-31 01:49:09 +00:00
suo
68a75d4539 [lint] remove merge_base_with from .lintrunner.toml (#118677)
This setting is problematic in fbcode, where the expected behavior is to match `arc lint`, which has a behavior much like running `lintrunner` without a `--merge-base-with` argument.

Let's try removing this. I also updated the CI message to encourage people to run with `-m origin/main`, which should hopefully cut down on confusion in the absence of defaulting to that behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118677
Approved by: https://github.com/PaliC
2024-01-31 00:53:58 +00:00
07a7feca74 [FSDP2] Sharded parameter in FSDPParam (#117877)
This PR adds logic to shard the managed parameters on dim-0. This is like `distribute_tensor()` with two differences:
1. `distribute_tensor()` today cannot accept a `DTensor` and reshard it to the parent mesh (https://github.com/pytorch/pytorch/issues/116101).
2. `DTensor` does not pad its local shard on any `Shard` dimensions (https://github.com/pytorch/pytorch/issues/113045).

As such, the `FSDPParam._init_sharded_param()` derives the global `DTensor` metadata itself and pads the local tensor on dim-0. The padding helps make the all-gather copy-in more efficient since the all-gather buffer will require padding.
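
A tiny numeric sketch of dim-0 sharding with padding under the assumptions above (world_size=2; plain tensor ops, not the `FSDPParam` code):
```
import torch
import torch.nn.functional as F

def shard_dim0_with_padding(param, rank, world_size):
    # Pad dim-0 up to a multiple of world_size so every rank's shard has equal size.
    padded_dim0 = -(-param.size(0) // world_size) * world_size
    pad = [0, 0] * (param.dim() - 1) + [0, padded_dim0 - param.size(0)]
    return F.pad(param, pad).chunk(world_size, dim=0)[rank]

p = torch.randn(3, 3)
print(shard_dim0_with_padding(p, 0, 2).shape)  # torch.Size([2, 3])
print(shard_dim0_with_padding(p, 1, 2).shape)  # torch.Size([2, 3]); last row is padding
```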

---

Some details:
- We free the original parameter manually after constructing the sharded parameter. This lowers the peak memory during construction time slightly (since not _all_ parameters in the group must be sharded before the original parameters are freed) and is not strictly necessary.
- We bypass `nn.Module.__setattr__` because the checks are slow and unnecessary. The drawback is that we would ignore a user-defined override of `__setattr__`; however, since we have never encountered this in practice, I am okay with this. Notably, user calls to `setattr` would still use the override; FSDP only uses `setattr` as a mechanism for switching between sharded and unsharded parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117877
Approved by: https://github.com/wanchaol
ghstack dependencies: #118525, #117814, #117867
2024-01-31 00:44:19 +00:00
cyy
4a019047ad Enable nested namespace check in clang-tidy (#118506)
It is time to enable nested namespaces in the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118506
Approved by: https://github.com/albanD
2024-01-31 00:32:35 +00:00
1b03423526 [meta registration] fix _efficient_attention_forward for jagged inputs (#118657)
Fixes the meta registration for the logsumexp output, whose shape should
be defined by the size of the offsets tensor when it exists.

644f64f2d1/aten/src/ATen/native/transformers/cuda/attention.cu (L1045)

Differential Revision: [D53234217](https://our.internmc.facebook.com/intern/diff/D53234217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118657
Approved by: https://github.com/YuqingJ
2024-01-31 00:11:39 +00:00
6fa162e681 Reland: [aotinductor] Replicate split_cat from torch IR to predispatch IR" (#118590)
Summary:
This is part of the pass migration efforts. The final target is removing the acc tracer in AOTI.
In this diff, I did a few things:
1. Copied and modified the `fx_passes/split_cat.py` passes based on the predispatch IR.
2. Verified the correctness by copying `test_split_cat_fx_passes.py` into a new file `test_split_cat_fx_passes_aten_fb.py`, which is executed in AOTI and checks the counters.
3. Created a util function to execute the pass and compare the before/after graphs, giving the user more information such as the pass effect and time spent. It will create logs like
```
[2024-01-25 20:26:48,997] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 0, save before/after graph to /tmp/tmpvlpwrklp, graph before/after are the same = False, time elapsed = 0:00:00.001585
[2024-01-25 20:26:49,000] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 1, save before/after graph to /tmp/tmpz_onjfeu, graph before/after are the same = False, time elapsed = 0:00:00.001873
[2024-01-25 20:26:49,002] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 2, save before/after graph to /tmp/tmpgkck8yko, graph before/after are the same = True, time elapsed = 0:00:00.000269
[2024-01-25 20:26:49,007] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 3, save before/after graph to /tmp/tmpquenq06y, graph before/after are the same = False, time elapsed = 0:00:00.003621
[2024-01-25 20:26:49,009] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 4, save before/after graph to /tmp/tmpi8fia0dv, graph before/after are the same = True, time elapsed = 0:00:00.000190
```

Differential Revision: D53171027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118590
Approved by: https://github.com/kflu, https://github.com/khabinov, https://github.com/chenyang78
2024-01-31 00:09:46 +00:00
7761ceb6b3 Fix a bug with python lambda capture (#118676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118676
Approved by: https://github.com/jamesjwu, https://github.com/aakhundov
2024-01-30 23:59:07 +00:00
616e9dbed8 add torch.float64 precision support to the transformer test suite in TP/SP (#116436)
This PR (as a follow-up to #115530) resolves the previously observed failures of `assertEqual()` tests (with small error) when comparing outputs from the single-GPU model and the distributed model, under certain input/model sizes or when certain operations (e.g. weight-tying) are enabled. This is done by simply enabling higher-precision computation using `dtype=torch.float64`.

What is not tested: whether the distributed model's training convergence rate is affected when using only `torch.float32` precision.

Test plan:
TP: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_False`
TP+SP: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_True`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116436
Approved by: https://github.com/wanchaol
2024-01-30 23:50:29 +00:00
1f376b3b24 Fix lint after #117814 (#118689)
Forward fix after PR #117814 to make lint green again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118689
Approved by: https://github.com/awgu, https://github.com/huydhn
2024-01-30 23:46:27 +00:00
1e78dc95a4 Fix/Temporarily disable tests broken due to triton version mismatch (#118661)
Summary:
These tests were broken because the internal Triton is 2.2 whereas the external one is 3.0.

Will update after internal version catches up.

Test Plan: CI

Differential Revision: D53231204

Co-authored-by: Oguz Ulgen <oulgen@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118661
Approved by: https://github.com/oulgen
2024-01-30 23:06:35 +00:00
2f7839e6db register decomposition for rsub in torch._refs (#118288)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118288
Approved by: https://github.com/lezcano
ghstack dependencies: #118398
2024-01-30 22:18:15 +00:00
04ded1399d Fix signatures of torch.{add, sub, mul} (#118398)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118398
Approved by: https://github.com/lezcano
2024-01-30 22:18:15 +00:00
6ea233a14c [FSDP2] Added initial FSDPParamGroup, FSDPParam, ParamModuleInfo (#117867)
This PR adds the initial `FSDPParamGroup` and `FSDPParam` classes, and it focuses on the `ParamModuleInfo` data structure.

- `ParamModuleInfo` has the info needed to `setattr` a managed parameter, where it must account for shared parameters and shared modules.
    ```
    # Shared parameter
    lin1.weight = lin2.weight

    # Shared module
    mlp.lin1 = mlp.lin2
    ```
- In order for FSDP to find shared modules' parameters, we must use `remove_duplicate=False`. See https://github.com/pytorch/pytorch/pull/99448/ for the original context. Finding shared modules' parameters is not necessary for the `setattr` logic, but in case we need it in the future (like for the existing FSDP's state dict), we include that info for now.
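As a concrete illustration of that flag (a toy example, not FSDP code):

```python
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
mlp[1].weight = mlp[0].weight  # shared parameter

# The default de-duplicates, so the shared weight appears only once.
assert len(list(mlp.named_parameters())) == 3
# remove_duplicate=False reports it under every owning module.
assert len(list(mlp.named_parameters(remove_duplicate=False))) == 4
```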

With this PR, we see the general system architecture:
- 1 `module` : 1 `fully_shard`
- 1 `fully_shard` : 1 `FSDPParamGroup`
- 1 `FSDPParamGroup` : k `FSDPParam`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117867
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118525, #117814
2024-01-30 22:07:59 +00:00
ae6233ec47 [FSDP2] Added mesh arg, FSDPState, move to device (#117814)
Squashed to include https://github.com/pytorch/pytorch/pull/117861, https://github.com/pytorch/pytorch/pull/117852

---

This PR adds `_get_managed_modules()` to determine which modules a `fully_shard(module)` call manages. The rule is defined as:
> `fully_shard(module)` manages all modules in `module.modules()` except those already managed by a nested `fully_shard()` or a nested non-composable API (e.g. `replicate()` or TorchRec).

Practically, this can be implemented as a graph search from `module` that does not proceed into any module with `fully_shard` or a non-composable API applied. Because the non-composable APIs follow the same rule, this rule is correct inductively.
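A rough sketch of that search (the predicate and attribute names are assumptions for illustration, not the real FSDP2 internals):

```python
import torch.nn as nn

def _has_own_distributed_state(module: nn.Module) -> bool:
    # Hypothetical stand-in for "a nested fully_shard() or a non-composable API
    # (e.g. replicate()) has already been applied to this module".
    return getattr(module, "_distributed_state", None) is not None

def get_managed_modules(root: nn.Module) -> list:
    managed = []

    def visit(module: nn.Module) -> None:
        # Do not descend into modules that are already managed elsewhere.
        if module is not root and _has_own_distributed_state(module):
            return
        managed.append(module)
        for child in module.children():
            visit(child)

    visit(root)
    return managed
```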

---

This PR adds `_get_managed_states(managed_modules)` to return the managed parameters and buffers given the managed modules.
- Without an extra mechanism to ignore specific parameters or buffers, the rule currently is simply to get the directly managed state (i.e. parameters/buffers) from each managed module while de-duplicating shared ones.
- However, we prefer this translation from managed modules to managed states to accommodate ignoring specific states in the future (which has appeared in various open-source use cases).

---

This PR adds the `mesh` argument to `fully_shard` and some helper data structures specific to FSDP/HSDP that pre-compute useful info like rank/world size for each mesh dim.
- The `mesh` defines the FSDP/HSDP algorithm: a 1D mesh means FSDP, and a 2D mesh means HSDP, where we assume sharding on the last dimension (see the sketch after this list).
    - We can revisit the HSDP sharding-dim assumption if needed in the future.
- The default (if `mesh is None`) is that `fully_shard` calls `init_device_mesh` following the global process group.
- The helper data structures are the various `*MeshInfo`s. I included up through `HSDPMeshInfo`, even though it will not be used immediately, to show the spirit of it. We want to tag both the shard and replicate dims.
- The `mesh_info` variable in `fully_shard` is not used for now. It will be passed downstream in the future.
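To make the mesh-to-algorithm mapping concrete (assuming the public `init_device_mesh` helper; the shapes and dim names are illustrative):

```python
from torch.distributed.device_mesh import init_device_mesh

# (requires launching under torchrun with a matching world size)

# 1D mesh -> FSDP: shard across all 8 ranks.
fsdp_mesh = init_device_mesh("cuda", (8,))

# 2D mesh -> HSDP: replicate across the first dim, shard across the last dim.
hsdp_mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))
```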

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117814
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #118525
2024-01-30 22:05:16 +00:00
7aa4b35b75 [FSDP2][Reland] Introduced initial fully_shard frontend (#118525)
This PR introduces the initial `fully_shard` frontend without any distributed logic that will be built into per-parameter-sharding FSDP.
- We design `fully_shard` to be a _module-level_ API (taking in an `nn.Module`), e.g. as opposed to a tensor-level one.
- We define a `FSDP` class and use a dynamic class swap, setting `module.__class__` to a newly created class that subclasses `FSDP` and `type(module)`, to allow FSDP to override and add methods on the module.
    - We name this class as `FSDP<type(module)>`, e.g. `FSDPLinear` for `Linear`.
    - We disable the `deepcopy` because the state object inserted on the module will not be trivially `deepcopy`-able.
- Calling `fully_shard(module)` inserts a state object on `module` but not any of its children. This state object will be used for any FSDP-specific state.
- We raise an error on `ModuleList` or `ModuleDict` since they do not implement `forward()`, and FSDP will rely on `forward()` to insert logic (https://github.com/pytorch/pytorch/issues/113794).
- In the future, we will deprecate the existing `fully_shard` that calls into the same backend logic as `FullyShardedDataParallel` as there is no adoption for that and we prefer to reuse that name.

**Reland details:** I removed `test/distributed/_composable/fsdp/_test_fully_shard_common.py` and moved its contents to the existing `torch/testing/_internal/common_fsdp.py`, which is already a target for internal tests.

Differential Revision: [D53187509](https://our.internmc.facebook.com/intern/diff/D53187509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118525
Approved by: https://github.com/wanchaol
2024-01-30 22:05:16 +00:00
48f876143a Fix missing permission in create release workflow (#118681)
Fixes https://github.com/pytorch/pytorch/actions/runs/7715417683/job/21029944543
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118681
Approved by: https://github.com/clee2000, https://github.com/seemethere, https://github.com/atalman, https://github.com/malfet
2024-01-30 22:02:30 +00:00
1aa836f502 Dont fuse write into read if indexing differs (#118210)
Fix for https://github.com/pytorch/pytorch/issues/101950, https://github.com/pytorch/pytorch/issues/94693

Similar to inplacing a kernel, only fuse a write after a read of the same tensor if the write and read have the same indexing formula. I did a perf test and it was neutral.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118210
Approved by: https://github.com/jansel
2024-01-30 21:55:27 +00:00
82a7460b67 [quant][bc-breaking] Turn on fold_quantize by default (#118605)
Summary:
Previously, by default we didn't generate a quantized weight; that is, we'd have an fp32 weight and
`fp32 weight -> q -> dq -> linear -> ...` in the quantized model

After this PR, we'll produce a graph with int8 weight by default after convert_pt2e:
`int8 weight -> dq -> linear -> ...`

We'll remove the fold_quantize flag in the next PR

Test Plan: CI

Differential Revision: D51730862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118605
Approved by: https://github.com/andrewor14
2024-01-30 21:42:29 +00:00
ba1be17733 Remove voznesenskym from the list of autoreviewers (#118680)
Mitigates the failures of "Auto Request Review" workflow:
```
Requesting review to ezyang, albanD, miladm, voznesenskym, antoniojkim, SherlockNoMad
Error: HttpError: Reviews may only be requested from collaborators. One or more of the users or teams you specified is not a collaborator of the pytorch/pytorch repository.
```
https://github.com/pytorch/pytorch/actions/runs/7716852492/job/21034629665?pr=118669
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118680
Approved by: https://github.com/clee2000
2024-01-30 21:35:38 +00:00
f2682e75e6 Workaround for super() calls in test_torchinductor_dynamic_shapes (#118586)
Info about super in dynamic classes:
https://stackoverflow.com/questions/71879642/how-to-pass-function-with-super-when-creating-class-dynamically
https://stackoverflow.com/questions/43782944/super-does-not-work-together-with-type-supertype-obj-obj-must-be-an-i

Calling super(TestCase) actually calls TestCase's parent's functions, bypassing TestCase's own functions.

Mainly doing this because it's making the disable bot spam.

Test: checked locally and verified that https://github.com/pytorch/pytorch/issues/117954 actually got skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118586
Approved by: https://github.com/huydhn
2024-01-30 21:34:05 +00:00
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
e332653eb3 [inductor] Use at::detail::empty_strided_* in cpp_wraper mode (#118490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118490
Approved by: https://github.com/desertfire
2024-01-30 21:03:19 +00:00
1562dae62c [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
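For example, the kind of rewrite this rule performs (illustrative snippet, not taken from the diff):

```python
keys = ["alpha", "beta", "gamma"]

# Before: a comprehension that gives every key the same static value.
flags = {k: False for k in keys}

# After: dict.fromkeys makes the shared value explicit and is faster.
flags = dict.fromkeys(keys, False)
counts = dict.fromkeys(keys)  # value defaults to None
# Caveat: the value object is shared, so avoid mutable defaults like [] or {}.
```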

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
2024-01-30 20:46:54 +00:00
e33e88e5bc Add separate logging target for cudagraphs (#118329)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118329
Approved by: https://github.com/mlazos
2024-01-30 20:16:51 +00:00
e180218949 [c10d] Log the last enqueued and completed collective (#118582)
Summary:
During debugging of some timed-out jobs, I found it difficult to
identify which rank is at fault, even though we have logs of many ranks
reporting a timeout on a specific collective seq.

If we also report lastEnqueuedSeq and lastCompletedSeq, it becomes
much easier to identify:
1. whether a rank has not even joined a collective call (not enqueued), or
2. whether it has joined the collective call but not completed it.

The 1st case is most likely a problem in the user's code;
the 2nd case could be a lower-layer issue.

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118582
Approved by: https://github.com/wconstab
2024-01-30 20:13:55 +00:00
9247641f34 [PT-Vulkan] aten::unsqueeze - nit optimization (#118575)
Summary:
While learning Vulkan shaders, I realized one of the branches can be easily optimized.

The relevant branch is only taken when we unsqueeze along `dim == 1` for 3D tensors.
1. There's an unnecessary for-loop.
2. There's an unnecessary dependency on the output tensor's number of channels.

## CPU Tensor
```
3D->4D: (c, h, w) -> (c, 0, h, w)
```
## GPU Texture
```
3D->4D: (w, h, c/4)[c%4] -> (w, h, c)[0]
```

Note the GPU Texture's output is always at `[0]` and the output tensor's number of channels is always 1.

We are currently writing the same value `v[p]` to all elements of the texel `out_texel`, but we need only write it to `out_texel[0]`:

Test Plan:
```
[jorgep31415@161342.od /data/sandcastle/boxes/fbsource (ca3b566bc)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*unsqueeze*"
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
Buck UI: https://www.internalfb.com/buck2/2c7f1365-e004-41a0-9201-473929a2738a
Network: Up: 174B  Down: 0B  (reSessionID-c54d25da-f44b-49f7-8bfd-1db4eee50f6d)
Jobs completed: 6. Time elapsed: 14.4s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *unsqueeze*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.unsqueeze_0dto1d_dim0
[       OK ] VulkanAPITest.unsqueeze_0dto1d_dim0 (60 ms)
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim0
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim0 (0 ms)
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim1
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim1 (132 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim0
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim0 (20 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim1
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim1 (66 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim2
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim2 (3 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim0
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim0 (19 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim1
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim1 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim2
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim2 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim3
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim3 (1 ms)
[----------] 10 tests from VulkanAPITest (307 ms total)

[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (307 ms total)
[  PASSED  ] 10 tests.
[
```

Differential Revision: D53189637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118575
Approved by: https://github.com/yipjustin
2024-01-30 20:01:18 +00:00
suo
d0627cc2af [export] do not rewrite state dict when unlifting (#118611)
This is Very Bad; changing state dict keys violates one of the key contracts we have, which is "do not mess with the state dict".

Change unlift to use a similar `_assign_attr` approach that fx.GraphModule and unflatten do.

Also took the opportunity to improve the interface of `_assign_attr` to be more general.

Differential Revision: [D53139277](https://our.internmc.facebook.com/intern/diff/D53139277/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118611
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607, #118608, #118609, #118610
2024-01-30 19:14:19 +00:00
suo
be90ab7efd [export] do not unlift cond/map submodules (#118610)
I don't think we should be unlifting HOO submodules.

What is the contract of unlifting? It is: restore the original calling convention of the module, undoing the transformation in which we lift parameters, buffers, and constants to inputs in the graph.

Unlifting does *not* make any guarantees about what's going on inside the module. It's still a flat module. So why should we unlift the cond/map submodules? It doesn't have anything to do with the contract stated above; it's some internal stuff that doesn't affect how the module will be called.

Further, this code as written modifies the state dict, adding a new buffer that is actually a duplicate of a previous buffer. Modifying the state dict from the original eager module is never correct.

Differential Revision: [D53160713](https://our.internmc.facebook.com/intern/diff/D53160713/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118610
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607, #118608, #118609
2024-01-30 19:14:18 +00:00
suo
4ee8aa6028 [export] adopt KeyPath API in nonstrict mode (#118609)
This PR rewrites two paths to use the newly-added keypaths API in pytree:
First: we were hand-rolling a tree_map during fakification because we wanted to track sources. This PR uses keypaths instead, which can do the same thing without needing custom code.

Second: our constraint error formatting was referencing placeholder names in error messages. These placeholder names are not otherwise user-visible, so they are super confusing to users (e.g. "which input does arg1_3 correspond to?"). This diff uses the `keystr` API to format the error message.

This necessitated some small refactors—generating the keystr is expensive so doing it in an f-string was very bad.

It can also be further improved—we can inspect the signature so that instead of `*args[0]` we can give people the actual argument name, which would be the ideal UX. But leaving that for later.
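For readers unfamiliar with the keypath API, a small usage sketch (this assumes the `torch.utils._pytree` keypath helpers; exact names and signatures may differ by version):

```python
from torch.utils._pytree import tree_flatten_with_path, keystr

example_inputs = ((1, 2), {"scale": 3.0})
leaves_with_paths, _spec = tree_flatten_with_path(example_inputs)
for path, leaf in leaves_with_paths:
    # keystr renders a key path as a readable accessor such as "[0][1]" or
    # "[1]['scale']", which is far friendlier than a placeholder like arg1_3.
    print(keystr(path), leaf)
```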

Differential Revision: [D53139358](https://our.internmc.facebook.com/intern/diff/D53139358/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118609
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607, #118608
2024-01-30 19:14:11 +00:00
suo
ca090b2c77 [export] do not use tree_flatten_spec (#118608)
tree_flatten_spec is bad; it isn't synced up with `register_pytree_node` so it will not handle arbitrary custom pytrees. It's also not really maintained.

We only use it for two purposes:
- To retain kwarg ordering stability, so that if the user passes in kwargs in a different order things will still work.
- To do "structural" checks that ignore types.

In both cases, tree_flatten_spec is probably *not* the ideal way to implement the desired behavior.

## kwargs ordering
- tree_flatten_spec overwrites the behavior of ALL dictionaries, not just kwargs. This is not correct: dictionary ordering is meaningful in Python, and it's pretty trivial to write a program that relies on dict ordering.
- For kwargs, we do sort of expect that the order in which arguments are passed shouldn't matter. BUT there is one exception: `**kwargs`. In fact, [PEP 468](https://peps.python.org/pep-0468/) was introduced specifically to clarify that ordering does matter when the function being called uses `**kwargs`.

In this diff I introduce a utility function that *only* reorders kwargs. This gets us most of the way to correct—dicts are no longer reordered, but kwargs can be passed in any order.

A "fully correct" solution would need fix the corner case from PEP468. We could detect whether the top-level fn being traced uses `**kwargs` (via `inspect`), then serialize a flag for it. In ExportedProgram, we would check that flag and only re-order if `**kwargs` was unused; otherwise error if the key order doesn't match. This is a super corner case though, so I'll file it as a followup task.

## structural equivalence checking

This is another use case where, again, `tree_flatten_spec` is too broad. Generally we want to treat precisely two types as the same, not override comparison behavior in general. So I introduce an `is_equivalent` util for this purpose.

Differential Revision: [D53168420](https://our.internmc.facebook.com/intern/diff/D53168420/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118608
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607
2024-01-30 19:14:04 +00:00
bc9642f578 Skip more tests under rocm (#118624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118624
Approved by: https://github.com/aakhundov
2024-01-30 19:06:06 +00:00
e6e7d7f26b [pt-vulkan] Introduce MemoryAllocation class and enable deferred allocation and resource aliasing (#118436)
## Context

This changeset is part of a stack that enables memory planning (i.e. sharing memory between intermediate tensors) in the PyTorch Vulkan Compute API. Note that Memory Planning can only be used via the ExecuTorch delegate (currently a WIP) and not Lite Interpreter (which does not collect metadata regarding tensor lifetimes).

This changeset enables [resource aliasing](https://gpuopen-librariesandsdks.github.io/VulkanMemoryAllocator/html/resource_aliasing.html), a technique that allows two resources (i.e. `VkImage`s or `VkBuffer`s) to bind to the same memory allocation. This is the core feature that allows memory planning to be implemented in PyTorch Vulkan.

## Notes for Reviewers

At a high level, this changeset introduces the `MemoryAllocation` struct which represents a raw `VmaAllocation`. `VulkanImage` and `VulkanBuffer` have been updated to store a `MemoryAllocation` member instead of the raw handle of a `VmaAllocation`.

`vTensor`, `VulkanImage`, and `VulkanBuffer` constructors now have an `allocate_memory` argument which controls whether memory should be allocated on construction. If `false`, then memory must be allocated separately and bound later using `bind_allocation()` before the resource can be used.

Internal:

## Notes for Internal Reviewers

Please refer to [this design doc](https://docs.google.com/document/d/1EspYYdkmzOrfd76mPH2_2BgTDt-sOeFnwTkV3ZsFZr0/edit?usp=sharing) to understand how memory planning will work end-to-end in the Vulkan Delegate.

Differential Revision: [D53136249](https://our.internmc.facebook.com/intern/diff/D53136249/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118436
Approved by: https://github.com/yipjustin
2024-01-30 19:03:55 +00:00
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45ef53747e2eefffd65d91ce840b431b.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
suo
6511811ebb [export] preserve metadata during nonstrict tracing (#118607)
Previously, nonstrict tracing would wipe metadata of graphmodules, because the wrapper class we're using was not detected as a graphmodule and thus meta preservation was not turned on

Differential Revision: [D53139354](https://our.internmc.facebook.com/intern/diff/D53139354/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118607
Approved by: https://github.com/zhxchen17
2024-01-30 18:27:52 +00:00
644f64f2d1 [c10d] added docstrings and tests for src / dst (#118593)
Follow-up to https://github.com/pytorch/pytorch/pull/118359: whether ``src`` and ``dst`` are based on the global pg or a sub pg
* update c10d docstring: ``src`` / ``dst`` are based on the global pg regardless of the ``group`` argument
* communication ops with ``dst`` argument: ``reduce``, ``gather_object``, ``gather``, ``send``, ``isend``
* communication ops with ``src`` argument: ``irecv``, ``recv``, ``broadcast``, ``broadcast_object_list``, ``scatter``, ``scatter_object_list``
* ``pytest test/distributed/test_c10d_nccl.py -k subgroup``

3 collectives are for picklable objects (``gather_object``, ``broadcast_object_list``, ``scatter_object_list``). There are 2 ways to set the device:
* use the device argument: it's implemented in ``broadcast_object_list``; maybe worth implementing in the other 2
* ``torch.cuda.set_device(global_rank)``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118593
Approved by: https://github.com/wconstab
2024-01-30 17:47:58 +00:00
19e8ba95e5 [RELAND] Remove deprecated fbgemm operators (#112153)
These operators are not used and have been deprecated since #72690
(Feb 2022).

BC-breaking message:

`TorchScript` models that were exported with the deprecated
`torch.jit.quantized` API will no longer be loadable, as the required
internal operators have been removed.
Please re-export your models using the newer `torch.ao.quantization` API
instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112153
Approved by: https://github.com/jerryzh168
2024-01-30 16:32:37 +00:00
2327879fb6 Add lowering to special.bessel_j0 (2nd try) (#118565)
This PR is a copy of https://github.com/pytorch/pytorch/pull/118464 that was merged without using pytorchbot. Sorry for the noise!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118565
Approved by: https://github.com/peterbell10
2024-01-30 15:26:59 +00:00
fbf92500fb enable privateuseone to perform streaming backward (#117111)
Fixes #116957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117111
Approved by: https://github.com/soulitzer
2024-01-30 15:13:31 +00:00
15702a8027 Fix lint after #118533 (#118633)
Fixes lint after https://github.com/pytorch/pytorch/pull/118533
Adds an ignore for ``possibly-undefined`` in more places

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118633
Approved by: https://github.com/DanilBaibak
2024-01-30 14:07:16 +00:00
827949cef2 accelerate binary_cross_entropy_with_logits by using log_sigmoid operator (#115539)
When I was reimplementing BCEWithLogits, I found that the `log_sigmoid` operator could accelerate the function.

Simple benchmark on AMD 3600 CPU Ubuntu 22.04:
|avg time (ms)|with `pos_weight`|no `pos_weight`|
|-|-|-|
|original|1986|1658|
|this PR|1295|995|

This is 35-40% faster, which probably benefits from the `log_sigmoid` vectorization code.

A CUDA benchmark was not obtained, but I believe CUDA can also benefit from reducing kernel launches, as https://github.com/pytorch/pytorch/pull/11054#issuecomment-442233714 and https://github.com/pytorch/pytorch/pull/78267#issue-1248398454 mention.
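The identity that makes the rewrite possible, checked numerically (a quick sanity check, not the kernel implementation):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1000)
target = torch.rand(1000)

reference = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
# log(sigmoid(x)) = logsigmoid(x) and log(1 - sigmoid(x)) = logsigmoid(-x),
# so the loss needs only two log_sigmoid calls and no separate exp/log.
via_logsigmoid = -(target * F.logsigmoid(logits) + (1 - target) * F.logsigmoid(-logits))

torch.testing.assert_close(reference, via_logsigmoid)
```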

The simple benchmark cpp file:
[demo.txt](https://github.com/pytorch/pytorch/files/13635355/demo.txt)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115539
Approved by: https://github.com/malfet
2024-01-30 13:24:13 +00:00
e5bb527d3e [inductor][cpp] support scalar value in vec reduction (#118511)
Fix https://github.com/pytorch/pytorch/issues/118379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118511
Approved by: https://github.com/leslie-fang-intel, https://github.com/lezcano, https://github.com/jansel
2024-01-30 13:07:43 +00:00
91690983ff [easy] Faster empty LIST_LENGTH guard (#118542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118542
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-01-30 13:02:18 +00:00
64efec9953 Port FakeProcessGroup to cpp (#118426)
### Summary
Native functional collective ops require the backend to be implemented in C++. Porting `FakeProcessGroup` to cpp so that it can also work for native functional collective ops.

The existing tests involving `FakeProcessGroup` all pass. In addition, `DeviceMeshTest::test_fake_pg_device_mesh` now passes with `_USE_NATIVE_C10D_FUNCTIONAL=1`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118426
Approved by: https://github.com/wanchaol
ghstack dependencies: #113057
2024-01-30 11:40:13 +00:00
da0635d17c Add pytorch-distributed justknobs helper (#118568)
Summary:
Sets up a helper that checks any JKs relevant to pytorch distributed,
and propagates their values to ENV.

Test Plan: Added unit test

Differential Revision: D53192406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118568
Approved by: https://github.com/zdevito
2024-01-30 08:13:52 +00:00
3ecc2f3a0d [PT2][Runtime Numeric Check] Fix compatibility issue (#118578)
Summary: Titled

Test Plan: WIP

Differential Revision: D53196722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118578
Approved by: https://github.com/jackiexu1992
2024-01-30 08:04:27 +00:00
b7c8485704 refactor mm_plus_mm check to pattern match (#118456)
Fixes #103101

replace #103253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118456
Approved by: https://github.com/jansel
2024-01-30 07:48:06 +00:00
c7af626a26 [c10d] allow nonblocking wrap of ncclCommInitRankConfig (#118256)
resolve #117749

Summary:
Updated the PR with the following intentions:

1. identify eagerMode init (as opposed to lazy init), in which case we will create NCCL comms without guarantees that they are fully initialized if NONBLOCKING mode is also enabled.
2. Python users can do other work (e.g., model init) between invoking init_process_group and their first collective call.
3. c10d would guarantee/wait for communicators to be initialized before issuing the first collective call.
4. For NCCL collective calls, the contract between Python users and c10d is not changed much from blocking calls (c10d would wait for the NCCL call to return ncclSuccess, or time out, whichever happens first).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118256
Approved by: https://github.com/kwen2501
2024-01-30 06:23:20 +00:00
e632d0c0dc Break Triton MutationTests to one kernel per test (#118553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118553
Approved by: https://github.com/aakhundov
ghstack dependencies: #118588
2024-01-30 06:17:55 +00:00
eqy
4a48899b6e [CUDA][complex] Define LIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS in CUDA build (#117061)
An upcoming CUDA release will migrate to CCCL, and we need this define to preserve current complex behavior: https://nvidia.github.io/libcudacxx/standard_api/numerics_library/complex.html

CC @miscco @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117061
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-01-30 06:11:31 +00:00
c203d88795 Skip mutation tests on rocm (#118588)
Fixes #118585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118588
Approved by: https://github.com/aakhundov, https://github.com/jansel
2024-01-30 05:46:54 +00:00
fe07851173 [CUDA][TF32][functorch] Also disable TF32 for vjp and jvp tests (#118592)
CC @zou3519
Appears to be the same issue as https://github.com/pytorch/pytorch/issues/86798
Seen surfacing on >= sm80

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118592
Approved by: https://github.com/zou3519
2024-01-30 05:34:20 +00:00
8be6dee14b [inductor] Fix codegen bug with Native Triton kernels with ReinterpretView args (#118569)
Summary:
### Context

It's possible for the args of a user-defined Triton kernel to be codegen-ed twice. But this only happens if the arg is a `ReinterpretView`.
* First via `arg.codegen_reference()` in `define_user_defined_triton_kernel()`
* Second in `self.codegen_kwargs()`.

When using `abi_compatible=True`, the duplicate codegen will look like the code below. The issue in the code is that one of the Tensors, internal to the graph, isn't properly freed. This scenario was eventually exposed as a memory leak when we re-ran an AOTInductor model many times and observed `memory.used` increase after each iteration.
```
auto tmp_tensor_handle_0 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
auto tmp_tensor_handle_1 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
...
// There's no wrap_with_raii_handle_if_needed() for tmp_tensor_handle_0.
// And there's no reference to tmp_tensor_handle_0.
// Thus, tmp_tensor_handle_0 is left as an AtenTensorHandle which isn't
// automatically cleaned-up like RAIIAtenTensorHandle
CUdeviceptr var_6;
aoti_torch_get_data_ptr(wrap_with_raii_handle_if_needed(tmp_tensor_handle_1), reinterpret_cast<void**>(&var_6));
void* kernel_args_var_2[] = {..., &var_6, ...};
launchKernel(kernels.add_kernel_0, ..., kernel_args_var_2);
```

### Solution
We just need the arg's buffer name when creating the `TensorArg` in `define_user_defined_triton_kernel()`. Thus, just return the buffer's name and avoid any potential side-effects with `arg.codegen_reference()`.

Test Plan:
### Inspect device memory allocated
```
# Before diff
0 device memory 2048
1 device memory 2560
2 device memory 3072
3 device memory 3584
4 device memory 4096
5 device memory 4608

# With diff (memory usage doesn't grow)
0 device memory 1536
1 device memory 1536
2 device memory 1536
3 device memory 1536
4 device memory 1536
5 device memory 1536
```

Reviewed By: jingsh, tissue3

Differential Revision: D53190934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118569
Approved by: https://github.com/oulgen
2024-01-30 05:19:32 +00:00
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
5dfcf07449 Reland PR117393 [inductor/fb] log config dict when compilation finishes (#118552)
Summary: Reverted due to merge conflict

Differential Revision: D53188124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118552
Approved by: https://github.com/mengluy0125
2024-01-30 04:34:22 +00:00
dcc077eea2 [executorch hash update] update the pinned executorch hash (#118594)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118594
Approved by: https://github.com/pytorchbot
2024-01-30 03:49:49 +00:00
0d47f6a44f [ez][inductor] fix a typo in should_pad_bench (#118598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118598
Approved by: https://github.com/eellison
2024-01-30 03:49:44 +00:00
135f785d77 [audio hash update] update the pinned audio hash (#118338)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118338
Approved by: https://github.com/pytorchbot
2024-01-30 03:44:00 +00:00
ff0cb38693 [vision hash update] update the pinned vision hash (#118340)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118340
Approved by: https://github.com/pytorchbot
2024-01-30 03:15:16 +00:00
2eefbc02a0 [ez] Discover tests without importing torch (#118574)
Moves test discovery into a file that doesn't import torch, so test listing can be done without having torch installed.

Helpful when you don't have torch installed (aka me when I'm feeling lazy)
I want to move TD into its own job that doesn't need to wait for the build to finish, so this is part of that.

The first commit is nothing more than a copy-paste of the selected functions/vars into a new file; the second commit has various changes that should be checked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118574
Approved by: https://github.com/huydhn
2024-01-30 03:02:29 +00:00
eb9905be5d [export] Remove the branch for skipping verifier. (#118139)
Summary:
We used to skip the verifier when the signature object is not the "correct" one (usually from some deprecated frontend). This was very useful when we wanted to pay a small cost to enable the verifier path to be called everywhere for torch export.

Now I believe no tests are relying on this behavior so we should remove this weird branch.

Test Plan: CI

Differential Revision: D53024506

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118139
Approved by: https://github.com/suo
2024-01-30 02:58:03 +00:00
b778f44e97 Allow using native c10d_functional via _functional_collectives (#113057)
This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collective` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification.

NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057
Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol
2024-01-30 02:34:25 +00:00
126c1621ce Add Support for CausalBias to torch compile (#116071)
Fixes #115363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116071
Approved by: https://github.com/mlazos
2024-01-30 02:22:48 +00:00
67d8db9252 Remove semicolon after return_from_mutable_noop_redispatch (#118538)
[`return_from_mutable_noop_redispatch`](65f8276bc6/torchgen/gen_functionalization_type.py (L477)) calls
[`return_str`](65f8276bc6/torchgen/gen_functionalization_type.py (L159-L166)). `return_str`'s output includes `;` so I think the semicolon after the callsite of `return_from_mutable_noop_redispatch` is not needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118538
Approved by: https://github.com/colesbury
2024-01-30 02:22:42 +00:00
0ed24cb1af [export] comments about runtime_var_to_range. (#118539)
Summary: Add some comments in case we forget what runtime_var_to_range means

Test Plan: eyes

Differential Revision: D53186114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118539
Approved by: https://github.com/suo
2024-01-30 02:07:34 +00:00
b1f8b6b8fc Forward Fix accidental removal of import (#118572)
Summary:
This Diff is a forward fix for this PR: https://github.com/pytorch/pytorch/pull/114689

Where I accidentally removed the old import from backends/cuda.

Test Plan: Verified on the failing revert diff; it did indeed fix the issue.

Reviewed By: DanilBaibak

Differential Revision: D53193454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118572
Approved by: https://github.com/DanilBaibak
2024-01-30 02:07:19 +00:00
460950d3aa [Nested Tensor] Support ragged_idx != 1 on aten::is_same_size, aten::_to_copy (#118442)
is_same_size is needed internally; `_to_copy` should be easy because it doesn't support new layouts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118442
Approved by: https://github.com/cpuhrsch
2024-01-30 01:32:51 +00:00
6c9f72156e Fix constant folding bug with sym size tensor (#118411)
When a constant-folded SymInt was used to construct a tensor that was then constant folded, we had previously tried to use the sympy symbol, which would error (the construction should take a SymInt, not a symbol).

Fix by recording the observed size during constant folding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118411
Approved by: https://github.com/ezyang
2024-01-30 01:26:51 +00:00
aef820926c Add some tests for 3d channels last (#118283)
Part of a multi-PR work to fix #59168.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118283
Approved by: https://github.com/albanD
2024-01-30 01:26:47 +00:00
bacbad5bc9 add GradScaler on CPU (#109993)
Step 2 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109993
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-29 23:42:35 +00:00
796d270392 [easy] Fix small typo in register_state_dict_pre_hook doc (#118571)
Fixed https://github.com/pytorch/pytorch/pull/112674#issuecomment-1912849827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118571
Approved by: https://github.com/janeyx99, https://github.com/albanD
2024-01-29 23:18:12 +00:00
413a434846 [export] Convert all export tests to .module() (#118425)
Test Plan: CI

Differential Revision: D53075379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118425
Approved by: https://github.com/suo
2024-01-29 23:06:54 +00:00
ca7cbf1226 Add memory_format to typehints of Tensor.cpu and Tensor.cuda (#118392)
Fixes #118501

which makes mypy complain if users pass memory_format to Tensor.cpu/Tensor.cuda in their code.

This adds the missing memory_format to the type hints of both functions.
I believe there is no test infrastructure for type hints....
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118392
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-01-29 22:56:34 +00:00
e1cbf6dff5 Use SEQUENTIAL posix_fadvise on mmapped files (#117805)
In theory this tells the system that we will access the file sequentially which allows prefetching future blocks. In practice it doubles the read-ahead size on Linux (which effectively doubles the read sizes).
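For reference, the same hint issued from Python (the actual change lives in the C++ MapAllocator; the file name below is just an example):

```python
import os

fd = os.open("model.safetensors", os.O_RDONLY)
# Tell the kernel the file will be read sequentially; on Linux this roughly
# doubles the read-ahead window for this descriptor.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
```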

Without this, CUDA uploads of files that aren't already in FS cache, using mmapped files (safetensors) as source, run at ~1 GB/s (from an SSD that has ~7 GB/s read speed...).

With this, they run at ~1.5 GB/s which is still bad but better than before!

It is possible to increase the read performance further by touching the pages from multiple threads; in fact, when the tensors loaded from the file are used by the CPU, we get fairly good load performance (~5 GB/s), which appears to be because multiple threads page fault and trigger more concurrent reads which improves SSD read throughput... however, this is not the case for CUDA uploads, and it is difficult to make that change in a generic way because it's unclear what the usage pattern of the input file is going to be.

All of the numbers above are taken on Samsung 990 Pro SSD, on Linux kernel 6.5 with FS cache cleared between every attempt to load a file. The file is loaded via `safetensors.safe_open` which uses UntypedTensor.from_file to load the file into memory, which in turn uses MapAllocator.cpp.

I felt safe doing this change unconditionally but please let me know if you'd like to see a separate allocator flag for this, propagated through to UntypedTensor. Note that the fadvise API is not available on macOS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117805
Approved by: https://github.com/mikaylagawarecki
2024-01-29 22:38:00 +00:00
67c6152f4e [HigherOrderOp] support while_loop in dynamo (#116913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116913
Approved by: https://github.com/zou3519
2024-01-29 22:32:32 +00:00
e3d7a19f73 [CI] add wait for /orig branch in mergeability check (#118576)
---

Test runs:
* [happy path](https://github.com/pytorch/pytorch/actions/runs/7702614677/job/20991275431?pr=118576) (this PR)
* [waiting for the hardcoded branch name](https://github.com/izaitsevfb/pr-head-test/actions/runs/7702386966/job/20990584514#step:3:33) in a separate repo (step succeeded after the branch was manually pushed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118576
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-29 22:10:50 +00:00
a40be5f4dc Autograd doc cleanup (#118500)
I don't think we'll realistically go through deprecation for these now, since there are a couple of uses of each online. So document appropriately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118500
Approved by: https://github.com/soulitzer
2024-01-29 21:51:33 +00:00
fc5cde7579 [dynamo] constant fold torch.cuda.get_device_properties to avoid graph break (#118422)
Before this PR, we had a graph break for code like this:
```python
    def test_get_device_properties_tensor_device(a):
        x = a.to("cuda")
        prop = torch.cuda.get_device_properties(x.device)
        if prop.major == 8:
            return x + prop.multi_processor_count
        return x + prop.max_threads_per_multi_processor
```
This PR constant folds the torch.cuda.get_device_properties and we'll get a following dynamo graph:
```python
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]     def forward(self, L_a_ : torch.Tensor):
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         l_a_ = L_a_
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:544 in test_get_device_properties_tensor_device, code: x = a.to("cuda")
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         x = l_a_.to('cuda');  l_a_ = None
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:547 in test_get_device_properties_tensor_device, code: return x + prop.multi_processor_count
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         add = x + 108;  x = None
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         return (add,)
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
```

The signature of get_device_properties is:
```python
def get_device_properties(device: _device_t) -> _CudaDeviceProperties:
```
I think it's safe to constant fold get_device_properties():
1. torch.cuda.get_device_properties(tensor.device). In this case, tensor.device.index is guarded in _check_tensor
2. torch.cuda.get_device_properties(device_int_id). We don't expect the GPU properties for a particular index to change during a torch.compile run, and it makes sense to specialize the properties for a concrete device_int_id.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118422
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-01-29 20:26:40 +00:00
f99adbb4ec [inductor] Remove ROCm xfail on test_cum{sum,prod}_zero_dim (#118558)
Fixes #118540, fixes #118541

Since the zero-dim case reduces to a pointwise operation, we don't fallback on
ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118558
Approved by: https://github.com/malfet
2024-01-29 20:23:40 +00:00
6591741183 [dynamo] support inference_mode with no arguments (#118427)
Before this PR, we got an error for the following code:
```python
def k(x):
    with torch.inference_mode():
        x = x + 1
        return x

torch.compile(k, backend="eager", fullgraph=True)(x)
```
error message:
```
Traceback (most recent call last):
....
    return InferenceModeVariable.create(tx, args[0].as_python_constant())
torch._dynamo.exc.InternalTorchDynamoError: list index out of range
```

This PR supports the case where torch.inference_mode is not given any argument (i.e., it defaults to True).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118427
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-01-29 20:20:26 +00:00
e0d04b7119 [Caffe2] Fix bug in str on wide types (#117531)
Summary:
The current implementation of `str` passes wide types (`wchar_t`, `wchar_t*`, `std::wstring`) directly to `std::ostringstream`. This has the following behavior:

 - C++17, `wchar_t` & `wchar_t *`: print the integer representation of the character or the pointer. This is unexpected and almost certainly a (runtime) bug.
 - C++17, `std::wstring`: compile-time error.
 - C++20, all of the above: compile-time error.

To fix the bug and to enable C++20 migration, this diff performs narrowing on these wide types (assuming UTF-16 encoding) before passing them to `std::ostringstream`. This fixes both the C++20 compile time errors and the C++17 runtime bugs.

This bug surfaced in enabling C++20 windows builds, because windows specific caffe2 code uses `TORCH_CHECK` with wide strings, which references `str` for generating error messages.

Test Plan: CI & https://godbolt.org/z/ecTGd8Ma9

Differential Revision: D52792393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117531
Approved by: https://github.com/malfet
2024-01-29 20:11:37 +00:00
68b18dc2a2 [DeviceMesh] Removed print of self._dim_group_infos (#118527)
This print seems to have accidentally been merged in. It is a bit verbose during unit tests, so this PR removes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118527
Approved by: https://github.com/wz337
2024-01-29 19:14:25 +00:00
bb55970e5b Revert "Add justknobs env helper for pytorch distributed (#118451)"
This reverts commit 4d1bb2175a49e9b4440085a3dc2e2b211e5cf99e.

Reverted https://github.com/pytorch/pytorch/pull/118451 on behalf of https://github.com/wconstab due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/118451#issuecomment-1915369013))
2024-01-29 19:01:05 +00:00
0288db3120 [DCP] Removes Checkpoint Wrapped Prefix from state dict fqns (#118119)
Fixes #117399

~~Soliciting some early feedback here.~~

~~Do we happen to know if there already some tests that cover this case or would it make sense to add? @fegin , @wz337~~

Edit: Added tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118119
Approved by: https://github.com/fegin
2024-01-29 18:18:52 +00:00
fb11354594 Revert "[c10d] relax the nccl error check for nonblocking mode (#118254)"
This reverts commit 993e4f3911856be3a93746f6ed6a13f25de6ff65.

Reverted https://github.com/pytorch/pytorch/pull/118254 on behalf of https://github.com/clee2000 due to has internal failures D53170606 ([comment](https://github.com/pytorch/pytorch/pull/118254#issuecomment-1915267786))
2024-01-29 17:56:40 +00:00
3011a4406f [BE][GHF] Do not hardcode default branch name (#118530)
Instead rely on `GitHubPR.default_branch()` which is the name of the repo's default branch.

Do not pass branch name `merge_changes` is called, as it is set to default branch inside the function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118530
Approved by: https://github.com/clee2000
2024-01-29 17:18:23 +00:00
65f8276bc6 add an option to specify custom addr2line binary (#118328)
There is a need for users to pick their own addr2line binary in their deployment, for reasons such as the default addr2line being too slow. This option lets users quickly experiment with other alternatives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118328
Approved by: https://github.com/zdevito, https://github.com/aaronenyeshi
2024-01-29 16:36:38 +00:00
abe3c55a6a Update DDP dynamo debug docs (#118295)
Refreshes https://github.com/pytorch/pytorch/pull/114201 and updates it to include other log names that also include ddp_optimizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118295
Approved by: https://github.com/LucasLLC, https://github.com/wanchaol
2024-01-29 14:58:26 +00:00
f9971daaee Fix divergence between internal + external (#118509)
D53049807 and https://github.com/pytorch/pytorch/pull/118197 got out of sync somehow

Fixing externally since I'm pretty sure the internal version is correct

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118509
Approved by: https://github.com/malfet
2024-01-29 14:53:50 +00:00
04c1df651a [inductor][cpp] enable vectorization with constant bool (#118380)
Related models: DebertaForQuestionAnswering, etc. For DebertaForQuestionAnswering, single thread, measured on ICX:
Before: 0.990x, After: 1.043x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118380
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-01-29 13:31:22 +00:00
ee3dfbbe47 [Inductor] Fix Argmax codegen with Nan input (#118358)
**Summary**
Fixes https://github.com/pytorch/pytorch/issues/118266: currently `torch.argmax` and `torch.argmin` return different values under eager and the Inductor cpp backend when the input contains `NaN` values. Align the cpp backend results with eager by reusing the compare function.

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_argmin_cpu_only
python -u -m pytest -s -v test_cpu_repro.py -k test_argmax_argmin_with_nan_value
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118358
Approved by: https://github.com/lezcano, https://github.com/jgong5, https://github.com/jansel
2024-01-29 09:09:46 +00:00
41dfdde9f5 Handle some numpy functions with out arguments correctly in dynamo (#118248)
Dynamo creates Tensors when tracing through numpy ufuncs like np.sin, np.minimum, etc., so under `torch.compile` np functions generally return Tensors at runtime. However, when normalizing `out` arguments we currently require the input to be an ndarray. This causes assertion errors when running torch.compile on any numpy function with an out argument:
```
    def test_numpy_ufunc_out(self):
        @torch.compile(backend="eager")
        def foo():
            x = np.arange(5)
            out = np.empty((x.shape[0], x.shape[0]))
            res_out = np.sin(x, out=out)
            assert res_out is out
        foo()
```
Failure with stack trace: https://gist.github.com/jamesjwu/68e217638d735678b3de968584dba23f

Instead, we can wrap tensors in an ndarray in normalize_outarray to handle the case correctly. Fixing this resolves ~220 tests under dynamo_test_failures, but also exposes a followup bug.

In the presence of a graph break, ndarrays don't preserve their id, which can affect assertions and `is` checks between numpy arrays:
```
     def test_x_and_out_broadcast(self, ufunc):
        x = self.get_x(ufunc)
        out = np.empty((x.shape[0], x.shape[0]))

        x_b = np.broadcast_to(x, out.shape)
        # ufunc is just np.sin here
        res_out = ufunc(x, out=out)
        res_bcast = ufunc(x_b)
        # passes
        assert res_out is out
        graph_break()
        # fails
        assert res_out is out
```
Regular tensors preserve their id because Dynamo caches their example tensor values across a graph break. However, with ndarrays, we only store their converted tensor values, and construct new ndarrays around those values:
eebe7e1d37/torch/_dynamo/variables/builder.py (L1083)
Added a test with expected failure to showcase this — we can then fix that issue separately.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118248
Approved by: https://github.com/lezcano
2024-01-29 09:09:21 +00:00
4d1bb2175a Add justknobs env helper for pytorch distributed (#118451)
Summary:
Adds a JK killswitch check and configures the env for enabling pytorch
nccl flight recorder.  Note- this only enables recording events in memory, not
dumping them.

Test Plan: CI test

Reviewed By: zdevito

Differential Revision: D52920092

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118451
Approved by: https://github.com/malfet
2024-01-29 08:57:16 +00:00
41902a6ebc [dynamo] Optimize is_tracing checks (#118474)
benchmarks/dynamo/microbenchmarks/overheads.py
- before: 10.4us
- after: 9.9us

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118474
Approved by: https://github.com/yanboliang
2024-01-29 08:31:26 +00:00
eba240afcb Revert "[FSDP2] Introduced initial fully_shard frontend (#117776)"
This reverts commit 316579e30ce820cb5f431e6bb816a882db918b38.

Reverted https://github.com/pytorch/pytorch/pull/117776 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/117776#issuecomment-1914121167))
2024-01-29 07:38:41 +00:00
e6f3a4746c include a print for _get_cuda_arch_flags (#118503)
Related to #118494: it is not clear to users that the default behavior is to include **all** feasible archs (if `TORCH_CUDA_ARCH_LIST` is not set).

In these scenarios, a user may experience a long build time. This adds a print statement to reflect that behavior. [A `verbose` arg is not available, and it does not feel necessary to add one to this function and all its parent functions...]
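A rough sketch of the intended behavior (illustrative only; the helper name and message are not the exact `torch.utils.cpp_extension` code):

```python
import os

def _maybe_warn_about_arch_list(supported_arches):
    # If the user did not set TORCH_CUDA_ARCH_LIST, every feasible arch is
    # targeted, which can make the extension build noticeably longer.
    if os.environ.get("TORCH_CUDA_ARCH_LIST") is None:
        print(
            "TORCH_CUDA_ARCH_LIST is not set; compiling for all supported "
            f"architectures: {supported_arches}. This may slow down the build."
        )

_maybe_warn_about_arch_list(["8.0", "8.6", "9.0"])
```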

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118503
Approved by: https://github.com/ezyang
2024-01-29 07:03:56 +00:00
47b5a6b05d [Dynamo] Analyze triton kernels via tracing to determine mutations (#117300)
This PR adds TTIR lexing and parsing in order to analyze which of the user defined triton kernel inputs are mutated.

Differential Revision: [D53165999](https://our.internmc.facebook.com/intern/diff/D53165999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117300
Approved by: https://github.com/jansel
2024-01-29 06:37:08 +00:00
2951bbf0f7 Add some type annotations to torch._inductor.codegen.wrapper (#118491)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118491
Approved by: https://github.com/Skylion007
2024-01-29 06:17:27 +00:00
5f59d0c748 [C10D] Disarm PGNCCL Heartbeat Monitor to gather data (#118344)
Summary:
Leave monitoring thread 'running' in log-only mode. Use the kill logs to
correlate with actual job outcomes (e.g. does stuck job detector agree?)

Later, re-enable (using a justknobs knob this time)

Test Plan: CI

Differential Revision: D53108142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118344
Approved by: https://github.com/shuqiangzhang, https://github.com/yifuwang, https://github.com/malfet, https://github.com/kwen2501
2024-01-29 06:09:36 +00:00
890d8e6692 [executorch hash update] update the pinned executorch hash (#118502)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118502
Approved by: https://github.com/pytorchbot
2024-01-29 03:45:45 +00:00
0d9aff2523 Removed unused “device” argument in torch.frombuffer() #118273 (#118439)
Fixes #118273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118439
Approved by: https://github.com/albanD
2024-01-28 22:01:49 +00:00
acc700739e Upgrade mypy version to 1.8.0 (#118481)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118481
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469, #118475, #118479, #118480
2024-01-28 19:22:37 +00:00
338596dfbc Forbid follow_imports = skip from mypy.ini (#118480)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118480
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469, #118475, #118479
2024-01-28 19:22:37 +00:00
119b66ba16 Use strict to toggle strict options in MYPYSTRICT (#118479)
As we force a specific version of mypy, it's OK to use the agglomerated flag.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118479
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469, #118475
2024-01-28 19:22:22 +00:00
ecca533872 Use dmypy instead of mypy in lintrunner (#118475)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118475
Approved by: https://github.com/suo
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469
2024-01-28 13:42:06 +00:00
cad79bd0bb Remove follow_imports = skip from sympy (#118469)
dmypy silently ignores follow_imports = skip, so to get parity between
dmypy and mypy we have to suck it up and type: ignore all of the sympy
typing problems.

The suppressions were added automatically with the following script generated by GPT-4:

```
import re

# Read the error file
with open("error_file.txt", "r") as f:
    errors = f.readlines()

# Parse the lines with errors and error types
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

# Insert ignore comments in the source files
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118469
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467, #118468
2024-01-28 13:38:38 +00:00
59b4d2cd40 [mypy] Remove colorama ignore_missing_imports (#118468)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118468
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467
2024-01-28 13:38:38 +00:00
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
2ed0af2bde [executorch hash update] update the pinned executorch hash (#118477)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118477
Approved by: https://github.com/pytorchbot
2024-01-28 03:56:11 +00:00
9d5b950bdd [BE][Easy]: Update ruff to 0.1.14 (#118466)
Updates ruff to 0.1.14 which has some more autofixes, bugfixes, and fixes some false positives. Full changelog found here: https://github.com/astral-sh/ruff/releases/tag/v0.1.14
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118466
Approved by: https://github.com/ezyang
2024-01-27 23:44:25 +00:00
ca1d70632d [14/N][Dynamo] Make trace_rules.lookup only handle function + callable type (#118366)
Step by step changes to unblock #118264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118366
Approved by: https://github.com/angelayi
2024-01-27 23:02:44 +00:00
62c1e4a578 Added missing CircularPad*d references so the docs are actually built. (#118465)
Fixes #118429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118465
Approved by: https://github.com/Skylion007
2024-01-27 22:39:01 +00:00
2728c9137d [easy][AOT] Fix shortcut path for simple tuple/list spec (#118460)
`type(self.spec)` is always `TreeSpec` and the condition is always `False`. This PR changes it to `self.spec.type`, which is the type of tree that the spec represents.
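For illustration, the distinction involved (a hedged sketch using `torch.utils._pytree`; internal field access may differ slightly across versions):

```python
import torch.utils._pytree as pytree

spec = pytree.tree_structure((1, 2))  # TreeSpec describing a simple tuple
print(type(spec))                     # always a TreeSpec (sub)class, never `tuple`
print(spec.type is tuple)             # True: the container type the spec represents
```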

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118460
Approved by: https://github.com/Skylion007
2024-01-27 19:04:12 +00:00
1460334436 [quant] Remove deprecated torch.jit.quantized APIs (#118406)
The `torch.jit.quantized` interface has been deprecated since #40102 (June 2020).

BC-breaking message:

All functions and classes under `torch.jit.quantized` will now raise an error if
called/instantiated. This API has long been deprecated in favor of
`torch.ao.nn.quantized`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118406
Approved by: https://github.com/jerryzh168
2024-01-27 18:32:45 +00:00
d03173e88c Unify MYPYINDUCTOR and MYPY (#118432)
The original motivation for MYPYINDUCTOR was a faster type checking configuration that only checked a subset of files. With the removal of `follow_imports = ignore`, we are now able to use dmypy to do fast incremental typechecking, eliminating the need for this.

Perhaps erroneously, when I teed up this PR I elected to delete the `follow_imports = skip` designations in the mypy-inductor.ini. This led to a number of extra type-error suppressions that I manually edited. You will need to review them.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118432
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418
2024-01-27 17:23:20 +00:00
42062e2622 [pytree][BE] update treespec is_leaf() access (#116371)
Change `isinstance(treespec, LeafSpec) -> treespec.is_leaf()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116371
Approved by: https://github.com/zou3519
2024-01-27 11:44:57 +00:00
26473460a4 [ET-Vulkan] ExecuTorch Vulkan floor_div (#118428)
Summary: Add a new operator "floor_div" to ET-Vulkan.

Test Plan:
```
[yipjustin@7777.od ~/fbcode (b32108c6c)]$ buck2 test fbcode//executorch/backends/vulkan/test:test_vulkan_delegate --
File changed: fbcode//executorch/backends/vulkan/test/test_vulkan_delegate.py
Buck UI: https://www.internalfb.com/buck2/90290e5b-d47e-4cac-bc63-9939cc210d1f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3940649890839142
Network: Up: 2.8KiB  Down: 0B  (reSessionID-e7425cc1-0987-46d8-a7bf-418a660bee5b)
Jobs completed: 19. Time elapsed: 42.6s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Reviewed By: SS-JIA

Differential Revision: D53072722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118428
Approved by: https://github.com/SS-JIA
2024-01-27 11:20:52 +00:00
8d790abab9 [NCCL][c10d] Log failing pointer if deregistration fails (#118455)
For debugging convenience

CC @minsii @Aidyn-A @syed-ahmed @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118455
Approved by: https://github.com/wconstab
2024-01-27 11:03:02 +00:00
dabb90f2a4 Revert "[Exception] [6/N] Remove use of torch::TypeError (#117964)"
This reverts commit 87335fabaeca41f9721ba5d5eb7eafcf70b7afad.

Reverted https://github.com/pytorch/pytorch/pull/117964 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/117964#issuecomment-1913079096))
2024-01-27 08:44:34 +00:00
bb6eba189f [export][ez] remove unused argument from InterpreterModule (#118364)
small thing I noticed

Differential Revision: [D53113926](https://our.internmc.facebook.com/intern/diff/D53113926/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118364
Approved by: https://github.com/angelayi
2024-01-27 06:46:01 +00:00
89a1175e0e Upgrade mypy python_version to 3.11 (#118418)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118418
Approved by: https://github.com/albanD
ghstack dependencies: #118414
2024-01-27 06:10:46 +00:00
978faf1fa2 Use an op counter to decide when to realize a kernel (#117030)
Instead of checking the number of bytes in the string representation
of the kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117030
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-01-27 05:28:46 +00:00
800e2e823f Add compilable foreach RAdam support (#117912)
Fixes https://github.com/pytorch/pytorch/issues/117807

This brings the number of supported optimizers with `torch.compile` to 11/13 (!)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117912
Approved by: https://github.com/janeyx99
2024-01-27 04:32:27 +00:00
fe10b1800f LazyGraphModule (#117911)
I feel it's easier to open a new PR rather than iterating on the previous PR (https://github.com/pytorch/pytorch/pull/105257 ) since this is more like a rewrite.

In this PR, instead of changing GraphModule directly which can easily causes BC issue, I create a LazyGraphModule class as Zachary & Jason suggested in comments from the previous PR.

The difference between LazyGraphModule and GraphModule is mainly about how re-compile for the graph module happens. In GraphModule the recompilation happens 'eagerly': constructing a GraphModule will cause the recompilation. While in LazyGraphModule, we just mark the module as needing recompilation. The real recompilation only happens when absolutely required (e.g. call forward method, access the code property etc.). In a lot of cases in torch.compile, the real recompilation eventually is not triggered at all. This can save a few seconds of compilation time.

By default, GraphModule rather than LazyGraphModule is used. `use_lazy_graph_module(True)` context manager can be used to pick LazyGraphModule instead. This has been applied to the torch.compile stack.
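A minimal sketch of the lazy-recompile idea described above (illustrative only; the real LazyGraphModule integrates with `GraphModule.recompile` in torch.fx):

```python
class LazyRecompileSketch:
    """Defer the expensive recompile until the module is actually used."""

    def __init__(self, graph):
        self.graph = graph
        self._needs_recompile = True       # mark dirty instead of compiling eagerly
        self._compiled_fn = None

    def recompile(self):
        self._needs_recompile = True       # cheap: just flag, no codegen yet

    def _real_recompile(self):
        ops = list(self.graph)
        def compiled(x):
            for op in ops:                 # stands in for the generated forward() body
                x = op(x)
            return x
        self._compiled_fn = compiled
        self._needs_recompile = False

    def forward(self, x):
        if self._needs_recompile:          # compile only when actually called
            self._real_recompile()
        return self._compiled_fn(x)

m = LazyRecompileSketch([lambda x: x + 1, lambda x: x * 2])
m.recompile()                              # no work happens here
print(m.forward(3))                        # real "codegen" happens on first use -> 8
```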

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117911
Approved by: https://github.com/jansel
2024-01-27 04:10:18 +00:00
70699a6357 [C10D] Add tests for gather and gather_object with subgroup (#118359)
Addresses #118337 somewhat; we probably need to update the docs. Let's first confirm what behavior we want.

Identifies a couple of confusing things
1) 'dst' arg for many collectives is always in 'global' rank regardless
   of whether a subgroup is passed in.  This needs a doc update
2) gather_object has a strong dependency on setting the cuda device;
   could we make that smoother?
3) gather_object also should be happy with an empty list on the dst
   side, imo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118359
Approved by: https://github.com/weifengpy
2024-01-27 04:08:56 +00:00
28625d746f [executorch hash update] update the pinned executorch hash (#118443)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118443
Approved by: https://github.com/pytorchbot
2024-01-27 04:08:49 +00:00
993e4f3911 [c10d] relax the nccl error check for nonblocking mode (#118254)
resolve https://github.com/pytorch/pytorch/issues/117749

Summary:
This is the first step to enable NCCL nonblocking mode.

In NCCL nonblocking mode, ncclInProgress is an expected return value when checking communicators. Without this relaxation, the watchdog thread would throw NCCL errors during work checking even though this state is expected.

Test Plan:
Set nonblocking mode in unit tests, and make sure all existing NCCL
tests pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118254
Approved by: https://github.com/kwen2501
2024-01-27 03:49:00 +00:00
40c08795b0 [JIT] python IR bindings: consolidate tests, add short docs in OVERVIEW.md (#118319)
Document the existence of python IR bindings; quick comments about it; and consolidate tests in one file to serve as examples to users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118319
Approved by: https://github.com/eellison
2024-01-27 03:11:51 +00:00
9bce208dfb Replace follow_imports = silent with normal (#118414)
This is a lot of files changed! Don't panic! Here's how it works:

* Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file.
* When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded.
* The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors.
* Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list.
* Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves.
* torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state.
* There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many.

In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file.

The codemod was done with this script authored by GPT-4:

```
import glob

exclude_patterns = [
    ...
]

for pattern in exclude_patterns:
    for filepath in glob.glob(pattern, recursive=True):
        if filepath.endswith('.py'):
            with open(filepath, 'r+') as f:
                content = f.read()
                f.seek(0, 0)
                f.write('# mypy: ignore-errors\n\n' + content)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414
Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD
2024-01-27 02:44:11 +00:00
af1338bfbf fix escape nested comments in C++ (#117882)
Fixes #115243, as it is tricky to deal with the nested comment in doxygen + sphinx. Change 6 below is adopted as the fix. All other changes do not work.

After adopting change 6, realize the original
`torch::optim::SGD sgd(0.9);` is not the correct call to the sgd constructor,
modified to the correct one
`torch::optim::SGD sgd(model->parameters(), 0.9);`

- Original in [link](https://pytorch.org/cppdocs/api/function_namespacetorch_1ad98de93d4a74dd9a91161f64758f1a76.html#exhale-function-namespacetorch-1ad98de93d4a74dd9a91161f64758f1a76): `///   torch::optim::SGD sgd(/*lr=*/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/0054b355-4925-4112-93b4-9385fdc34bb9)

- Change 1, this solution is referenced from [here](https://stackoverflow.com/questions/24978463/doxygen-escape-nested-comments-in-c): `///   torch::optim::SGD sgd(/&zwj;* lr= *&zwj;/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/77ff2d18-3097-4265-8dcd-31d78acb9c6e)

- Change 2: `///   torch::optim::SGD sgd(/* lr= *//* 0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/b520f8de-ead7-4009-b0fb-f4517daba077)

- Change 3: `///   torch::optim::SGD sgd(/\*lr=\*/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/07e9e608-4640-43c0-994a-37983b803003)

- Change 4: `///   torch::optim::SGD sgd(/&lowast; lr= &lowast;/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/121e55c5-0802-4ff3-bbd7-3521e1299d94)

- Change 5:
```
/// \rst
/// .. code-block:: cpp
///
///   torch::nn::Linear model(3, 4);
///   torch::load(model, "model.pt");
///   \verbatim
///   torch::optim::SGD sgd(/*lr=*/0.9);
///   \endverbatim
///   std::istringstream stream("...");
///   torch::load(sgd, stream);
///
///   auto tensor = torch::ones({3, 4});
///   torch::load(tensor, "my_tensor.pt");
/// \endrst
```
![image](https://github.com/pytorch/pytorch/assets/7495155/e675f551-e939-4be8-b24a-e2e53377dd08)

- Change 6: `///   torch::optim::SGD sgd(0.9);  // 0.9 is the learning rate`
![image](https://github.com/pytorch/pytorch/assets/7495155/ecf0adc4-9b0b-4aef-b0bc-72d4b17c45fa)
![image](https://github.com/pytorch/pytorch/assets/7495155/01bf5d5b-8450-4599-8c9a-00204ab56119)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117882
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-01-27 02:37:23 +00:00
5b31516008 [dynamo] inline torch.jit._unwrap_optional (#118434)
Before this pr, torch.jit._unwrap_optional is in the skipfile list thus causing a graph break. Check its implementation it's just a normal python function [here](ff8e33556e/torch/jit/_script.py (L1681-L1683)):
```python
def _unwrap_optional(x):
    assert x is not None, "Unwrapping null optional"
    return x
```
We could safely inline it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118434
Approved by: https://github.com/yanboliang
2024-01-27 02:22:14 +00:00
4aa1f994be [dynamo][assume_constant_result] Dont put symbolic guards for assume_constant_result (#118430)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118430
Approved by: https://github.com/ydwu4
2024-01-27 01:56:14 +00:00
838d3620cd [NCCL PG] log NCCL comm at creation and abort (#118335)
Summary: It helps correlate NCCL PG with corresponding NCCL comm in separate logs.

Differential Revision: D53107647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118335
Approved by: https://github.com/wconstab
2024-01-27 01:43:53 +00:00
80cb6db90d [CUDA] [CI] Disable flash attention for sm87 architecture when the head dim > 192 (#117678)
Head dim > 192 requires A100/H100 (sm80 or sm90) per TORCH_CHECK [here](0c26565d5d/aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp (L760)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117678
Approved by: https://github.com/eqy, https://github.com/malfet
2024-01-27 01:22:47 +00:00
7cc7bf9dda [GHF] Add co-authors to PR (#118347)
Mention co-authors in PR body

Modify the `CommitAuthors` query to include the first two commit `authors`, which makes sure that authors from suggested commits are recognized.

Test plan: CI + check `get_authors()` on a few PRs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118347
Approved by: https://github.com/kit1980
2024-01-27 01:02:49 +00:00
4d771c56de [xnnpack] Move x86 flags to platform_compiler_flags (#117923)
Summary:
AVX extension flags are x86-specific, and clang-18 has started to error on them when building targets that are not x86. I couldn't find the resulting upstream change that made these flags an error, but it's fairly trivial that these flags do not apply to all architectures.

For most of the flags, they are already defined in `platform_compiler_flags`. The changes done
* Gate the flags under `compiler_flags` with `selects`
* If flags weren't defined in `platform_compiler_flags`, define them there as well
* Remove the `^` and `$` in the platform regex. Not all flavors start with the platform (e.g. `android-x86_64`).
* Some minor formatting changes were also included here.

Test Plan:
Atop D52741786,
```
buck2 build --flagfile 'arvr/mode/android/apk/linux/opt'  '//arvr/projects/mixedreality/android/ocean_passthrough_service:ocean_passthrough_mrservice_dev'
```

Differential Revision: D52856224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117923
Approved by: https://github.com/mcr229
2024-01-26 23:41:06 +00:00
ff8e33556e Enables load balancing duplicates in DCP (#116469)
Enables the deduplication of saved entries by load balancing duplicates across ranks.

Tested with existing and modified tests. Additionally tested with the following code snippet, which saves a 20GB DDP model in **~3 seconds on 8 ranks**.  Before this PR, the same operation has been measured at ~19 seconds.

```
def run(local_rank, world_size, param_size, num_params, work_dir):

    os.environ["RANK"] = str(local_rank)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl", rank=local_rank, world_size=world_size)

    model = Model(param_size=param_size, num_params=num_params)
    model = DistributedDataParallel(model, gradient_as_bucket_view=True)
    _patch_model_state_dict(model)

    sz = sum(t.nelement() * t.element_size() for t in model.parameters())
    rank_0_print(f"Model size: {sz / 1_000_000_000.0} GB")
    rank_0_print("Saving the model with DCP...")

    checkpointer = _FileSystemCheckpointer(
        f"{args.work_dir}/dcp",
        sync_files=False,
        single_file_per_rank=False,
        thread_count=1
    )

    begin_ts = time.monotonic()
    checkpointer.save(state_dict={"model": model})
    end_ts = time.monotonic()
    rank_0_print(f"Took {end_ts - begin_ts} seconds with DCP")
```

Differential Revision: [D52435926](https://our.internmc.facebook.com/intern/diff/D52435926/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116469
Approved by: https://github.com/fegin, https://github.com/wz337
2024-01-26 22:34:14 +00:00
b95c45fbf7 add stack trace to device skip (#118112)
Log the stack trace of the offending CPU use if it causes cudagraphs to be disabled. Also refactors `disable_cudagraphs: bool`, changing `disable_cudagraphs_reason` from `str` to `Optional[str]`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118112
Approved by: https://github.com/bdhirsh
2024-01-26 22:33:48 +00:00
b256b7b348 Add way to actually delete a torch.library.Library object (#118318)
Relying on object lifetimes in Python is a bad idea due to reference
cycles. Previously, when a torch.library.Library object gets destroyed,
it clears all the registrations associated with it, but it's unclear
when it actually gets destroyed due to the existence of refcycles.

This PR:
- adds torch::Library::clear(), which deterministically releases all of
  the RAII registration handles of the torch::Library object
- adds a new `torch.library._scoped_library` context manager, which creates
  a library and cleans it up at the end of the scope using the previous item.
  All tests (unless they already handle library lifetimes) should use
  this new API
- Rewrites some flaky tests to use `_scoped_library`.

In the future we'll probably migrate all of our torch.library tests to
use `_scoped_library`, but that's kind of annoying because we have
multiple thousands of LOC

I'm hoping this will deflake those tests; we'll see.
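A hedged usage sketch of the new `_scoped_library` context manager (the namespace, schema, and op below are made-up examples):

```python
import torch

def test_my_op():
    # Registrations made through `lib` are deterministically cleared on exit,
    # instead of relying on the Library object's refcount hitting zero.
    with torch.library._scoped_library("mylib", "FRAGMENT") as lib:
        lib.define("add_one(Tensor x) -> Tensor")
        lib.impl("add_one", lambda x: x + 1, "CPU")
        out = torch.ops.mylib.add_one(torch.zeros(3))
        assert torch.equal(out, torch.ones(3))

test_my_op()
```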
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118318
Approved by: https://github.com/albanD
2024-01-26 22:30:51 +00:00
f129e3fe03 [inductor] Handle cum{sum,prod} on zero-dim tensors (#117990)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117990
Approved by: https://github.com/lezcano
2024-01-26 22:21:42 +00:00
074ac822d5 [ONNX] Skip empty input test case in aten_mm (#118413)
Fixes #117718
Fixes #117725

It's actually a known issue in https://github.com/microsoft/onnxscript/pull/586, and we already exclude the empty-input test cases in aten_matmul. This PR follows that skip and adds aten_mm as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118413
Approved by: https://github.com/thiagocrepaldi
2024-01-26 22:06:57 +00:00
eee63ac845 [dynamo] move torch._C._get_cublas_allow_tf32 to constant_fold_functions (#118342)
Previously, I created a value match for torch._C._get_cublas_allow_tf32; it should just be in constant_fold_functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118342
Approved by: https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #118236
2024-01-26 22:00:00 +00:00
d41cfc92e6 [CI] simplify mergeability check workflow (#118415)
Test run:
https://github.com/pytorch/pytorch/actions/runs/7673050632/job/20914851421?pr=118415
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118415
Approved by: https://github.com/PaliC, https://github.com/huydhn
2024-01-26 21:45:24 +00:00
84251d1d71 [ez] Windows log printing + save successful test logs (#118124)
When doing `print(f.read().decode(...))` it prints an extra newline, so manually splitlines and strip to see if that helps.

My guess is Windows line-ending differences.

Also always save log file regardless of success or failure

See 476b81a9bf for what it looks like now

Swapped to opening in text mode instead of binary, seems to be ok now.

42483193bf024983060a234dc0262f4840aef4b8 for example
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118124
Approved by: https://github.com/huydhn
2024-01-26 21:14:25 +00:00
5c56822be2 [export] Various fixes to .module() (#118272)
Summary: While turning on .module() for all the export tests, I uncovered some bugs with .module() and while fixing them I ended up rewriting some of the code... Some of the bugs were:

* bad kwargs support on the unlifted module
* no support for user input mutations
* (at the commit hash I was working off of) no support for custom objects
* there were no tests on unlifting weights from cond/map submodules

Test Plan: CI

Differential Revision: D53075380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118272
Approved by: https://github.com/suo
2024-01-26 21:05:07 +00:00
2ed1b1747a Fix Auto Functionalize to handle specified default values (#118331)
Summary: When there were optionals with specified default values, the code was improperly handling the number of parameters, causing `IndexError: tuple index out of range`.

Test Plan: New tests.

Reviewed By: zou3519

Differential Revision: D53095812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118331
Approved by: https://github.com/zou3519
2024-01-26 20:31:38 +00:00
07499074bb Increasing session duration for AWS credentials for _rocm-test.yml (#118412)
The workflow _rocm-test.yml needs a longer session duration for its AWS role keys.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118412
Approved by: https://github.com/jeffdaily, https://github.com/huydhn
2024-01-26 19:32:24 +00:00
939008a268 Fix RuntimeError: NYI: Named tensors are not supported with the tracer (#118393)
This PR relands #108238, which was closed as stale due to CLA issues and also because the CI check marked the PR as not mergeable.

Repro 1:

```python
import torch

def f(x):
    return x[x > 0]

jf = torch.jit.trace(f, torch.tensor(2., device="cuda"))
```
Error:

```bash
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/pytorch/torch/jit/_trace.py", line 874, in trace
    traced = torch._C._create_function_from_trace(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<stdin>", line 2, in f
RuntimeError: NYI: Named tensors are not supported with the tracer
```

Repro2:

```python
import torch
import torch.nn.functional as F
from torch import nn
import copy

class Net(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs):
        x = copy.deepcopy(inputs) # RuntimeError: NYI: Named tensors are not supported with the tracer
        x = F.relu(x)
        return x

model = Net()
images = torch.randn(8, 28, 28)
torch.jit.trace(model, images)
```

Error 2:

```bash
Traceback (most recent call last):
  File "/opt/pytorch/test_deepcopy.py", line 18, in <module>
  File "/opt/pytorch/torch/jit/_trace.py", line 806, in trace
    return trace_module(
           ^^^^^^^^^^^^^
  File "/opt/pytorch/torch/jit/_trace.py", line 1074, in trace_module
    module._c._create_method_from_trace(
  File "/opt/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/torch/nn/modules/module.py", line 1501, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/test_deepcopy.py", line 12, in forward
    x = F.relu(x)
        ^^^^^^^^^^
  File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/opt/pytorch/torch/_tensor.py", line 122, in __deepcopy__
    new_storage = self._typed_storage()._deepcopy(memo)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/torch/storage.py", line 847, in _deepcopy
    return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/opt/pytorch/torch/storage.py", line 112, in __deepcopy__
    new_storage = self.clone()
                  ^^^^^^^^^^^^
  File "/opt/pytorch/torch/storage.py", line 126, in clone
    return type(self)(self.nbytes(), device=self.device).copy_(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NYI: Named tensors are not supported with the tracer
```

----
 #48054 RuntimeError: NYI: Named tensors are not supported with the tracer
 #49538 jit tracer doesn't work with unflatten layer
 #31591 when i try to export a pytorch model to ONNX, got RuntimeError: output of traced region did not have observable data dependence with trace inputs; this probably indicates your program cannot be understood by the tracer.
   - This bug was closed but still exists; multiple comments on it still show the error. This is addressed here.

Likely fixes the following issues (but untested)

 #63297 Named tensor in tracer
 #2323 [Bug] torch.onnx.errors.UnsupportedOperatorError when convert mask2former to onnx

Fixes zero-dimensional tensors when used with jit.trace. They are currently assigned an empty set `{}` for names, which is not the same as "no name", so jit.trace bails with "NYI: Named tensors are not supported with the tracer". This happens when trying to save a non-trivial model as ONNX; the simplest repro I have seen is #48054 above, which has been added as test/jit/test_zero_dim_tensor_trace.py.

Test plan:
  New unit test added
  Broken scenarios tested locally
  CI

Fixes #48054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118393
Approved by: https://github.com/zou3519
2024-01-26 19:31:23 +00:00
bfbb8d8220 Don't manually invoke atexit exit handlers in tests (#118409)
Fixes https://github.com/pytorch/pytorch/issues/104098

This is a bad idea because it runs all the exit handlers and messes with
global state that is necessary for other tests to run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118409
Approved by: https://github.com/ydwu4, https://github.com/yanboliang
ghstack dependencies: #118152, #118309
2024-01-26 19:11:19 +00:00
728789d850 Deflake stream tests, part 2 (#118391)
I missed these the first time around, some more streams need to be
synchronized.

Fixes https://github.com/pytorch/pytorch/issues/112694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118391
Approved by: https://github.com/ydwu4, https://github.com/yanboliang
2024-01-26 19:10:53 +00:00
e696fa1ee7 [tp] enable rowwise embedding sharding in RowwiseParallel (#118242)
As titled, this PR enables rowwise embedding sharding in the RowwiseParallel style and adds tests to ensure it's working as expected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118242
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079, #118080
2024-01-26 19:01:24 +00:00
dc8357b397 [dtensor] implement dim-0 (row) embedding sharding with MaskPartial (#118080)
This PR adds support for rowwise sharded embedding by adding a MaskPartial placement that inherits from the default partial placement and overrides the Partial contracts to construct the mask and release it after the reduction.

The MaskPartial placement has the potential to support other ops' sharding computation that requires a mask for semantic correctness. Currently it lives in the embedding ops, but we can move it to a common place if needed.
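A conceptual, framework-free sketch of the mask-based partial idea for rowwise (dim-0) embedding sharding (illustrative only; the real MaskPartial is a DTensor placement and the cross-rank reduction happens through the DTensor machinery):

```python
import torch

def rowwise_embedding(local_weight, indices, row_start):
    rows = local_weight.shape[0]
    in_shard = (indices >= row_start) & (indices < row_start + rows)
    local_idx = (indices - row_start).clamp(0, rows - 1)
    out = local_weight[local_idx]
    out[~in_shard] = 0          # mask rows this rank doesn't own
    # Each rank's result is "partial"; summing across ranks (e.g. all_reduce)
    # recovers the full lookup, after which the mask can be released.
    return out

w = torch.arange(12.0).reshape(6, 2)
idx = torch.tensor([0, 3, 5])
part0 = rowwise_embedding(w[:3], idx, row_start=0)
part1 = rowwise_embedding(w[3:], idx, row_start=3)
assert torch.equal(part0 + part1, w[idx])   # sum of partials == full lookup
```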

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118080
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079
2024-01-26 19:01:24 +00:00
910b49c48b [dtensor] rewrite embedding ops using op strategy (#118079)
This PR rewrites the sharded embedding rule to use OpStrategy instead of the rule, one step further toward getting rid of rules and consolidating the embedding operator implementation, in preparation for the rowwise embedding implementation, which will come in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118079
Approved by: https://github.com/tianyu-l
2024-01-26 19:01:15 +00:00
25f72194e8 Realize inputs to DynamicScalar before unwrapping storage (#118125)
Fixes https://github.com/pytorch/pytorch/issues/118102

Unfortunately, the test still fails due to an unrelated problem https://github.com/pytorch/pytorch/issues/117665

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118125
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #117862
2024-01-26 18:08:03 +00:00
96d94f574e Fix several bugs related to unbacked SymInt codegen in inductor (#117862)
Let me tell you, this was a *journey.*

* When we repropagate through FX interpreter in AOTAutograd, this will reallocate unbacked SymInts. We can eliminate all of these fresh allocations by appropriately asserting equalities on them setting up replacements. See also https://github.com/pytorch/pytorch/issues/111950
* The `inner_fn` of Loops can contain references to unbacked SymInts. We must collect them to prevent DCE.
* Export naughtily accessed `_expr` when it should have accessed `expr` on SymNode. Fixed two sites of this.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117862
Approved by: https://github.com/bdhirsh
2024-01-26 18:08:03 +00:00
89a0b1df51 fix lint for cudnn codes (#117091)
Fixes the lint issue described in https://github.com/pytorch/pytorch/pull/116759

@albanD Please have a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117091
Approved by: https://github.com/albanD
2024-01-26 17:53:22 +00:00
2842d3c9d3 [Nested Tensor] view: basic support for ragged_idx != 1 and _unsafe_view (#118317)
Use case: `_unsafe_view` is used in aot_autograd to create a view that doesn't register as a view:

eebe7e1d37/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L470-L476)

If a transposed nested tensor (i.e. NT with ragged_idx != 1) encounters this code path, it previously would fail for two reasons: 1) because `_unsafe_view` isn't registered, and 2) because ragged_idx != 1 is not supported. This PR adds support for `_unsafe_view` (completely reusing the implementation of `view`; this just registers `_unsafe_view` as another op using the same implementation). It also adds support for ragged_idx != 1, but only for trivial cases where inp._size == size (the use case used by aot_autograd).

Tests: verify that the result of `_unsafe_view` doesn't have a `_base`, and that simple views on transposed NTs work.

Differential Revision: [D53096814](https://our.internmc.facebook.com/intern/diff/D53096814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118317
Approved by: https://github.com/soulitzer
2024-01-26 17:29:37 +00:00
533637d9a3 Revert "Check if enable inside run call (#118101)"
This reverts commit 2abb812a78c0d3976e6eb10114716bcb163480ca.

Reverted https://github.com/pytorch/pytorch/pull/118101 on behalf of https://github.com/clee2000 due to broke periodic multigpu test some how 6fc015fedc ([comment](https://github.com/pytorch/pytorch/pull/118101#issuecomment-1912357321))
2024-01-26 16:41:56 +00:00
f1aef2c094 Don't check is_conj for _refs.linalg.svd (#117972)
The flag is not correctly set when PyTorch is compiled with GPU support, resulting in failures in
`test_ops.py::test_python_ref_meta__refs_linalg_svd_cpu_complex`.

Use a similar approach to test_meta and skip the check for this function.

Workaround for #105068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117972
Approved by: https://github.com/lezcano
2024-01-26 15:24:29 +00:00
af8f37c2b6 Revert "Use SEQUENTIAL posix_fadvise on mmapped files (#117805)"
This reverts commit 401aa1a1deaee19909c957d7d56d91341018b4dc.

Reverted https://github.com/pytorch/pytorch/pull/117805 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/117805#issuecomment-1912204403))
2024-01-26 14:59:58 +00:00
6da0e7f84b [Clang-tidy header][17/N] Apply clang-tidy on headers in torch/csrc/cuda (#117829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117829
Approved by: https://github.com/albanD
2024-01-26 13:33:24 +00:00
8ff55c7e68 Clarified sampling process of torch.randn for complex dtypes. (#118315)
Fixes #118269.

Clarified the docs of `torch.randn` and `torch.randn_like`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118315
Approved by: https://github.com/lezcano
2024-01-26 13:05:19 +00:00
b66c4eda61 [Inductor] Add Thread Number Checker in scatter_reduce_ fallback for CPP backend (#118278)
**Summary**
Follow-up of https://github.com/pytorch/pytorch/pull/108220, which improves the performance of `basic_gnn_gin`, `basic_gnn_sage` and `basic_gnn_gcn` in the multi-thread test cases. However, it causes a performance regression for these 3 models in the single-thread test case, as reported in https://github.com/pytorch/pytorch/issues/117740. This PR fixes the single-thread issues by adding a thread-number check to decide whether to fall back to `scatter_reduce_` or not.

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_scatter_using_atomic_add
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118278
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-01-26 12:43:25 +00:00
0857a3a753 [c10d_functional] fix an issue where mutation on views fails in inductor (#118333)
`_CollectiveKernel.create_inplace` expresses mutation with the newly introduced `MutationOutput` which requires the `layout` of the input. Currently, there's a bug where if the input is a view, `inp.layout` fails. This PR fixes the issue by unwrapping the input if it's a view.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118333
Approved by: https://github.com/wanchaol
2024-01-26 11:13:30 +00:00
4d0b471389 fix key error in pre_grad fx_passes_numeric_check (#118325)
Summary:
```
I0125 121749.865 pyper_config_utils.py:8225] torchdynamo pyper config = TorchDynamoConfig(backend='inductor', optimize_ddp=False, log_compile_graph=False, inductor_config=TorchInductorConfig(enable_cudagraph=False, max_autotune=False, max_autotune_pointwise=True, max_autotune_gemm=False, search_autotune_cache=False, autotune_in_subproc=False, aggressive_fusion=False, shape_padding=True, permute_fusion=False, epilogue_fusion_first=False, debug=True, triton=None, trace_enabled=False, log_kernel_source=False, split_cat_fx_passes=False, group_fusion=False, batch_fusion=False, coordinate_descent_tuning=False, coordinate_descent_check_all_directions=False, coordinate_descent_search_radius=1, layout_optimization=True, pre_grad_fusion_options={}, post_grad_fusion_options={}, max_pointwise_cat_inputs=4, fx_passes_numeric_check={}), automatic_dynamic_shapes=True)
```
In trainer
```
I0125 12:58:51.832000 4011.139732263132160 torchdynamo_wrapper.py:291  trainer:0:1 ] [pt2] creating torchdynamo backend wrapper with settings TorchDynamoConfig(backend='inductor', optimize_ddp=False, log_compile_graph=False, inductor_config=TorchInductorConfig(enable_cudagraph=False, max_autotune=False, max_autotune_pointwise=True, max_autotune_gemm=False, search_autotune_cache=False, autotune_in_subproc=False, aggressive_fusion=False, shape_padding=True, permute_fusion=False, epilogue_fusion_first=False, debug=True, triton=None, trace_enabled=False, log_kernel_source=False, split_cat_fx_passes=False, group_fusion=False, batch_fusion=False, coordinate_descent_tuning=False, coordinate_descent_check_all_directions=False, coordinate_descent_search_radius=1, layout_optimization=True, pre_grad_fusion_options={}, post_grad_fusion_options={}, max_pointwise_cat_inputs=4, fx_passes_numeric_check={}), automatic_dynamic_shapes=True) #ai_training_job_id="febe34d9-b2fb-493e-a5cc-6a0b1dc85ad4" #ai_training_local_rank="1" #ai_training_role_rank="1" #mast_job_attempt="2" #mast_job_name="f525072920-TrainingApplication"
...
if config.fx_passes_numeric_check["pre_grad"]:
```

https://www.internalfb.com/diff/D52826442?dst_version_fbid=1115735309429172&transaction_fbid=682438900759710

https://www.internalfb.com/diff/D51838043?dst_version_fbid=336373395892373&transaction_fbid=349901787874069

This diff first fixes the key error to restore broken tests.  Its pyper changes can be addressed later.

https://www.internalfb.com/code/fbsource/[72c19313ed73]/fbcode/caffe2/torch/_inductor/config.py?lines=142-147

Test Plan: buck2 run //caffe2/torch/fb/training_toolkit/integration_tests/training_lifecycle/cogwheel_tests/pyper_release_v2:cogwheel_smallworld_mimo_cmf_deterministic_ne_pt2_training_platform__canary_offline_training-launcher -- --build-fbpkg --run-disabled --tests test

Reviewed By: yusuo

Differential Revision: D53102344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118325
Approved by: https://github.com/mengluy0125
2024-01-26 11:02:12 +00:00
8dd1be49b7 [Inductor] Use sleef implementation for CPP backend acosh codegen (#118350)
**Summary**
Fixes https://github.com/pytorch/pytorch/issues/118267. The cpp backend currently uses `f"({x} + ({x}*{x} - {vec_one}).sqrt()).log()"` to compute `acosh`; the issue happens when the input is a large negative value like `-910685.8125`. In this case, `(x*x - 1).sqrt() + x` equals 0, and `0.log()` returns `-inf`. However, per the documentation (https://pytorch.org/docs/stable/generated/torch.acosh.html), negative inputs should return `NaN`. Use the sleef acosh implementation to fix this issue.
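A small worked illustration of the cancellation in float32 (a hedged eager-mode repro of the reported behavior; the actual codegen is vectorized C++):

```python
import torch

# In float32, x*x - 1 rounds back to x*x for a large |x|, so sqrt(x*x - 1)
# equals -x for a large negative x, the sum becomes 0, and log(0) gives -inf
# instead of the NaN that acosh documents for negative inputs.
x = torch.tensor(-910685.8125, dtype=torch.float32)
naive = ((x * x - 1).sqrt() + x).log()
print(naive)           # -inf from the log-based formula, per the report above
print(torch.acosh(x))  # tensor(nan), the documented result for negative inputs
```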

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_acosh_with_negative_large_input
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118350
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-01-26 10:19:40 +00:00
2ea38498b0 [FSDP][BE] Only show state_dict log when the debug level is detail (#118196)
As title

Differential Revision: [D53038704](https://our.internmc.facebook.com/intern/diff/D53038704/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118196
Approved by: https://github.com/rohan-varma, https://github.com/wz337
ghstack dependencies: #118197, #118195
2024-01-26 09:52:36 +00:00
4f4e61bb75 [DCP] Add tests to demonstrate DCP checkpoint conversion (#117773)
As title

Differential Revision: [D52854759](https://our.internmc.facebook.com/intern/diff/D52854759/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117773
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #116248, #117772
2024-01-26 09:39:10 +00:00
644bc69530 [DCP] Allow users to save and load without creating storage reader and writer (#117772)
Right now the DCP API requires users to create a StorageWriter and StorageReader for every API call. This PR allows users to only pass the checkpointer_id (a path) and use it to read/write a checkpoint without creating a StorageReader and Writer.
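A hedged usage sketch (the path-only keyword and single-process behavior below are assumptions based on this description, not verified API details):

```python
import torch
import torch.distributed.checkpoint as dcp

# Assumed: a path-like checkpoint id can be passed directly, without
# constructing a StorageWriter/StorageReader; single-process use for brevity.
model = torch.nn.Linear(4, 4)
state_dict = {"model": model.state_dict()}
dcp.save(state_dict, checkpoint_id="/tmp/ckpt")  # path only, no StorageWriter
dcp.load(state_dict, checkpoint_id="/tmp/ckpt")  # path only, no StorageReader
```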

Differential Revision: [D52740556](https://our.internmc.facebook.com/intern/diff/D52740556/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117772
Approved by: https://github.com/wz337
ghstack dependencies: #116248
2024-01-26 09:08:35 +00:00
fc30bd3b7b Revert "[dtensor] rewrite embedding ops using op strategy (#118079)"
This reverts commit e599a0879684abedec2a28b08b822fd4a4219105.

Reverted https://github.com/pytorch/pytorch/pull/118079 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/118079#issuecomment-1911681293))
2024-01-26 08:47:14 +00:00
bfb5e7642e Revert "[dtensor] implement dim-0 (row) embedding sharding with MaskPartial (#118080)"
This reverts commit 8cc02b46c33b5192289e4cf64fa55d685127bfb8.

Reverted https://github.com/pytorch/pytorch/pull/118080 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/118079#issuecomment-1911681293))
2024-01-26 08:47:14 +00:00
bc67f87559 Revert "[tp] enable rowwise embedding sharding in RowwiseParallel (#118242)"
This reverts commit 7a9012d7e847a6265e70873e9baab70838edd601.

Reverted https://github.com/pytorch/pytorch/pull/118242 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/118079#issuecomment-1911681293))
2024-01-26 08:47:14 +00:00
2c9a90cde6 [ROCm] backward compatible type enums (#118137)
Fixes builds of pytorch using unreleased ROCm packages that are missing type enums introduced in ROCm 6.0 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118137
Approved by: https://github.com/xw285cornell, https://github.com/anupambhatnagar
2024-01-26 08:40:13 +00:00
f8e14f3b46 [PyTorch][Vulkan] Clean up aten::stack (#118314)
Summary:
After D50347338, we already support zero-dim tensor input, which was my original task. As a result, this diff doesn't add or change functionality; it just cleans up the following:
1. Fix TORCH_CHECK to only allow `tensor.dim() <= 3`. Previously, it was a no-op since it didn't use `&&`.
2. Add `tensor.dim() == 0` tests.
3. Address `readability-container-size-empty` and `performance-unnecessary-copy-initialization` linter errors.

Test Plan:
Tested on OD.
```
[jorgep31415@29786.od /data/sandcastle/boxes/fbsource (1d0b920e0)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="*stack*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/ops/Unsqueeze.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
3 additional file change events
Buck UI: https://www.internalfb.com/buck2/98bb3bfa-a1d1-440e-8724-b4990c9cc7ca
Network: Up: 1.4MiB  Down: 377KiB  (reSessionID-6eccf420-3951-4942-9350-998803589b8d)
Jobs completed: 17. Time elapsed: 42.6s.
Cache hits: 38%. Commands: 8 (cached: 3, remote: 0, local: 5)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *stack*
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.stack_invalid_inputs
[       OK ] VulkanAPITest.stack_invalid_inputs (27 ms)
[ RUN      ] VulkanAPITest.stack_0d
[       OK ] VulkanAPITest.stack_0d (28 ms)
[ RUN      ] VulkanAPITest.stack_1d
[       OK ] VulkanAPITest.stack_1d (1 ms)
[ RUN      ] VulkanAPITest.stack_2d
[       OK ] VulkanAPITest.stack_2d (148 ms)
[ RUN      ] VulkanAPITest.stack_3d
[       OK ] VulkanAPITest.stack_3d (354 ms)
[----------] 5 tests from VulkanAPITest (561 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (561 ms total)
[  PASSED  ] 5 tests.
```

Reviewed By: copyrightly, liuk22

Differential Revision: D53071188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118314
Approved by: https://github.com/liuk22
2024-01-26 04:28:06 +00:00
2b1ee9be7a [executorch hash update] update the pinned executorch hash (#118339)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118339
Approved by: https://github.com/pytorchbot
2024-01-26 04:26:38 +00:00
0c5da6100f [PyTorch][Vulkan] Clean up aten::unsqueeze (#118311)
Summary:
After D50347338, we already support zero-dim tensor input, which was my original task. As a result, this diff doesn't add or change functionality; it just cleans up the following:
1. Fix TORCH_CHECK to only allow `tensor.dim() <= 3`. Previously, it was a no-op since it didn't use `&&`.
2. Add 0->1 `tensor.dim()` tests.
3. Remove `dim == 0` case from shader since that path is never executed. The `cpp` code sends the input to `submit_copy` instead.

Test Plan:
Tested on OD.
```
[jorgep31415@29786.od /data/sandcastle/boxes/fbsource (c66693c95)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="*unsqueeze*"
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
Buck UI: https://www.internalfb.com/buck2/16cf8f59-e535-493b-b123-5952ef8f1453
Network: Up: 21KiB  Down: 1.4MiB  (reSessionID-1219eefd-e78b-4bfd-aef8-8e4b38da82f8)
Jobs completed: 8. Time elapsed: 37.8s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 1, local: 2)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *unsqueeze*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.unsqueeze_0dto1d_dim0
[       OK ] VulkanAPITest.unsqueeze_0dto1d_dim0 (61 ms)
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim0
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim0 (0 ms)
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim1
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim1 (110 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim0
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim0 (16 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim1
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim1 (58 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim2
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim2 (2 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim0
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim0 (16 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim1
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim1 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim2
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim2 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim3
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim3 (1 ms)
[----------] 10 tests from VulkanAPITest (270 ms total)

[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (270 ms total)
[  PASSED  ] 10 tests.

```

Also, to improve my confidence in unit tests, I modified [force_flush.py](https://www.internalfb.com/code/fbsource/[6e606c6f62dafd2121e78ffe14ae12f1b6d8d405]/fbcode/wearables/camera/ml/pytorch_vulkan_native/demo/force_flush.py) to run several combinations of `aten::unsqueeze` on OD.

Verified these work as expected.
```
torch.zeros([])
torch.randn([])
torch.rand([])
torch.ones([])
torch.tensor(0, dtype=torch.float)
```

Found that Vulkan in general does not support the following. That's ok though since it's technically a 1d tensor which is not part of my task.
```
torch.tensor([])
```

Differential Revision: D53071189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118311
Approved by: https://github.com/liuk22
2024-01-26 04:22:54 +00:00
8467de4e97 Fix kaiser_window for lower precision data types on CPU (#117345)
Fixes #117230.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117345
Approved by: https://github.com/jgong5, https://github.com/soumith
2024-01-26 03:26:12 +00:00
eqy
ef29fe745f [CUDA] Add missing TF32 annotation to test_uint4x2_mixed_mm (#118143)
Addresses numerical mismatches seen on architectures with TF32.

CC @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118143
Approved by: https://github.com/nWEIdia, https://github.com/jansel
2024-01-26 03:23:22 +00:00
b599f5608c Fix mergeability check for ghstack PRs (#118258)
# Changes
* introduce a `--check-mergeability` trymerge flag that attempts to merge the PR locally, using the same merge logic as the mergebot, but requiring only a read-only `GITHUB_TOKEN` and a git repo.
* change the mergeability workflow to use the new --check-mergeability logic

# Alternatives considered

1.
> Rewrite `https://github.com/pytorch/test-infra/actions/workflows/pr-dependencies-check.yml` to correctly support partially merged ghstacks.

That would be a slightly better approach, but ROI is lower, as it requires reimplementing trymerge logic and additional effort to consolidate the codebase (trymerge lives in pytorch repo).

`pr-dependencies-check.yml` still produces human-readable results for partially merged ghstack PRs (even if it falsely reports them as non-mergeable).

2.

> Instead of introducing new trymerge flag, use existing flags, including `--dry-run`.

That didn't work, as no combination of existing flags skips the rule checks and ROCKSET lookups.

# Testing

1. Manual testing  `trymerge.py --check-mergeability`  on the regular and ghstack PRs:

```
export GITHUB_TOKEN=
export GIT_REPO_DIR=`pwd`
export GITHUB_REPOSITORY=pytorch/pytorch
export GIT_REMOTE_URL=https://github.com/pytorch/pytorch

# Test 1 (2 prs, 1 is closed)
python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability  117862
Skipping 1 of 2 PR (#117859) as its already been merged

echo $?
0

# Test 2 (3 prs, 1 is closed)
python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability  118125
Skipping 1 of 3 PR (#117859) as its already been merged

echo $?
0

# Test 3 (3 prs, intentional conflicts introduced into `main`):

python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability  118125
Skipping 1 of 3 PR (#117859) as its already been merged
stdout:
Auto-merging torch/_inductor/ir.py
Auto-merging torch/_inductor/lowering.py
CONFLICT (content): Merge conflict in torch/_inductor/lowering.py
error: could not apply 66ba5b8792f... Realize inputs to DynamicScalar before unwrapping
...
RuntimeError: Command `git -C /Users/ivanzaitsev/pytorch2 cherry-pick -x 66ba5b8792fa076c4e512d920651e5b6b7e466f4` returned non-zero exit code 1
```

2.  Workflow run:
https://github.com/pytorch/pytorch/actions/runs/7660736172/job/20878651852?pr=118258

<img width="516" alt="image" src="https://github.com/pytorch/pytorch/assets/108101595/28fbf0d2-ac2a-4518-b41d-b32b41373747">
<img width="621" alt="image" src="https://github.com/pytorch/pytorch/assets/108101595/ddbf8566-a417-43ec-9d0e-f623f4a71313">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118258
Approved by: https://github.com/PaliC, https://github.com/huydhn
2024-01-26 03:15:56 +00:00
4e456fd95b [AOTI] Support scalar to tensor in the ABI-compatible mode (#118024)
Differential Revision: [D53019485](https://our.internmc.facebook.com/intern/diff/D53019485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118024
Approved by: https://github.com/ezyang
2024-01-26 03:15:05 +00:00
66c3152e36 [CI] Build docker on larger runners (#118167)
Otherwise it takes 1+h to build CUDA12.1 docker
- Limit UCC builds to just sm_52(M60) and sm_86(A10G), which I think has the biggest impact
- Replace hardcoded `-j6` build parallelism with more dynamic `-j$[$(nproc) - 2]`
- Remove redundant check about Ubuntu-14.04
- Added `DOCKER_BUILDKIT` to parallelize the builds

As a result, the docker build time drops from 1+h to 35 min
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118167
Approved by: https://github.com/huydhn
2024-01-26 02:28:25 +00:00
385d8b32fc Update PocketFFT submodule (#118348)
Accidentally downgraded by force merge of https://github.com/pytorch/pytorch/pull/117804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118348
Approved by: https://github.com/kit1980
2024-01-26 02:01:06 +00:00
3cdd4e236e [inductor][easy] dump triton kernel names in the log (#118313)
This may help debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118313
Approved by: https://github.com/desertfire
2024-01-26 02:00:04 +00:00
7a9012d7e8 [tp] enable rowwise embedding sharding in RowwiseParallel (#118242)
As titled, this PR enables rowwise embedding sharding in the
RowwiseParallel style, and adds tests to ensure it's working as expected

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118242
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079, #118080
2024-01-26 01:36:24 +00:00
8cc02b46c3 [dtensor] implement dim-0 (row) embedding sharding with MaskPartial (#118080)
This PR adds support for rowwise sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement
and overrides the Partial contracts to construct the mask and release
it after the reduction.

The MaskPartial placement has the potential to support other ops
whose sharding computation requires a mask for semantic correctness.
Currently it lives in the embedding ops, but we can move it to a
common place if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118080
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079
2024-01-26 01:36:24 +00:00
3d062f9abe Revert "[pytorch][kineto] log process group config in distributed info (#117774)"
This reverts commit 9c1348feb3de872f7cabd807abbc228e7192cd46.

Reverted https://github.com/pytorch/pytorch/pull/117774 on behalf of https://github.com/aaronenyeshi due to This diff is breaking internal jobs, but has been internally reverted ([comment](https://github.com/pytorch/pytorch/pull/117774#issuecomment-1911251092))
2024-01-26 01:10:31 +00:00
6596a3f23d [Export] Remove ScriptObjectMeta (#118241)
Summary: As title. Use CustomObjArgument as ScriptObjectMeta

Test Plan: CIs

Reviewed By: zhxchen17

Differential Revision: D53062230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118241
Approved by: https://github.com/zhxchen17
2024-01-26 00:37:19 +00:00
401aa1a1de Use SEQUENTIAL posix_fadvise on mmapped files (#117805)
In theory this tells the system that we will access the file sequentially which allows prefetching future blocks. In practice it doubles the read-ahead size on Linux (which effectively doubles the read sizes).

Without this, CUDA uploads of files that aren't already in FS cache, using mmapped files (safetensors) as source, run at ~1 GB/s (from an SSD that has ~7 GB/s read speed...).

With this, they run at ~1.5 GB/s which is still bad but better than before!

It is possible to increase the read performance further by touching the pages from multiple threads; in fact, when the tensors loaded from the file are used by the CPU, we get fairly good load performance (~5 GB/s), which appears to be because multiple threads page fault and trigger more concurrent reads which improves SSD read throughput... however, this is not the case for CUDA uploads, and it is difficult to make that change in a generic way because it's unclear what the usage pattern of the input file is going to be.

All of the numbers above are taken on a Samsung 990 Pro SSD, on Linux kernel 6.5, with the FS cache cleared between every attempt to load a file. The file is loaded via `safetensors.safe_open`, which uses UntypedStorage.from_file to load the file into memory, which in turn uses MapAllocator.cpp.

I felt safe doing this change unconditionally, but please let me know if you'd like to see a separate allocator flag for this, propagated through to UntypedStorage. Note that the fadvise API is not available on macOS.
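
For illustration only (not the PyTorch C++ code; the file name and flow are hypothetical), the same hint can be given from Python before mmapping a file:

```python
import mmap
import os

# Advise the kernel that reads will be sequential so it grows the read-ahead
# window. os.posix_fadvise is available on Linux, not on macOS.
fd = os.open("model.safetensors", os.O_RDONLY)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
buf = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
```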
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117805
Approved by: https://github.com/mikaylagawarecki
2024-01-26 00:26:57 +00:00
de9ddd19a5 Various CI settings (#117668)
Test [ci-verbose-test-logs] (this worked: the test logs print while running, are interleaved, and are really long)

Settings for no timeout (the step timeout still applies; this only removes the ~30 min timeout for a shard of a test file) and for not piping logs / extra-verbose test logs (good for debugging deadlocks, but results in very long and possibly interleaved logs).

Also allows these to be set via the PR body if the label name is in brackets, e.g. [label name], as in the test above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668
Approved by: https://github.com/huydhn
2024-01-26 00:17:29 +00:00
8c167f9fc3 [CMake] Explicitly error out if CuDNN older than 8.5 (#118235)
Also update README.md
Fixes https://github.com/pytorch/pytorch/issues/118193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118235
Approved by: https://github.com/zou3519
2024-01-25 23:41:04 +00:00
71757093c5 [dynamo] avoid graph break on torch.backends.cuda.matmul.allow_tf32 (#118236)
Before the PR, we have a graph break for the following test:
```python
    def test_cublas_allow_tf32(x):
        if torch.backends.cuda.matmul.allow_tf32:
            return x.sin() + 1

        return x.cos() - 1
```

In this PR, we first add "torch.backends.cuda" to MOD_INLINELIST to trace through the python binding and get the actual call torch._C._get_cublas_allow_tf32, where it's already a TorchInGraphVariable. Because _get_cublas_allow_tf32 is accessing the same variable as at::globalContext().allowTF32CuBLAS(), which is guarded by dynamo as a global state [here](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp#L443), we could safely assume it returns a ConstantVariable during tracing.

After this pr, we get the following graph:
```python
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]     def forward(self, L_x_ : torch.Tensor):
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_x_ = L_x_
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:515 in test_cublas_allow_tf32, code: return x.cos() - 1
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         cos = l_x_.cos();  l_x_ = None
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sub = cos - 1;  cos = None
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         return (sub,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118236
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-01-25 23:40:23 +00:00
b5c9623835 [export] Add node meta into UnflattenedModule (#118138)
Summary: Reland of #117686

Test Plan: CI

Differential Revision: D53012028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118138
Approved by: https://github.com/zhxchen17
2024-01-25 23:37:41 +00:00
a93940b5db [export] Allow constant outputs + None input/outputs (#117894)
Added support for constant outputs. We will just embed the constant directly into the output, like `return (x, 1)`.
Also adds support for None inputs/outputs. None inputs are handled the same way as constants: a placeholder with no users is inserted into the graph, and the None is embedded into whatever operator uses it. None outputs are also handled the same way as constants: we embed them into the output, like `return (x, None)`.

Differential Revision: D52881070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117894
Approved by: https://github.com/zhxchen17
2024-01-25 23:37:34 +00:00
24133e44b1 Fix return type hint for list types (#118238)
All single element list types are `Tensor[]` so they will always be Tuple.
I don't know of any way to easily access the pyi type and compare that to a real run so no testing here :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118238
Approved by: https://github.com/ezyang
2024-01-25 23:35:20 +00:00
52c5803088 [NestedTensor] Support ragged_idx != 1 in pointwise ops (#118157)
This PR allows pointwise ops to operate on tensors with ragged_idx != 1. It does this by passing the ragged_idx metadata into the construction of the returned NestedTensor when computing pointwise ops. The assumption is that pointwise ops can operate directly on the values tensors, and the resulting tensor should have all the same metadata properties as the input tensors. For binary ops, a test is added to verify that two tensors with different ragged_idx cannot be added.

Previously:
* unary pointwise ops would error out when performed on nested tensors with ragged_idx != 1
* binary pointwise ops would produce tensors with nonsense shapes

Differential Revision: [D53032641](https://our.internmc.facebook.com/intern/diff/D53032641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118157
Approved by: https://github.com/jbschlosser
2024-01-25 23:34:15 +00:00
91d5f94f85 [FSDP] Idempotent reshard (#117997)
address assertion error "Expects storage to be allocated" by making reshard idempotent https://github.com/pytorch/pytorch/issues/117510

```pytest test/distributed/fsdp/test_fsdp_fine_tune.py -k test_parity_with_non_frozen_fsdp```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117997
Approved by: https://github.com/awgu
2024-01-25 23:29:23 +00:00
b10b08227a Passes process group to _all_gather_keys in dcp.load (#118301)
As title

Fixes #118277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118301
Approved by: https://github.com/Skylion007, https://github.com/fegin
2024-01-25 23:07:57 +00:00
02a411d4a6 [mergebot] Dry run for labels + easier to read Dr CI result (#118240)
Dry run for labels, so we can run trymerge locally with dry-run without actually affecting the PR

Make Dr.CI results easier to read (previously a massive json dump, now just the job names + ids, in a nicer format)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118240
Approved by: https://github.com/huydhn
2024-01-25 23:06:43 +00:00
26f1da0b1b Fix node traversal when setting up stacktrace preservation hooks (#118252)
We only want to traverse over each node in the graph exactly once, and we do that by inserting nodes into the "seen" set. The issue is that we forget to check the "seen" set when inserting the root nodes. Typically that is not a problem, because the root nodes come from the different outputs and thus usually correspond to different nodes. With split_with_sizes, though, all of the outputs correspond to the same node, and this leads to the node being iterated over 3 times, and 3 sets of hooks being attached to the same node.
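
A minimal sketch of the fixed seeding logic (the node/queue names here are illustrative, not the actual autograd code):

```python
def collect_nodes(root_nodes):
    # Check the "seen" set when seeding the worklist too, so several outputs
    # that map to the same grad_fn (e.g. split_with_sizes) add it only once.
    seen, queue = set(), []
    for root in root_nodes:
        if root is not None and root not in seen:
            seen.add(root)
            queue.append(root)
    while queue:
        node = queue.pop()
        yield node
        for nxt, _ in getattr(node, "next_functions", ()):
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
```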

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118252
Approved by: https://github.com/zou3519
ghstack dependencies: #117552, #118234, #118249
2024-01-25 22:56:20 +00:00
b8bd3bb30a Fix aot_autograd seq_nr logic (#118249)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118249
Approved by: https://github.com/zou3519
ghstack dependencies: #117552, #118234
2024-01-25 22:56:20 +00:00
3c77a3ed03 export ATen/native/sparse/*.h (#118274)
Fixes #ISSUE_NUMBER

We are trying to adapt `SparsePrivateUse1` in our code. However, I found that the sparse stubs have not been exposed yet, which makes it impossible for me to implement the stub and register it. I hope that the header files in this directory can be exposed. @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118274
Approved by: https://github.com/ezyang
2024-01-25 22:47:39 +00:00
fae569b4f2 [dynamo] avoid graph break on tensor.element_size() (#118229)
Before this PR, for the following code, we have a graph break `torch._dynamo.exc.Unsupported: torch.* op returned non-Tensor int call_method element_size`
```python
import torch
def f(x):
  return x.sin().element_size() + x.sin()

x = torch.randn(2, 2)
torch.compile(f, backend="eager", fullgraph=True)(x)
```
After this PR, we got the following graph, where element_size() is baked in as a constant.
```python
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]     def forward(self, L_x_ : torch.Tensor):
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_x_ = L_x_
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: /home/yidi/local/pytorch/test.py:4 in f, code: return x.sin().element_size() + x.sin()
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sin = l_x_.sin()
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sin_1 = l_x_.sin();  l_x_ = None
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add = 4 + sin_1;  sin_1 = None
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         return (add,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118229
Approved by: https://github.com/yanboliang, https://github.com/jansel, https://github.com/anijain2305
2024-01-25 22:28:37 +00:00
bd6bf97ea5 stop using torch.Tensor in dynamo/test_export_mutations.py (#118287)
This causes test flakiness, because torch.Tensor allocates a Tensor with
uninitialized memory.
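
For context, a small illustration of why that constructor is flaky (the replacement shown is just one deterministic option):

```python
import torch

x = torch.Tensor(2, 3)   # uninitialized memory: contents are arbitrary garbage
y = torch.zeros(2, 3)    # deterministic contents, safe for test comparisons
```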

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118287
Approved by: https://github.com/ydwu4
2024-01-25 22:21:41 +00:00
f7f7283ec7 Skip test_none_names_refcount under Dynamo-wrapped CI (#118309)
Fixes https://github.com/pytorch/pytorch/issues/117716
Dynamo does some things that modifies the refcount. Skipping this test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118309
Approved by: https://github.com/ydwu4, https://github.com/yanboliang, https://github.com/albanD
ghstack dependencies: #118152
2024-01-25 22:21:22 +00:00
4e45d791e7 Remove set_ exclusion in FakeTensor dispatch cache (#118154)
Summary: Now that set_ is marked as a view op, this special case is no longer necessary

Test Plan: CI exposed the need for this special case in the first place, so I think we can just rely on the existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118154
Approved by: https://github.com/bdhirsh
2024-01-25 21:54:36 +00:00
13bdd6c4e2 Revert "[Dynamo, ONNX] use environment variable ONNXRT_DUMP_PATH to dump onnx models created by onnxrt backend (#117551)"
This reverts commit 3221585af0f78cee20f1fb739e140ab59a517ee1 as this
commit was already landed as 83581f91ca9c3b78b0f8dc3a0a2c1cb229d20e99.
2024-01-25 13:41:39 -08:00
ea851eb027 Uses Serial Loader for DCP.save when more than one thread is used. (#118114)
The OverlappingCPU Loader is causing a major drop in performance when used with multiple threads. This PR is a temporary fix while we investigate why this is the case.

Benchmarks for save, using a 7.25GB FSDP model, as per the TSS benchmark. Both benchmarks run on 8 ranks.

Before this PR: 9.475 s (8 threads)

After this PR: 1.632 s (8 threads)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118114
Approved by: https://github.com/wz337, https://github.com/fegin
2024-01-25 21:11:16 +00:00
708e6241ed Fix sympy_subs to preserve integer and non-negative properties. (#118150)
This diff introduces the following changes:
1. Fix sympy_subs to preserve the integer and non-negative properties of the replaced symbol when the replacement is a string.
Why is this needed?
I was compiling the expression ``x*abs(y)`` where y = -2. This expression is passed as ``s1*abs(s0)``, and then s0 is replaced with ks0 via a call to sympy_subs. But sympy_subs used to replace s0 (integer=False, nonnegative=False) with ks0 (integer=True, nonnegative=True), resulting in ``x*abs(ks0) = x*ks0``, which is wrong.

2. Rename sympy_symbol to sympy_index_symbol to make the intent explicit.
3. Add an assertion that the replaced expression is never passed as a string but is always a sympy expression.
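
A small standalone illustration of the failure mode, using plain sympy (symbol names follow the example above):

```python
import sympy

s0 = sympy.Symbol("s0")                    # original symbol: no assumptions
s1 = sympy.Symbol("s1")
expr = s1 * sympy.Abs(s0)                  # s1*abs(s0)

bad = sympy.Symbol("ks0", integer=True, nonnegative=True)
print(expr.subs(s0, bad))                  # s1*ks0 -- Abs() dropped, wrong for s0 = -2

good = sympy.Symbol("ks0")                 # preserve s0's (lack of) assumptions
print(expr.subs(s0, good))                 # s1*Abs(ks0)
```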

Fixes https://github.com/pytorch/pytorch/issues/117757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118150
Approved by: https://github.com/ezyang
2024-01-25 20:54:55 +00:00
2de24c11f6 [inductor] Slightly faster memory allocation on CUDA (#118255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118255
Approved by: https://github.com/peterbell10
ghstack dependencies: #118065, #118070, #118171
2024-01-25 20:49:14 +00:00
3e76a0e9c2 Install an excepthook which annotates exceptions with rank information when distributed is initialized (#118190)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118190
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2024-01-25 20:43:18 +00:00
1565d58ad9 [inductor] correctly generate grid info for benchmark_kernel (#118202)
Previously, we generated the grid argument with tree.numel for
a benchmark TritonKernel. This was not correct, because it
didn't match the launch config used for profiling and running.

This PR fixed the issue by emitting the grid value computed
by the kernel's grid_fn, which is used by the profiler and
the kernel's runner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118202
Approved by: https://github.com/shunting314, https://github.com/jansel
2024-01-25 20:37:44 +00:00
b47cf4182e Fix support for non-tensor inputs to the operator.pos function (#118251)
Fixes #118231
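
A minimal sketch of the kind of case this addresses (the exact test and inputs are assumptions, not taken from the PR):

```python
import operator

import torch

@torch.compile(backend="eager", fullgraph=True)
def f(x, y):
    # y is a plain Python int, so operator.pos must handle a non-tensor input
    return x + operator.pos(y)

print(f(torch.randn(2), 3))
```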

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118251
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-01-25 20:37:40 +00:00
476b744e23 [AOTI] Forward fix https://github.com/pytorch/pytorch/pull/117989 (#118291)
Summary: https://github.com/pytorch/pytorch/pull/117989 disabled use_thread_local_cached_output_tensor for cuda, but that is not always valid, because we can still have cpu tensors when running cuda models.

Differential Revision: D53089956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118291
Approved by: https://github.com/Skylion007, https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/khabinov
2024-01-25 20:30:17 +00:00
1f6aa4b336 [mypy] Enable follow_imports = normal for mypy-torch.backends.* (#116311)
Summary:

Test Plan:

```
lintrunner --take MYPYINDUCTOR --all-files
ok No lint issues.

lintrunner -a
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116311
Approved by: https://github.com/int3
2024-01-25 20:17:22 +00:00
3221585af0 [Dynamo, ONNX] use environment variable ONNXRT_DUMP_PATH to dump onnx models created by onnxrt backend (#117551)
With this PR, if environment variable `ONNXRT_DUMP_PATH` is set, the backend onnxrt dumps every onnx it creates as well as the graph_module stored as a text file. This allows users to see what onnx file is generated when this backend is used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117551
Approved by: https://github.com/thiagocrepaldi, https://github.com/wschin
2024-01-25 20:00:14 +00:00
9768f73cb2 [AOTI] Skip test_index_put_with_none_index on rocm (#118290)
Summary: The test was added in https://github.com/pytorch/pytorch/pull/118187 and is failing on rocm.

Differential Revision: [D53089729](https://our.internmc.facebook.com/intern/diff/D53089729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118290
Approved by: https://github.com/DanilBaibak
2024-01-25 19:36:00 +00:00
83581f91ca [Dynamo, ONNX] use environment variable ONNXRT_DUMP_PATH to dump onnx models created by onnxrt backend (#117551)
With this PR, if environment variable `ONNXRT_DUMP_PATH` is set, the backend onnxrt dumps every onnx it creates as well as the graph_module stored as a text file. This allows users to see what onnx file is generated when this backend is used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117551
Approved by: https://github.com/thiagocrepaldi, https://github.com/wschin
2024-01-25 18:53:41 +00:00
bb3db079b1 [Export] Introduce class_fqn into CustomObjArgument (#118158)
Summary:
Class FQN is needed when unpacking CustomObj instance.
For all other Arguments, e.g. Tensor, TensorList, SymInt, we always know their exact type. However, CustomObjArgument had an opaque type.
Adding this field also helps unveil the type of this opaque object.

Test Plan: CI

Differential Revision: D53029847

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118158
Approved by: https://github.com/zhxchen17
2024-01-25 18:44:25 +00:00
fed0f2946f [FSDP][BE] Fix optim_state_dict_to_load doc errors (#118195)
As title

Differential Revision: [D53038703](https://our.internmc.facebook.com/intern/diff/D53038703/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118195
Approved by: https://github.com/rohan-varma, https://github.com/wz337
ghstack dependencies: #118197
2024-01-25 18:29:04 +00:00
01388d0790 [dynamo] Slightly better error message if key not in dict (#117902)
Was debugging an export issue, and currently when `key` does not exist in `self.items`, the error message is
```
  File "/opt/pytorch/torch/_dynamo/variables/dicts.py", line 208, in getitem_const
    return self.items[key]
           ~~~~~~~~~~^^^^^
torch._dynamo.exc.InternalTorchDynamoError: <torch._dynamo.variables.dicts.ConstDictVariable._HashableTracker object at 0x7fd7697cbf90>
```
This PR changes it to be the following.
```
File "/data/users/angelayi/pytorch/torch/_dynamo/variables/dicts.py", line 199, in getitem_const
    raise KeyError(arg.value)
torch._dynamo.exc.InternalTorchDynamoError: shape
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117902
Approved by: https://github.com/williamwen42
2024-01-25 18:13:40 +00:00
e1f9eca113 [DeviceMesh] Reuse sub_group pg if exists (#115716)
Currently, we create a new_group for the sub_group pg during mesh initialization. This PR changes this so we will:
1) re-use the sub_group pg if it exists,
2) create a new sub_group pg if it does not exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115716
Approved by: https://github.com/wanchaol
2024-01-25 18:07:16 +00:00
a289dba7b1 Add missing cuda libraries for context_gpu_test (#117493)
This adds some missing cuda (curand and cublas) libraries that are required for the context_gpu_test to link.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117493
Approved by: https://github.com/ezyang
2024-01-25 18:04:23 +00:00
eb054cc012 Revert "Fix Auto Functionalize to handle specified default values (#118035)"
This reverts commit 2d7a360911fb7b27be82c51ca86b4b34b6f1b087.

Reverted https://github.com/pytorch/pytorch/pull/118035 on behalf of https://github.com/zou3519 due to needs internal changes, reverting so we can land via co-dev ([comment](https://github.com/pytorch/pytorch/pull/118035#issuecomment-1910706841))
2024-01-25 17:53:15 +00:00
8810fdd21e fsdp: Unit test for ModuleWrapPolicy as a Callable (#117395)
We use `_or_policy` as a `Callable` to wrap a `ModuleWrapPolicy` instance as a `Callable`.

Fixes https://github.com/pytorch/pytorch/issues/109266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117395
Approved by: https://github.com/wconstab
2024-01-25 17:40:06 +00:00
c1e0674485 [DCP][BC] Remove the dependency on _shard.TensorProperties (#116248)
ShardedTensor is in maintenance mode and is going to be deprecated. DCP's metadata should not rely on any definitions in ShardedTensor. This PR creates a replica of TensorProperties in DCP and removes the dependency on _shard.TensorProperties

Differential Revision: [D52357732](https://our.internmc.facebook.com/intern/diff/D52357732/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116248
Approved by: https://github.com/wconstab, https://github.com/LucasLLC, https://github.com/wz337
2024-01-25 17:24:16 +00:00
316579e30c [FSDP2] Introduced initial fully_shard frontend (#117776)
This PR introduces the initial `fully_shard` frontend without any distributed logic that will be built into per-parameter-sharding FSDP.
- We design `fully_shard` to be a _module-level_ API (taking in an `nn.Module`), e.g. as opposed to a tensor-level one.
- We define a `FSDP` class and use a dynamic class swap, setting `module.__class__` to a newly created class that subclasses `FSDP` and `type(module)`, to allow FSDP to override and add methods on the module (see the sketch after this list).
    - We name this class as `FSDP<type(module)>`, e.g. `FSDPLinear` for `Linear`.
    - We disable the `deepcopy` because the state object inserted on the module will not be trivially `deepcopy`-able.
- Calling `fully_shard(module)` inserts a state object on `module` but not any of its children. This state object will be used for any FSDP-specific state.
- We raise an error on `ModuleList` or `ModuleDict` since they do not implement `forward()`, and FSDP will rely on `forward()` to insert logic (https://github.com/pytorch/pytorch/issues/113794).
- In the future, we will deprecate the existing `fully_shard` that calls into the same backend logic as `FullyShardedDataParallel` as there is no adoption for that and we prefer to reuse that name.
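
A minimal sketch of the dynamic class swap described in the second bullet (standalone toy code, not the actual FSDP2 implementation):

```python
import torch.nn as nn

class FSDP:
    # Stand-in for the real mixin that adds FSDP methods/state to the module.
    def is_fully_sharded(self) -> bool:
        return True

def fully_shard(module: nn.Module) -> nn.Module:
    # Subclass both FSDP and the module's own class, then swap __class__ so the
    # instance gains FSDP methods without being wrapped in another module.
    new_cls = type(f"FSDP{type(module).__name__}", (FSDP, type(module)), {})
    module.__class__ = new_cls
    return module

lin = fully_shard(nn.Linear(4, 4))
print(type(lin).__name__)      # FSDPLinear
print(lin.is_fully_sharded())  # True
```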

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117776
Approved by: https://github.com/wconstab, https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #117994, #118186, #117984
2024-01-25 17:22:07 +00:00
4f78869c18 [state_dict] Calls wait() for the DTensor to_local() result (#118197)
See the discussion in https://github.com/pytorch/pytorch/pull/117799.

There are some issues when returning an AsyncCollectiveTensor (we haven't found the
root causes), including OOM and unexpected values.

This PR forces `_gather_state_dict()` to be synchronous with respect to the main stream.

Differential Revision: [D53049807](https://our.internmc.facebook.com/intern/diff/D53049807/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118197
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-01-25 17:14:08 +00:00
817debeb89 [inductor] Slightly faster memory allocation on CPU (#118171)
Based on `python benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `12.2us`
- After `10.5us`

This is inspired by a2c17a2b00 -- but in Python rather than C++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118171
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #118065, #118070
2024-01-25 16:54:57 +00:00
d6b556bd98 Added "any" mode to register_multi_grad_hook (#117984)
This is a re-open of https://github.com/pytorch/pytorch/pull/115628/. This PR adds an `"any"` option to `register_multi_grad_hook` that runs the hook when the gradient of _any_ of the input tensors is computed. The existing functionality is folded under the default `"all"` mode.
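
A minimal usage sketch of the new mode (assuming the `mode="any"` keyword described above; in this mode the hook receives the single first-computed gradient):

```python
import torch
from torch.autograd.graph import register_multi_grad_hook

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

def hook(grad):
    # In "any" mode the hook fires once, with the first gradient computed.
    print("first grad arrived with shape", grad.shape)

handle = register_multi_grad_hook((a, b), hook, mode="any")
(a * 2).sum().backward()   # only `a` gets a gradient; the hook still runs
handle.remove()
```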

The multi-threaded test case is based on the existing one for `register_multi_grad_hook`. I would appreciate a closer look on that. ~~I am not sure about the hook signature (i.e. why we see two gradients in the hook that runs instead of just one, as [`register_hook`](https://pytorch.org/docs/stable/generated/torch.Tensor.register_hook.html) docs suggest).~~ It was because I was iterating over the 2 elements in the single tensor 😢 .

I did not update the `notes/autograd.rst`, which currently has a [blurb](https://pytorch.org/docs/stable/notes/autograd.html#special-hooks) on `register_multi_grad_hook`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117984
Approved by: https://github.com/soulitzer
ghstack dependencies: #117994, #118186
2024-01-25 16:25:52 +00:00
173777461c expose nested tensor header file (#117956)
This PR exposes the nested tensor related header files, which makes it easier for others to develop nested tensor related kernels in extension modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117956
Approved by: https://github.com/ezyang
2024-01-25 15:53:10 +00:00
865945cc1f Convert requires_cuda to full decorator (#118281)
Don't require using it as `@requires_cuda()`; use `@requires_cuda` instead. No need for the partial function to be invoked many times.

Split out this change from the initial large refactoring in #117741 to hopefully get merged before conflicts arise

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118281
Approved by: https://github.com/ezyang
2024-01-25 15:50:21 +00:00
87fb8b6218 [DTensor] Relaxed to_local requires_grad warning (#118186)
The existing warning in `DTensor.__new__()` checks `if requires_grad != local_tensor.requires_grad:` and warns with:

> To construct DTensor from `torch.Tensor`, it's recommended to use `local_tensor.detach()` and make `requires_grad` consistent.

Calling `local_tensor.detach()` will have the returned `Tensor` have `requires_grad=False`, so the error message refers to the case where `local_tensor.requires_grad is True` but the user passed `requires_grad=False` to `to_local()`.

However, there is the converse case, where `local_tensor.requires_grad is False` but the user passed `requires_grad=True`. In this case, the original `if requires_grad != local_tensor.requires_grad:` check succeeds, and the warning is emitted. However, the warning message does not apply in that case.

This can happen via `_prepare_output_fn` -> `redistribute` -> `Redistribute.forward()`, where `output.requires_grad is False` but it passes `requires_grad=input.requires_grad` which can be `True`.

We should not warn in this case since `Redistribute.forward()` is our own framework code, so I was proposing to relax the warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118186
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #117994
2024-01-25 15:49:32 +00:00
a5230e6019 [ez][docs] Fixed render of tensors in backward (#117994)
Before:
<img width="851" alt="Screenshot 2024-01-22 at 2 03 49 PM" src="https://github.com/pytorch/pytorch/assets/31054793/a71111ab-c7c4-4af5-a996-cbd42bcc8326">

After:
![Screenshot 2024-01-23 at 7 13 40 PM](https://github.com/pytorch/pytorch/assets/31054793/36db28a0-a96f-434c-a93f-fe78aff1e035)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117994
Approved by: https://github.com/soulitzer, https://github.com/weifengpy
2024-01-25 15:49:32 +00:00
8f973038d5 Update update_failures.py given feedback (#118237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118237
Approved by: https://github.com/drisspg
2024-01-25 15:42:01 +00:00
b5b36cf0c4 Fix failure of test_dynamo_distributed & test_inductor_collectives (#117741)
When CUDA is not available `c10d.init_process_group("nccl"...)` will fail with
> RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

Hence add a corresponding skip marker to the classes deriving from DynamoDistributedSingleProcTestCase next to the `requires_nccl` marker.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117741
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-01-25 13:25:36 +00:00
ee1dbb2acf [AOTI] Fix a None as index codegen issue (#118187)
Summary: Fix an ABI-compatible codegen issue when index_put has None in its indices.

Differential Revision: [D53047489](https://our.internmc.facebook.com/intern/diff/D53047489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118187
Approved by: https://github.com/chenyang78
ghstack dependencies: #118168, #118169
2024-01-25 11:53:44 +00:00
d1e661a1ce [AOTI] Add _scaled_dot_product_efficient_attention to C shim (#118169)
Summary: _scaled_dot_product_efficient_attention is used in some TIMM models

Differential Revision: [D53032358](https://our.internmc.facebook.com/intern/diff/D53032358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118169
Approved by: https://github.com/chenyang78
ghstack dependencies: #118168
2024-01-25 11:53:44 +00:00
5c7a18c5cb [AOTI] Refactor shim_common.cpp (#118168)
Summary: Use new_tensor_handle to reduce code repetition

Differential Revision: [D53032353](https://our.internmc.facebook.com/intern/diff/D53032353)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118168
Approved by: https://github.com/chenyang78
2024-01-25 11:53:29 +00:00
4b4e6550f2 Update oneDNN build option for older systems (#118057)
Fixes [#116623](https://github.com/pytorch/pytorch/issues/116623).

As we discussed in https://github.com/pytorch/pytorch/issues/116623#issuecomment-1900406773 and https://github.com/pytorch/pytorch/issues/116623#issuecomment-1900825829, we update the oneDNN build option to support older systems and document that we only support CPUs with SSE4.1+.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118057
Approved by: https://github.com/malfet
2024-01-25 11:34:51 +00:00
eebe7e1d37 Migrate update-viablestrict to test-infra (#118163)
In https://github.com/pytorch/test-infra/pull/4905, so that ExecuTorch can use the same GHA on their CI.

### Testing

https://github.com/pytorch/pytorch/actions/runs/7634906738/job/20799502532#step:2:15480
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118163
Approved by: https://github.com/clee2000
2024-01-25 07:07:34 +00:00
357a06f7c9 [ONNX] Fix type promotion pass (#118246)
Currently, when `node.meta["val"]` is `torch.Sym*`, its `hint` [is extracted](61865205b6/torch/onnx/_internal/fx/passes/type_promotion.py (L86)) and used in type promotion. However, it will [override](61865205b6/torch/onnx/_internal/fx/passes/type_promotion.py (L1409)) dynamic shape information carried in `node.meta["val"]` during [type propagation](61865205b6/torch/onnx/_internal/fx/passes/type_promotion.py (L1401)) and the FX graph seen in `onnxrt` always carries static shapes. Let's use `torch.Sym*` directly so that the type promotion propagates and stores dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118246
Approved by: https://github.com/titaiwangms
2024-01-25 07:04:18 +00:00
2c6a233c45 Report the type of a tensor in wrap_to_fake (#118220)
This could help diagnose why a tensor wasn't considered static.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118220
Approved by: https://github.com/albanD, https://github.com/bdhirsh
ghstack dependencies: #118215, #118217
2024-01-25 06:53:12 +00:00
8b95fb4eb8 Add stack trace to "start tracing" log (#118217)
When debugging problems on unfamiliar model code, I often want to know
"how did I end up in this compiled region."  Printing the stack trace at
tracing start lets me find out this information.

Looks like this:

```
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing f /data/users/ezyang/c/pytorch/b.py:3
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO] Stack (most recent call last):
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/b.py", line 9, in <module>
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     f(torch.randn(5))
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/eval_frame.py", line 437, in _fn
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/eval_frame.py", line 601, in catch_errors
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     return callback(frame, cache_entry, hooks, frame_state)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 743, in _convert_frame
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     result = inner_convert(frame, cache_entry, hooks, frame_state)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 386, in _convert_frame_assert
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     return _compile(
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 645, in _compile
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     guarded_code = compile_inner(code, one_graph, hooks, transform)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     r = func(*args, **kwargs)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 526, in compile_inner
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     out_code = transform_code_object(code, transform)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     transformations(instructions, code_options)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 151, in _fn
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 473, in transform
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     tracer = InstructionTranslator(
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/symbolic_convert.py", line 2030, in __init__
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     _step_logger()(
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/logging.py", line 55, in log
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     logger.log(level, "Step %s: %s", step, msg, **kwargs)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118217
Approved by: https://github.com/albanD
ghstack dependencies: #118215
2024-01-25 06:53:12 +00:00
2a178dade8 Augment create_symbol with user/infra backtrace fragment (#118215)
Looks like this:

```
[2024-01-24 11:59:41,656] [0/1] torch.fx.experimental.symbolic_shapes: [INFO] create_symbol s0 = 5 for L['x'].size()[0] [2, 9223372036854775806] at b.py:5 in f (_dynamo/variables/builder.py:1788 in <lambda>)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118215
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2024-01-25 06:53:12 +00:00
514159ddcb Add torch_dynamo to resume_in for ease of debugging (#118201)
resume_in_* code objects show up in user backtraces when failures occur
in code that has been Dynamo processed.  It is obvious to me, a PT2
developer, that these are generated by PT2, but it is NOT obvious to a
non-core dev that this has happened.  Add an extra torch_dynamo
breadcrumb to help get people to the right place.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118201
Approved by: https://github.com/albanD
2024-01-25 06:52:17 +00:00
5a83c47d98 [vision hash update] update the pinned vision hash (#117594)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117594
Approved by: https://github.com/pytorchbot
2024-01-25 05:33:01 +00:00
e0903b0720 [executorch hash update] update the pinned executorch hash (#118040)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118040
Approved by: https://github.com/pytorchbot
2024-01-25 05:27:53 +00:00
e5e9f390be [dynamo] Optimize overheads from _TorchDynamoContext (#118070)
Based on `python benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `18.1us`
- After `12.2us`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118070
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
ghstack dependencies: #118065
2024-01-25 05:04:56 +00:00
a40951defd [C10D] Fix nccl flightrecorder ignored dump timeout (#118142)
Don't call future.get() unless it's ready, because it waits.
Also, refactor the code a bit for simplicity.

We should do a follow-on PR to clean up the timeouts further, but this
should fix the glaring timeout bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118142
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118044, #118046, #118047
2024-01-25 04:25:36 +00:00
cyy
87335fabae [Exception] [6/N] Remove use of torch::TypeError (#117964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117964
Approved by: https://github.com/albanD
2024-01-25 03:35:58 +00:00
67300a11cb Support custom autograd Function forward AD return non-Tensor in forward (#118234)
Fixes https://github.com/pytorch/pytorch/issues/117491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118234
Approved by: https://github.com/albanD
ghstack dependencies: #117552
2024-01-25 03:24:29 +00:00
2d7a360911 Fix Auto Functionalize to handle specified default values (#118035)
Summary: When there were optionals with specified default values, the code improperly handled the number of parameters, causing `IndexError: tuple index out of range`

Test Plan: new tests

Differential Revision: D52977644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118035
Approved by: https://github.com/williamwen42
2024-01-25 01:22:12 +00:00
4a49e2b52d refactoring (#118111)
No real changes, just moving mutation checking skip to a helper file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118111
Approved by: https://github.com/bdhirsh
ghstack dependencies: #118110
2024-01-25 00:36:46 +00:00
4448f2a49d Log stack trace of mutated idx reland (#118110)
Relanding of https://github.com/pytorch/pytorch/pull/117720 with a fixed `next(iter(dict.values()))` instead of `next(dict.values())` and a corresponding test that would have caught the problem (as well as a type annotation that also would have).
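
For reference, the one-line difference behind the original revert:

```python
d = {"a": 1, "b": 2}

# next(d.values()) raises TypeError: 'dict_values' object is not an iterator.
first = next(iter(d.values()))   # correct: wrap the view in iter() first -> 1
```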

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118110
Approved by: https://github.com/bdhirsh
2024-01-25 00:30:03 +00:00
5b819d9ef0 Properly move retains_grad hook on in-place over view for base (#117552)
Fixes https://github.com/pytorch/pytorch/issues/117366
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117552
Approved by: https://github.com/albanD
2024-01-25 00:27:13 +00:00
9c1348feb3 [pytorch][kineto] log process group config in distributed info (#117774)
Summary: The process group config is essential for analyzing collective patterns. We have added this to the Execution Trace. Now we expose this information in Kineto as well.

Test Plan: Tested in HPC

Differential Revision: D52882292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117774
Approved by: https://github.com/wconstab, https://github.com/aaronenyeshi
2024-01-25 00:08:10 +00:00
89530c8590 [dynamo] Test for using torch.nn when replay_records are enabled (#116215)
This adds a reproducer for a failure that has since been fixed in main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116215
Approved by: https://github.com/jansel
ghstack dependencies: #116230, #116214
2024-01-24 23:42:35 +00:00
7c33ce7702 [CI] Install dill in ci (#116214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116214
Approved by: https://github.com/malfet
ghstack dependencies: #116230
2024-01-24 23:42:35 +00:00
b53cc6cf8d [dynamo] Fix test_replay_record.py (#116230)
This test isn't run in CI because the CI runners don't have dill installed.
This fixes the tests so they run for me locally, and in the next PR I add
dill to the CI so we can test it properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116230
Approved by: https://github.com/jansel
2024-01-24 23:42:35 +00:00
61865205b6 Deflake Dynamo stream tests (#118205)
Streams need to be synchronized; otherwise, there is undefined behavior.
This PR adds the necessary synchronization. This exposed some bugs
(https://github.com/pytorch/pytorch/issues/118204), so I just marked the
tests as expectedFailure.

Test Plan:
- tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118205
Approved by: https://github.com/yanboliang
2024-01-24 23:31:47 +00:00
5e0ef84b01 [dynamo] Refactor install_global_once, remove usages of install_global_unsafe (#118100)
We split install_global_once into two APIs:
- `install_global_by_id(prefix, value) -> name`: installs a global if it hasn't
been installed yet
- `install_global(prefix, value) -> name`: always installs the global (and
  generates a unique name for it)

Then, we refactor most callsites of `install_global_unsafe` to one of
the previous. Some callsites cannot be refactored because we create the
global name first, do a lot of stuff with it, and then install it.

This fixes more test flakiness.

Test Plan:
- Existing tests; I can't reliably repro the flakiness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118100
Approved by: https://github.com/ezyang, https://github.com/mlazos
2024-01-24 23:25:44 +00:00
2abb812a78 Check if enable inside run call (#118101)
In theory this way we never have to worry about subclasses calling super().setUp() ever again

Also, dynamically creating classes (ex via type in instantiate_device_type_tests) makes super() calls a bit odd
https://stackoverflow.com/questions/71879642/how-to-pass-function-with-super-when-creating-class-dynamically
https://stackoverflow.com/questions/43782944/super-does-not-work-together-with-type-supertype-obj-obj-must-be-an-i

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118101
Approved by: https://github.com/huydhn
2024-01-24 22:38:41 +00:00
dba160e676 [13/N][Dynamo] Refactor torch ctx manager classes check out of trace_rules.lookup (#118130)
I'm going to merge inline/skip/allow_in_graph check into ```trace_rules.lookup```, so it's better to make it only handle function types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118130
Approved by: https://github.com/williamwen42
2024-01-24 22:33:41 +00:00
4e29f01bf2 Remove sdp_kernel and replace with sdpa_kernel in attention namespace (#114689)
# Summary
Simplification of Backend Selection

This PR deprecates the `torch.backends/cuda/sdp_kernel` context manager and replaces it with a new context manager `torch.nn.attention.sdpa_kernel`. This context manager also changes the api for this context manager.

For `sdp_kernel` one would specify the backend choice by taking the negation of the kernel they would like to run. The purpose of this backend manager was only to be a debugging tool: "turn off the math backend" and see if you can run one of the fused implementations.

Problems:
- This pattern makes sense if the majority of users don't care to know anything about the backends that can be run. However, if users are seeking to use this context manager, then they are explicitly trying to run a specific backend.
- This is not scalable. We are working on adding the cudnn backend, and this API makes it so that more implementations will need to be turned off if the user wants to explicitly run a given backend.
- Discoverability of the current context manager. It is somewhat unintuitive that this backend manager is in backends/cuda/init when it now also controls the CPU fused kernel behavior. I think centralizing it in the attention namespace will be helpful.

Other concerns:
- Typically backends (kernels) for operators are entirely hidden from users and are implementation details of the framework. We have already exposed this to users, albeit not by default and with beta warnings. Does making backend choices even more explicit lead to problems when we potentially want to remove existing backends (perhaps input shapes will get covered by newer backends)?

A nice side effect: now that we aren't using the `BACKEND_MAP` in test_transformers, many, many dynamo failures are passing for CPU tests.
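
A short sketch of the new usage (backend and shapes here are chosen arbitrarily):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(2, 4, 8, 16)   # (batch, heads, seq_len, head_dim)

# New API: name the backend you want to run, instead of negating all the
# others as the old torch.backends.cuda.sdp_kernel context manager required.
with sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v)
```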

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114689
Approved by: https://github.com/cpuhrsch
2024-01-24 22:28:04 +00:00
77186af028 [DTensor][BE] re-enable test_dtensor_ops in CPU CI (#118134)
**Test**
`pytest test/distributed/_tensor/test_dtensor_ops.py`
This only runs the CPU tests and completes in 1 minute locally.
<img width="3002" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/bfbcaff0-2581-41a7-817d-f68e4041b8b1">

CI Run: https://hud.pytorch.org/pr/pytorch/pytorch/118134
Search for "distributed" test and click any of them. Then search for "test_dtensor_ops". Saw successful run of `test_dtensor_ops`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118134
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/wanchaol
ghstack dependencies: #117726, #118132
2024-01-24 22:11:51 +00:00
e6288820e3 Revert "Update triton ROCm version to 6.0" (#118179)
Reverting [this commit](https://github.com/pytorch/pytorch/pull/117433) due to failures observed in wheel environment e.g:
```
ImportError: /tmp/torchinductor_root/triton/0/ebfa57c0b7b95873c96cad6f9bca148d/hip_utils.so: undefined symbol: hipGetDevicePropertiesR0600`
```

Will revert for now and investigate and aim to re-land this as part of https://github.com/pytorch/pytorch/pull/116270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118179
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-01-24 22:01:27 +00:00
af9b6fa04e Revert "Check if enable inside run call (#118101)"
This reverts commit 6fc015fedc96e532da756e9408fcedb9c81a423f.

Reverted https://github.com/pytorch/pytorch/pull/118101 on behalf of https://github.com/clee2000 due to possibly causing failures on b025e5984ce30eed10df0cc89111e88983d823d3 ([comment](https://github.com/pytorch/pytorch/pull/118101#issuecomment-1908940940))
2024-01-24 21:26:35 +00:00
15608d8cb4 Add guardrails preventing complex params in LBFGS & SparseAdam (#118161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118161
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #118160
2024-01-24 21:22:47 +00:00
17ecd1e9cd Migrate test_complex_optimizer to OptimizerInfo (#118160)
This PR does what it says and more.

1. We increase coverage by a LOT! Previously, complex was not tested for many many configs, including foreach + maximize at the same time. Or the fused impls. Or just random configs people forgot about.
2. I rearranged the maximize conditional and the _view_as_real to preserve list-ness. This is needed for _view_as_real to function properly, I did add a comment in the Files Changed. This new order also just...makes more aesthetic sense.
3. Note that LBFGS and SparseAdam are skipped--they don't support complex and now we know.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118160
Approved by: https://github.com/mikaylagawarecki
2024-01-24 21:22:47 +00:00
6978c3ddf3 Removes an Incorrect Type Specification from AdaptiveMaxPool1d (#118162)
The return type for the forward pass of nn.AdaptiveMaxPool1d is specified to be Tensor, but if self.return_indices is set, the result type should be tuple[Tensor, Tensor].

For users trying to trace/script this function with indices, the incorrect typing is problematic.
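A small standalone sketch of the behavior in question (shapes chosen only for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 3, 16)

pool = nn.AdaptiveMaxPool1d(4)
y = pool(x)                        # a single Tensor when return_indices=False
print(y.shape)                     # torch.Size([2, 3, 4])

pool_idx = nn.AdaptiveMaxPool1d(4, return_indices=True)
out, idx = pool_idx(x)             # a (Tensor, Tensor) tuple when return_indices=True,
print(out.shape, idx.shape)        # which is what the annotation should reflect
```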
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118162
Approved by: https://github.com/albanD
2024-01-24 20:31:02 +00:00
821b2c543c [AOTI] Support .item() in the ABI-compatible mode (#117989)
Summary:

Differential Revision: [D52965076](https://our.internmc.facebook.com/intern/diff/D52965076)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117989
Approved by: https://github.com/ezyang, https://github.com/chenyang78
2024-01-24 20:17:59 +00:00
2f6fc33c20 Move skip sets into a new file. (#118032)
This PR moves the skip sets that lived in benchmarks/dynamo/torchbench.py into a more
readable YAML file, so that it is consumable from other projects (e.g. XLA).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118032
Approved by: https://github.com/lezcano, https://github.com/ezyang
2024-01-24 19:22:01 +00:00
e599a08796 [dtensor] rewrite embedding ops using op strategy (#118079)
This PR rewrites sharded embedding rule to use OpStrategy instead of the
rule, one step further to get rid of rules and consolidate the embedding
operator implementation, to prepare for rowwise embedding
implementation, which will come in next PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118079
Approved by: https://github.com/tianyu-l
2024-01-24 19:12:12 +00:00
b025e5984c Get Device instance with correct type when privateuse1 backend is registered (#117966)
Fixes #ISSUE_NUMBER
If a privateuse1 backend is registered, let torch.device return the corresponding Device instance when only an index is given.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117966
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-24 19:03:18 +00:00
6fc015fedc Check if enable inside run call (#118101)
In theory, this way we never have to worry about subclasses calling super().setUp() ever again.

Also, dynamically creating classes (e.g. via type in instantiate_device_type_tests) makes super() calls a bit odd
https://stackoverflow.com/questions/71879642/how-to-pass-function-with-super-when-creating-class-dynamically
https://stackoverflow.com/questions/43782944/super-does-not-work-together-with-type-supertype-obj-obj-must-be-an-i
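A short standalone sketch of that oddity (hypothetical names, just to show why zero-argument super() misbehaves in dynamically created classes):

```python
import unittest

class Base(unittest.TestCase):
    def setUp(self):
        self.ready = True

def set_up(self):
    # Zero-argument super() relies on the __class__ cell that is only created when
    # a function is defined inside a class body; a plain function attached via
    # type() has no such cell, so this call raises RuntimeError.
    super().setUp()

Derived = type("Derived", (Base,), {"setUp": set_up})

case = Derived("setUp")
try:
    case.setUp()
except RuntimeError as e:
    print(e)  # super(): __class__ cell not found
```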

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118101
Approved by: https://github.com/huydhn
2024-01-24 18:51:05 +00:00
fc135454ca [PT2][Optimus][Reliability]Fix a bug in gradients computation for runtime numeric check (#118105)
Summary:
We observed the following error when launching the e2e AFOC model test:
```
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
```
f524190245
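For context, a minimal standalone repro of this class of error and the `retain_graph=True` workaround mentioned in the message (unrelated to the AFOC model itself):

```python
import torch

x = torch.ones(3, requires_grad=True)
loss = (x * x).sum()        # mul saves its inputs for the backward pass
loss.backward()             # first backward frees the saved intermediate values
try:
    loss.backward()         # second backward raises the RuntimeError quoted above
except RuntimeError as e:
    print(e)

loss2 = (x * x).sum()
loss2.backward(retain_graph=True)  # keeps the graph (and saved tensors) alive
loss2.backward()                   # a second backward now succeeds
```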

Differential Revision: D53011463

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118105
Approved by: https://github.com/jackiexu1992
2024-01-24 18:45:10 +00:00
1e185c7803 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-24 18:42:14 +00:00
6e78592cbb Added type checking for ExportedProgram (#117231)
Fixes #116952

Added type checking for ExportedProgram in save function. Please review.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117231
Approved by: https://github.com/avikchaudhuri
2024-01-24 18:24:44 +00:00
af1ebc45d3 [quant][pt2e] Add fold_quantize=True for all convert_pt2e calls (#117797)
Summary: In preparation for enabling fold_quantize=True by default

Test Plan: CI

Differential Revision: D52879612

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117797
Approved by: https://github.com/andrewor14
2024-01-24 17:54:13 +00:00
90b3cf33ac [C10] Make Scalar constructable from longs (#118149)
On Linux and Mac `int64_t` is an alias to either `long` (Linux) or  `long long` (Mac)

Because of that, attempt to construct `c10::Scalar` from the other type will fail with `conversion from ‘long long int’ to ‘c10::Scalar’ is ambiguous`.

I.e. attempt to compile:
```cpp
int main() {
  c10::Scalar s = 1L;
}
```
on MacOS failed with:
```
foo.cpp:3:15: error: conversion from 'long' to 'c10::Scalar' is ambiguous
  c10::Scalar s = 1L;
              ^   ~~
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
      DEFINE_IMPLICIT_CTOR)
      ^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:62:3: note: candidate constructor
  Scalar(uint16_t vv) : Scalar(vv, true) {}
  ^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:63:3: note: candidate constructor
  Scalar(uint32_t vv) : Scalar(vv, true) {}
  ^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:64:3: note: candidate constructor
  Scalar(uint64_t vv) {
  ^

```

Prevent this by providing the missing constructors when needed. Alas, one cannot use SFINAE, as template constructors on Scalar mess up a lot of implicit conversions, so I use `static_assert`s to detect early on whether the premise for constructing this class holds.

Add ScalarTest::LongsAndLongLongs, which is essentially a compile-time test

Discovered while trying to enable AOTI on MacOS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118149
Approved by: https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #118077, #118076
2024-01-24 17:32:29 +00:00
880f9bb57e Remove xfails for consistently succeeding tests (#118152)
Fixes https://github.com/pytorch/pytorch/issues/117786, https://github.com/pytorch/pytorch/issues/117785
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118152
Approved by: https://github.com/yanboliang
2024-01-24 15:47:55 +00:00
bd99115276 [AOTI] Enable for MacOS (#118076)
- Add `darwin` to the list of supported platforms
- Add `#include <sstream>` to `aoti_runtime/model.h`
- Refactor Linux specific constant compilation logic to `_compile_consts_linux`
- Add `_compile_consts_darwin` that converts consts to .S file that is linked into a shared library
   - Patch file using magic to avoid converting bytes to large hexadecimal string
- Generate integer constants with `LL` suffix on MacOS (corresponds to int64_t definition)
- Enable test_aot_inductor.py tests on MacOS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118076
Approved by: https://github.com/desertfire
ghstack dependencies: #118077
2024-01-24 14:24:05 +00:00
a545ebc870 Switched macOS runners type to macos-m1-stable (#117651)
Switched macOS runners type to `macos-m1-stable`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117651
Approved by: https://github.com/huydhn
2024-01-24 11:55:13 +00:00
12662f4d95 [dynamo] add username in debug path (#117820)
Summary: Having no user name may cause conflicts and permission errors when people share a dev server

bypass-github-pytorch-ci-checks

Test Plan: ci

Differential Revision: D52895486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117820
Approved by: https://github.com/kflu, https://github.com/DanilBaibak
2024-01-24 10:14:20 +00:00
7d396918c6 [Inductor] Fix argument unused during compilation warning (#118077)
By not passing linker flag if `compile_only` is set to `True`
Before that change every invocation of AOTI compiler resulted in emitting at least 4 warnings:
```
clang: warning: -lomp: 'linker' input unused [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-shared' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-undefined dynamic_lookup' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-L/Users/nshulga/miniforge3/lib' [-Wunused-command-line-argument]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118077
Approved by: https://github.com/desertfire
2024-01-24 09:52:16 +00:00
50ead5d8ae [fx] add an option to not retrace when doing op fusion (#118120)
Summary: If the given model is already a graph module, we want to skip retracing in some cases.

Test Plan: CI

Differential Revision: D53018283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118120
Approved by: https://github.com/zyan0
2024-01-24 09:41:26 +00:00
c5702a0891 [dynamo] Optimize BACKEND_MATCH guard (#118065)
As measured by `benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `22.5us`
- After `18.1us`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118065
Approved by: https://github.com/ydwu4
2024-01-24 07:47:52 +00:00
ed0ec2e0be Remove dynamo runner's dependency on distributed build (#117903)
So that we can bisect faster without needing to rebuild the distributed module. We remove the annotation to avoid the flake8 undefined-name lint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117903
Approved by: https://github.com/xuzhao9
2024-01-24 06:51:14 +00:00
725f4b58ac Cache dfs path in propose_partitions and re-use that later when trying to find cycles in the graph (#115943)
Summary:
This diff introduces a caching mechanism to improve the performance of the partitioner in PyTorch. The changes involve adding a cache to store the DFS path of each node in the graph, which can be reused later when trying to find cycles in the graph.

This shows significant improvements for the edge use cases where the ASR model (which is around 6000+ nodes) used to take 26 minutes, but after this it takes around 8 minutes.
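A generic sketch of the caching idea (my own illustration, not the actual partitioner code): memoize the set of nodes reachable from each node so repeated cycle checks don't redo the same traversal.

```python
from functools import lru_cache

def make_reachability(graph):
    """graph: dict mapping node -> list of downstream nodes."""
    @lru_cache(maxsize=None)
    def reachable(node):
        seen = set()
        stack = [node]
        while stack:
            cur = stack.pop()
            for nxt in graph.get(cur, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return frozenset(seen)
    return reachable

graph = {"a": ["b"], "b": ["c"], "c": []}
reachable = make_reachability(graph)
# Adding an edge c -> a would create a cycle, since a already reaches c.
print("c" in reachable("a"))  # True
```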

Test Plan: Relying on the existing ExecuTorch CI tests that heavily use this partitioning mechanism and also tested out locally via Bento notebooks.

Differential Revision: D51289200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115943
Approved by: https://github.com/SherlockNoMad
2024-01-24 05:30:11 +00:00
d59c2d6e05 [dtensor] refactor partial redistribution logic (#113334)
This PR:

* Move the remaining placement transform, specifically the partial-related logic,
  from redistribute.py to placement_types
* Redefine the partial interface to make things more consistent, and add
  docs about the transformation relationships

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113334
Approved by: https://github.com/tianyu-l, https://github.com/XilunWu
ghstack dependencies: #118078
2024-01-24 04:56:16 +00:00
03205ff3ba [dtensor] make local_shard_size_on_dim be staticmethod (#118078)
As titled, this is so that we can use it for the case when we don't need
to construct a Shard placement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118078
Approved by: https://github.com/XilunWu
2024-01-24 04:56:16 +00:00
8d49737f2b [CUDA][Complex] Bump thresholds for conv3d (#118151)
Seeing a 1/1000 numerical mismatch

CC @coyotelll

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118151
Approved by: https://github.com/ezyang
2024-01-24 04:18:31 +00:00
46c228f0e2 [DTensor][BE] rename PlacementStrategy.output_spec to output_specs since now we support a tuple of DTensorSpec as output (#116437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116437
Approved by: https://github.com/wanchaol
2024-01-24 03:33:58 +00:00
26968cefb0 [DTensor][fix] re-enable [add]mm tensor test (#118132)
**Summary**
Re-enable tests that were disabled in #118045 as #117726 fixed the empty tensor case for DTensor [add]mm.

**Test Plan**
`pytest test/distributed/_tensor/test_dtensor_ops.py -s -k mm`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118132
Approved by: https://github.com/malfet
ghstack dependencies: #117726
2024-01-24 03:17:18 +00:00
155f27a97b [DTensor][fix] fix is_tensor_shardable to correctly handle Replicate placement (#117726)
**Summary**
Previously, the DTensor sharding plan filter (i.e. `is_tensor_shardable()`) could not correctly handle the case where the input `DTensor` has a 0-sized dimension. The filter should return `True` if the sharding placement on that dimension is `Replicate`, even if `tensor dim < num of shards` on that dimension, in which case `tensor dim == 0` and `num of shards == 1`.

In this PR we also noticed a behavior discrepancy of `torch.addmm`. See #118131

**Test Plan**
```
pytest test/distributed/_tensor/test_dtensor_ops.py -s -k addmm
pytest test/distributed/_tensor/test_dtensor_ops.py -s -k mm_cpu_float32
CUDA_VISIBLE_DEVICES="" pytest test/distributed/_tensor/test_matrix_ops.py -s -k empty_operand
pytest test/distributed/_tensor/test_matrix_ops.py -s -k empty_operand
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117726
Approved by: https://github.com/wanchaol
2024-01-24 03:17:18 +00:00
e9c240670f [sigmoid] Add canonicalized IR as an option. (#116758)
Summary: as title, the "canonical" flag is added to sigmoid serializer, so that we can optionally "normalize" the IR to give stable names and orders to IR nodes, which could help with the cases to compare IR definitions.

Test Plan: buck run @//mode/opt //aps_models/ads/config_model_authoring/stability:cli export-generated-module-state-command

Differential Revision: D52431965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116758
Approved by: https://github.com/avikchaudhuri
2024-01-24 03:11:25 +00:00
21e8546b11 [inductor][fx] Fix broadcast_tensors with unbacked symints when translation validation is off (#118066)
## Context
This is an example that runs into an AssertionError while lowering in Inductor.
```
# While lowering, b will be expanded because b.size(1) == 1.
a = torch.zeros([u0, 512])
b = torch.ones([u0, 1])
return a * b
```

Below's the tail-end of the stack trace. Here's the important bits:
1. In _inductor/sizevars.py, we'll call `self.shape_env.defer_runtime_assert(expr, msg, fx_node=V.graph.current_node)`.
2. This leads to the creation of a `ShapeEnvEvent` with an FX node via `kwargs={"fx_node": V.graph.current_node}` ([see](0c9b513470/torch/fx/experimental/recording.py (L245-L247))).
3. Eventually, we try to call `maybe_convert_node()` but it expects translation validation to be on ([see](0c9b513470/torch/fx/experimental/recording.py (L118-L121))).
```
  File "pytorch/torch/_inductor/lowering.py", line 221, in transform_args
    for i, x in zip(indices, broadcast_tensors(*[args[i] for i in indices])):
  File "pytorch/torch/_inductor/lowering.py", line 294, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "pytorch/torch/_inductor/lowering.py", line 676, in broadcast_tensors
    x = expand(x, target)
  File "pytorch/torch/_inductor/lowering.py", line 294, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "pytorch/torch/_inductor/lowering.py", line 793, in expand
    return TensorBox(ExpandView.create(x.data, tuple(sizes)))
  File "pytorch/torch/_inductor/ir.py", line 1871, in create
    new_size = cls._normalize_size(x, new_size)
  File "pytorch/torch/_inductor/ir.py", line 1862, in _normalize_size
    new_size[i] = V.graph.sizevars.expect_equals(
  File "pytorch/torch/_inductor/sizevars.py", line 338, in expect_equals
    self.expect_true(sympy.Eq(left, right), msg=msg)
  File "pytorch/torch/_inductor/sizevars.py", line 333, in expect_true
    self.shape_env.defer_runtime_assert(expr, msg, fx_node=V.graph.current_node)  # (1) is here
  File "pytorch/torch/fx/experimental/recording.py", line 257, in wrapper
    return event.run(self)   # (2) happens right before this
  File "pytorch/torch/fx/experimental/recording.py", line 155, in run
    replacearg(index=3, key="fx_node", fn=maybe_convert_node)
  File "pytorch/torch/fx/experimental/recording.py", line 138, in replacearg
    kwargs[key] = fn(kwargs[key])
  File "pytorch/torch/fx/experimental/recording.py", line 128, in maybe_convert_node
    assert hasattr(shape_env, "name_to_node")  # (3) is here
```

## Approach
Since [translation validation](c6be5d55a5/torch/fx/experimental/validator.py (L574)) may not be on during Inductor lowering, we can check if that's True and return the FX node's name in this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118066
Approved by: https://github.com/ezyang, https://github.com/peterbell10
2024-01-24 03:07:30 +00:00
41a56f7828 Fix swap_tensors to swap PyObjects associated with TensorImpl (#116955)
This PR intends to fix the following issue when swapping two tensors

```python
>>> import torch
>>> torch.manual_seed(5)
>>> t1 = torch.randn(2)
>>> t2 = torch.randn(3)
>>> t1
tensor([-0.4868, -0.6038])
>>> t2
tensor([-0.5581,  0.6675, -0.1974])
>>> torch.utils.swap_tensors(t1, t2)
>>> t1
tensor([-0.5581,  0.6675, -0.1974])
>>> t2
tensor([-0.4868, -0.6038])
>>> t1.fill_(0.5) # t1 back to its unswapped state :o
tensor([-0.4868, -0.6038])
```

What happens here is that in `THPVariable_Wrap` (which is used when going back from C++ --> Python), we check if the TensorImpl of the tensor to be returned already has a pointer to a PyObject in its PyObject slot. If this is the case then this object is returned.

57491d2046/torch/csrc/autograd/python_variable.cpp (L271-L292)

When we run any operation that returns the same TensorImpl (e.g. inplace op, `t.to(dtype=t.dtype)`, etc.), although `t1` now has `t2`'s TensorImpl, `t2`'s TensorImpl still has a reference to `t2`, so when we do the op on `t1` and `THPVariable_Wrap` attempts to return the pointer to the TensorImpl's PyObject, we return a pointer to `t2` instead.

The TensorImpls should have the PyObjects in their PyObjectSlots swapped as well in `swap_tensors`.
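A hedged sketch of the expected behavior once the PyObject slots are swapped too (the printed values describe the intent of the fix, not guaranteed output):

```python
import torch

torch.manual_seed(5)
t1 = torch.randn(2)
t2 = torch.randn(3)
torch.utils.swap_tensors(t1, t2)
t1.fill_(0.5)
print(t1)  # expected: tensor([0.5000, 0.5000, 0.5000]) -- t1 stays in its swapped state
print(t2)  # expected: the original t1 values, untouched by the fill_
```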

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116955
Approved by: https://github.com/albanD
2024-01-24 01:40:18 +00:00
fc30c4d769 Migrate forloop directional tests to OptimizerInfo (#117410)
This PR is another step towards modernizing our optimizer tests by tackling the simplest foreach tests. The replaced tests are now removed in `test/optim/test_optim.py`.

**Changes in coverage?** Yes!
- This PR _decreases_ coverage (!!!!) by only checking the direction on the forloop implementations vs both the forloop and foreach. Why? I believe it should be sufficient to check the forloop only, as the foreach parity is already checked in the `foreach_matches_forloop` test.
- This PR also _increases_ coverage for SparseAdam with contiguous params on CUDA, which was previously forbidden due to an old old bug that has since been fixed.

What will it take to fully remove `test_basic_cases`?
- We need to flavor the tests with LRSchedulers
- Testing for param groups, which all just distinguish between lrs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117410
Approved by: https://github.com/albanD
2024-01-24 01:28:40 +00:00
5b671ce486 [dynamo] fix typo in 3.11 resume_execution.py (#118108)
whoopsie

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118108
Approved by: https://github.com/angelayi, https://github.com/zou3519
2024-01-24 00:59:04 +00:00
b7b1affe97 Add half specializations for load of sum (#106454)
Add half specializations for load of sum

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106454
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-24 00:35:20 +00:00
c0732c8d5e [Dynamo] Add complex to literal constant (#117819)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117819
Approved by: https://github.com/zou3519
2024-01-23 23:46:46 +00:00
cd084c4909 Add TensorIteratorConfig::add_const_input to avoid COW materialize (#118053)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118053
Approved by: https://github.com/ezyang
2024-01-23 22:32:39 +00:00
abd759d50d [fx] Add hooks to intercept node replacements. (#117825)
Summary: Adding an experimental API to FX graph module to place "hooks" every time when we are changing or replacing nodes in a graph, so that we can properly update the new name in graph signature and potentially other places.

Test Plan:
buck test mode/opt  -c fbcode.enable_gpu_sections=true caffe2/test/distributed/_tensor/experimental:tp_transform

buck test mode/opt caffe2/test:test_export -- -r test_replace_hook

Differential Revision: D52896531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117825
Approved by: https://github.com/avikchaudhuri
2024-01-23 22:28:40 +00:00
b369888bec Replace constraints with dynamic_shapes in caffe2/test/cpp & torchrec/distributed/tests/test_pt2 (#118026)
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `caffe2/test/cpp` and `torchrec/distributed/test/test_pt2`.

Test Plan: CI

Differential Revision: D52977354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118026
Approved by: https://github.com/chenyang78
2024-01-23 22:15:15 +00:00
6ac284122b [Memory Snapshot] Track context for SEGMENT_FREE and SEGMENT_UNMAP (#118055)
Summary: Show the stack when SEGMENT_FREE and SEGMENT_UNMAP occur. This may be useful for debugging, such as when empty_cache() causes a segment to be freed. If the free context is unavailable, fall back to the segment allocation stack.

Test Plan: CI

Differential Revision: D52984953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118055
Approved by: https://github.com/zdevito
2024-01-23 21:48:57 +00:00
c6930aad46 Update Triton pin (#117873)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117873
Approved by: https://github.com/shunting314, https://github.com/malfet
2024-01-23 21:05:30 +00:00
13d2cdffa2 Remove optimizer.step patching for profiler hook (#115772)
1. I'd like to remove the patching that avoids the profiler hook, but it adds an additional graph break due to nested wrappers; see #117767 if interested, and the (internal only) pastes for [before](P996529232) and [after](P997507449) this PR.

```
I've locally run perf benchmarks for yolov3: Before the speedup is 4.183x, and after it is 4.208x.
I've also run it for resnet50: before, speedup is 3.706x and now it is 3.924x.
```

2. @mlazos I now unwrap twice in the dynamo and inductor tests. This feels like we're testing deficiently--should we add tests to test that tracing through the profiler hook and the use_grad hook are functioning according to expectations (I know there's at least one graph break in one).
3. There's a strange memory thing going on...what is happening? This has been resolved with @voznesenskym's [change](https://github.com/pytorch/pytorch/pull/116169). (for details see below)

<details>
This PR will fail the test_static_address_finalizer test due to a mysterious thing that is happening (idk what, but maybe the dynamo cache or a frame _expecting_ the patching to have been done).

There is no Python refcycle, as the backrefs for `p_ref()` look like:
![image](https://github.com/pytorch/pytorch/assets/31798555/4d6cbf50-3924-4efe-b578-d93389eebec8)
(so 5 backrefs but none of them python)

And the refs:
![image](https://github.com/pytorch/pytorch/assets/31798555/25e01105-bcb9-44ca-997a-2cf1670a6d42)
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115772
Approved by: https://github.com/jansel, https://github.com/mlazos
2024-01-23 20:15:41 +00:00
77705e7486 [dtensor] fix unnecessary redistribute in new_factory_strategy (#118037)
**Summary**
Previously, assuming `x` is a DTensor with non-replicate placement, calling `x.new_full` would create a replicated (but unused) copy of `x`, incurring unnecessary communications. This PR fixes the issue.

**Test**
`python test/distributed/_tensor/test_tensor_ops.py -k test_new_full`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118037
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-01-23 19:35:43 +00:00
58e7ec5843 Revert "Log stack trace of mutated idx (#117720)"
This reverts commit 365c7a292fedbf776014b878849ebd3dcb7463f0.

Reverted https://github.com/pytorch/pytorch/pull/117720 on behalf of https://github.com/eellison due to cause of https://github.com/pytorch/pytorch/issues/118104 ([comment](https://github.com/pytorch/pytorch/pull/117720#issuecomment-1906693119))
2024-01-23 18:40:20 +00:00
364728b27b Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-23 18:39:30 +00:00
5ec2d7959d Revert "[ez] Provide a slightly better error message if process times out (#117865)"
This reverts commit 5538b37a065e5a68c3fb9d1f8eaa3e4fd12fd0b8.

Reverted https://github.com/pytorch/pytorch/pull/117865 on behalf of https://github.com/clee2000 due to Does not play nice with retry_shell, which expects timeoutexpired, but i cant control the error message of that ([comment](https://github.com/pytorch/pytorch/pull/117865#issuecomment-1906640922))
2024-01-23 18:13:41 +00:00
6784594532 Fix sparse windows on CPU with MKL (#102604)
Fix https://github.com/pytorch/pytorch/issues/97352.
This PR changes the way linking to Intel MKL is done and updates MKL on Windows to mkl-2021.4.0.
For both conda and pip packages there are MKL versions against which you can link dynamically: mkl-devel contains the static versions of the dlls, and MKL contains the dlls needed at runtime. MKL dlls and static libs starting with 2021.4.0 have the version in their names (for MKL 2023 we have mkl_core.2.dll and for 2021.4.0 we have mkl_core.1.dll), so it is possible to have multiple versions installed and everything will work properly.
For the wheel build, I added a dependency on the MKL wheel, for conda a dependency on the conda MKL package, and for libtorch I copied the MKL binaries into libtorch.
In order to test this PR I had to use a custom builder: https://github.com/pytorch/builder/pull/1467

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102604
Approved by: https://github.com/IvanYashchuk, https://github.com/malfet
2024-01-23 17:41:18 +00:00
7598a4efdc [ROCm] Disable MIOpen for empty tensors for RNN (#117672)
Some MIOpen RNN functions (lstm, rnn, gru) can't work with empty tensors and return the error "MIOpen Error: Lengths must be > 0".
This PR disables MIOpen for empty tensors and forces the use of the native methods.
The solution is based on the condition used for cuDNN: 3a52147cc5/aten/src/ATen/native/TensorProperties.cpp (L91)
It also fixes [test_nn.py::TestNN::test_RNN_input_size_zero](29fa6fbc4e/test/test_nn.py (L4592)) on ROCm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117672
Approved by: https://github.com/cpuhrsch
2024-01-23 17:30:18 +00:00
0c9b513470 [Export] Fix serialize_metadata (#118031)
Summary: As title.

Test Plan: CI

Differential Revision: D52979069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118031
Approved by: https://github.com/zhxchen17
2024-01-23 17:03:04 +00:00
9ebaa27922 Fix types.MethodDescriptorType related bug in dynamo (#118041)
Methods that were `types.MethodDescriptorType` were failing because the `tensor.method()` to `method(tensor)` conversion was dropping the tensor and just calling `method`.
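For reference, a tiny standalone illustration of what a method descriptor is and why the receiver must be kept when rewriting `obj.method()` as `method(obj)` (plain Python, not the Dynamo internals):

```python
import types

# C-implemented methods show up on the class as method descriptors.
print(isinstance(str.upper, types.MethodDescriptorType))  # True

s = "hello"
# Rewriting s.upper() as str.upper(s) is only valid if the receiver stays as the
# first argument; calling str.upper() with no argument is the bug described above.
print(str.upper(s))  # "HELLO"
try:
    str.upper()
except TypeError as e:
    print(e)  # descriptor 'upper' of 'str' object needs an argument
```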

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118041
Approved by: https://github.com/yanboliang
ghstack dependencies: #118000
2024-01-23 16:11:38 +00:00
3b38f7b266 Remove skips for passing tests (#118000)
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 16:11:38 +00:00
3ec4f00316 [inductor] Allow reinplacing functionalized scatter ops (#116899)
This expands the reinplacing pass to allow reinplacing view-scatter operations.
e.g. if our python code is:
```
a = view1(inp)
b = view2(a)
b.copy_(src)
```
this generates a functionalized graph like:
```python
a = view1(inp)
a_updated = view2_scatter(a, src)
inp_updated = view1_scatter(inp, a_updated)
```

First, the `canonicalize_view_scatter_ops` step rewrites the functionalized graph
in the form:
```python
inp_updated = _generalized_scatter(inp, src, [view1, view2])
a_updated = view1(inp_updated)
```

I then register `_generalized_scatter` as a normal inplacable op which can be
handled by the pre-existing mechanism. Since we've fused the two scatter ops into one,
the reinplacing pass sees only one user of `inp` which allows the entire operation to be
reinplaced  if desired (and I add heuristics that sometimes choose not to reinplace).

Finally, there is a decomposition step which decomposes out-of-place or in-place
`_generalized_scatter` operations either back into view_scatter operations, or
into the version with mutations. When introducing mutations, the reinplaced
version is equivalent to the original mutation:
```
a = view1(inp)
b = view2(a)
b.copy_(src)
```

Or when out-of-place we end up with a minor restructuring of the graph:
```
a = view1(inp)
tmp = view2_scatter(a, src)
inp_updated = view1_scatter(inp, tmp)
a_updated = view1(inp_updated)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116899
Approved by: https://github.com/lezcano
ghstack dependencies: #116898, #117121
2024-01-23 15:31:28 +00:00
5502a63b22 [inductor] Allow reinplacing before meta-only users (#117121)
Currently if you have the code:
```python
idx = torch.arange(10, device=x.device)
src = torch.ones(10, dtype=x.dtype, device=x.device)
x.index_put_((idx,), src)
expand = x.expand((2, x.shape[0]))
```

The `index_put_` cannot be reinplaced under dynamic shapes due to the user
`aten.sym_size(x, 0)` however since this function only looks at the tensor
metadata, it is actually fine to reinplace.

Here I ignore these operators in the analysis of the reinplacing pass, so
reinplacing can happen under dynamic shapes as well. I also handle cases
where views are created just to be fed to `sym_size`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117121
Approved by: https://github.com/lezcano
ghstack dependencies: #116898
2024-01-23 15:31:28 +00:00
eb0fcab421 [inductor] Move reinplace pass to its own file (#116898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116898
Approved by: https://github.com/lezcano
2024-01-23 15:31:28 +00:00
e309d6fa1c Better unsupported op error message (#117770)
Previously, if someone wrote a python abstract impl but didn't import
the module it is in, then we would raise an error message suggesting
that the user needs to add an abstract impl for the operator.

In addition to this, we suggest that the user try importing the module
associated with the operator in the pystub (it's not guaranteed that
an abstract impl does exist) to avoid confusion.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117770
Approved by: https://github.com/ydwu4, https://github.com/williamwen42
2024-01-23 15:05:16 +00:00
4d625c1c92 [AOTI] Fix a bug in the torch._export.aot_load API (#118039)
Summary:
tree_flatten_spec should use args instead of *args

clone of https://github.com/pytorch/pytorch/pull/117948 but with some fbcode specific changes

Test Plan: CI

Differential Revision: D52982401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118039
Approved by: https://github.com/angelayi
2024-01-23 14:54:02 +00:00
bff348b28f [AOTI] Add missing include to model.h (#118075)
At least if one tries to compile the AOTI code on Darwin, compilation
fails with an "implicit instantiation of undefined template" error:
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3:
/Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:69:21: error: implicit instantiation of undefined template 'std::basic_stringstream<char>'
  std::stringstream ss;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118075
Approved by: https://github.com/desertfire
ghstack dependencies: #118074
2024-01-23 14:34:00 +00:00
2963e85a3f [EZ][AOTI] Fix typos (#118074)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118074
Approved by: https://github.com/desertfire
2024-01-23 14:34:00 +00:00
ae459c5809 Don't use private accessor on SymNode to get _expr (#118007)
This materially impacts https://github.com/pytorch/pytorch/pull/117862
split this out for testing

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118007
Approved by: https://github.com/tugsbayasgalan
2024-01-23 14:29:19 +00:00
73c9be1395 Don't use private accessor on SymNode to get _expr (round 2) (#118013)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118013
Approved by: https://github.com/tugsbayasgalan
2024-01-23 14:29:12 +00:00
905a7cc340 [ROCm] skip test_eager_transforms.py test_compile_vmap_hessian_cuda (#118009)
Memory leak detected on ROCm.  Skip until it can be addressed.

PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test_eager_transforms.py -k test_compile_vmap_hessian_cuda

See #117642 for moving rocm CI to unstable due to this test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118009
Approved by: https://github.com/jeanschmidt
2024-01-23 09:57:18 +00:00
4cfd16cb6d [Inductor] optimize transpose_mxn with bf16 data type (#117958)
**Summary**
Add a vectorized implementation of `transpose_mxn` for the BFloat16 data type when the matrix size is 16x16 or 32x32, as observed in Stable Diffusion BF16.

**TestPlan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_transpose_mxn_16_16_bf16_fp16
python -u -m pytest -s -v test_cpu_repro.py -k test_transpose_mxn_32_32_bf16_fp16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117958
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-01-23 09:43:35 +00:00
40890ba8e7 [CI] Add python test skip logic for XPU (#117621)
Add python test skip logic for XPU

For testing purposes, #116833 & #116850 were cherry-picked first, and the XPU test passed: https://github.com/pytorch/pytorch/actions/runs/7566746218/job/20604997985?pr=117621. They are now reverted.

Works for #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117621
Approved by: https://github.com/huydhn
2024-01-23 08:20:42 +00:00
455bba38f4 [C10D] Make Flight Recorder report time_created in ns (#118047)
Addresses (6) from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118047
Approved by: https://github.com/zdevito
ghstack dependencies: #118044, #118046
2024-01-23 08:18:08 +00:00
5df92a9244 [C10D] Add version tag to NCCL Flight Recorder Dump (#118046)
Addresses (3) from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118046
Approved by: https://github.com/zdevito
ghstack dependencies: #118044
2024-01-23 08:18:08 +00:00
dace1fda2e [C10D] Make NCCL Flight Recorder dump produce a dict (#118044)
Putting the list of entries into a particular key of a top-level dict
paves the way for adding other metadata as other top level keys.

Addresses 1 and 2 from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118044
Approved by: https://github.com/zdevito
2024-01-23 08:18:08 +00:00
28c8a07b4d add mask_convert_to_lp to support bool->fp16/bf16 convert (#117830)
Fix
https://github.com/pytorch/pytorch/issues/117624
https://github.com/pytorch/pytorch/issues/117627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117830
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-01-23 07:52:43 +00:00
6049998971 [C10D] Finer-grain nccl heartbeat, avoid false positive hangs (#118016)
Summary:
Previously, the heartbeat was incremented once per completed for loop over a list
of in-progress work items, under the assumption that the processing would either
be predictably quick or hang completely.

In fact, there can be cuda API contention that causes the processing of works
to slow down arbitrarily but not truly deadlock.  To guard against this, we
bump the heartbeat at the smallest unit of progress, one work item being
successfully processed.
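A generic Python sketch of the change in granularity (purely illustrative; the real logic lives in the C++ NCCL watchdog):

```python
def process_pass(in_progress, heartbeat):
    """New behavior sketched here: bump the heartbeat per completed work item,
    so slow-but-alive processing still shows progress; previously the bump
    happened only once per pass over the whole list."""
    remaining = []
    for work in in_progress:
        if work():                 # work() returns True once the item has completed
            heartbeat[0] += 1      # smallest unit of progress
        else:
            remaining.append(work)
    return remaining

heartbeat = [0]
pending = process_pass([lambda: True, lambda: False, lambda: True], heartbeat)
print(heartbeat[0], len(pending))  # 2 1
```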

Test Plan: CI

Differential Revision: D52973948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118016
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
2024-01-23 07:25:18 +00:00
a8978d3676 [dynamo] Add size(), get_coordinate() support for DeviceMesh in dynamo (#117710)
Summary: This fix is part of: https://github.com/pytorch/pytorch/issues/117670

Test Plan: Unit tetst and CI

Differential Revision: D52857348

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117710
Approved by: https://github.com/wconstab, https://github.com/yanboliang, https://github.com/wanchaol, https://github.com/anijain2305
2024-01-23 07:10:52 +00:00
bb28965924 Revert "Remove skips for passing tests (#118000)"
This reverts commit 3c339b5b21fdbd530f82765f84bcabde8266d3e0.

Reverted https://github.com/pytorch/pytorch/pull/118000 on behalf of https://github.com/oulgen due to test passing on diff but failing on hud... ([comment](https://github.com/pytorch/pytorch/pull/118000#issuecomment-1905351752))
2024-01-23 06:10:25 +00:00
d84173c025 [export] fix unlifting of custom class constants (#117979)
We didn't have a test covering this case; add one.

Aside: we should invest in actually unit testing the lifting/unlifting passes, both separately and also against each other. I have a diff cooking for that.

Differential Revision: [D52962180](https://our.internmc.facebook.com/intern/diff/D52962180/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117979
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #115222, #117978
2024-01-23 05:51:00 +00:00
7b0979ef8e [export] fixes to unflatten + custom obj composition (#117978)
The test I added for this didn't actually enable torchbind tracing, oops. Fix that and fix the issues that cropped up.

Differential Revision: [D52962205](https://our.internmc.facebook.com/intern/diff/D52962205/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117978
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #115222
2024-01-23 05:50:41 +00:00
e056cf5507 [ac][pattern matcher] Do not percolate tags beyond the inputs of matched portion (#118034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118034
Approved by: https://github.com/yf225
2024-01-23 05:02:32 +00:00
3708f2608e [DTensor] Skip [add]mm empty tensor test (#118045)
As DTensor does not support multiplication of [4,0] and [0,4] matrices

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118045
Approved by: https://github.com/yf225, https://github.com/wanchaol
2024-01-23 04:08:11 +00:00
0036385b55 [Inductor][Reliability] Add runtime numeric check for pt2 Optimus in the pre grad pass (#115142)
Summary: Titled

Test Plan:
# local reproduce
Patch `icfg.fx_passes_numeric_check["pre_fx_passes"] = True`
```
buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
P965217137

# MC candidates
### FIRST + CMF
f520754604
P1056796962
### ICVR
f520816217
P1056839342
### IG_CTR
f520819178
P1056903302
### MAI
f520823559
P1057712009
### AFOC
f520822438
P1057760058
### DPA
f520826815
P1057808574
### How the runtime numeric check to catch [SEVs](https://docs.google.com/document/d/1WOtlbgCBbmU1klK1LiGSO0lYf_7mtSP4nAdvhQHM0JE/edit#heading=h.k61fy2rhaijp)
bug fix diff: D51378532
### CMF+(FIRST)
f509587388
P1058305139
by running the numeric check, we can catch the forward loss differences (e.g., diffing(https://www.internalfb.com/intern/diffing/?paste_number=1058293804))
https://pxl.cl/4bQDG

f501760099
P1058400691
by running the numeric check, we can catch the forward loss differences (e.g., diffing(https://www.internalfb.com/intern/diffing/?paste_number=1058412054))
https://pxl.cl/4bQMw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115142
Approved by: https://github.com/jackiexu1992, https://github.com/yanboliang
2024-01-23 03:56:50 +00:00
3c339b5b21 Remove skips for passing tests (#118000)
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 03:41:23 +00:00
4646d0e1b2 Update xla.txt (#117999)
XLA CI is currently broken in PyTorch; I think there are 2 reasons causing that:
1. There is an offending PyTorch PR, c393b2f1ee. Han is working on a fix in https://github.com/pytorch/xla/pull/6345
2. The commit the PyTorch pin pointed to, 2990cb38c17e06d0dbe25437674ca40130d76a8f, was not a valid commit. I think this is because we tried to help land a breaking PR in https://github.com/pytorch/xla/pull/6307, but a rebase made that commit vanish, so the CI now fails with
```
fatal: reference is not a tree: 2990cb38c17e06d0dbe25437674ca40130d76a8f
585
```
Let me first update the pin to master so it at least runs some tests; this way we can discover whether there are any additional issues. I will rebase after @qihqi's fix passes all CI.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117999
Approved by: https://github.com/clee2000
2024-01-23 03:36:32 +00:00
fed45aee54 Replace invoking self.value if there is a user defined init, avoiding arbitrary code execution (#117818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117818
Approved by: https://github.com/ezyang
2024-01-23 03:14:58 +00:00
dc1b9d758e Update passrate calculation script to skip inductor and export (#118030)
We don't want to count running test/inductor/ and test/export/ with
PYTORCH_TEST_WITH_DYNAMO=1 as a part of the pass rate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118030
Approved by: https://github.com/ydwu4
ghstack dependencies: #117998
2024-01-23 02:33:57 +00:00
162f643090 Script to generate failures histogram (#118008)
Generates something that looks like
https://gist.github.com/zou3519/43aa8ef28a327bd68cfbac83d84c0999
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118008
Approved by: https://github.com/yanboliang, https://github.com/oulgen
2024-01-23 02:28:55 +00:00
af7cd5c32a [Dynamo] Install module globals per output_graph (#117998)
Fixes https://github.com/pytorch/pytorch/issues/117851

In tests, we ran into an issue where:
- In frame A, Dynamo would install a global
- We call reset()
- reset() did not delete the installed global due to a refcycle
- In frame B, Dynamo would re-use the same global
- Python gc ran, deleting the installed global, leading to the compiled
  version of frame B raising NameNotFound

This PR changes the following:
- module globals are now installed at a per-frame basis.
- renames install_global to install_global_unsafe: if the names are not
  unique and end up being re-used across frames, then we've got trouble.

Test Plan:
- I tested that this got rid of the test flakiness locally. I'm not sure
  how to easily write a test for this, because I don't actually know
  what the refcycle in the above is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117998
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-01-23 02:28:02 +00:00
a85fd20d45 [ONNX] Improve support to mmap for ONNXProgram.save (#117863)
Currently, when the user passes a model state_dict which is not a file,
ONNXProgram.save calls torch.save along with io.BytesIO, which does not
support memory mapping. That causes the file stream to be fully allocated
in memory.

This PR removes the torch.save call and passes the dict directly to the
serializer. This is beneficial for the scenario where model_state_dict
is generated by torch.load(..., mmap=True), as the state dict will be
memory-mapped instead of fully loaded in memory.
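A small sketch of the scenario being optimized, assuming a checkpoint on disk (the ONNXProgram.save call itself is elided, since its exact signature isn't shown here):

```python
import torch
import torch.nn as nn

torch.save(nn.Linear(4, 4).state_dict(), "checkpoint.pt")

# mmap=True keeps tensor storages memory-mapped rather than fully loading them into RAM.
state_dict = torch.load("checkpoint.pt", mmap=True, map_location="cpu")

# With this PR, passing such a dict to ONNXProgram.save no longer round-trips it
# through torch.save + io.BytesIO, so the memory mapping is preserved.
```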

This PR leverages https://github.com/pytorch/pytorch/pull/102549
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117863
Approved by: https://github.com/wschin
2024-01-23 02:00:00 +00:00
052860294f Replace constraints with dynamic_shapes in export-to-executorch tutorial (#117916)
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in export-to-executorch tutorial.

Test Plan: CI

Differential Revision: D52932772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117916
Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri
2024-01-23 01:17:19 +00:00
d810b10232 Add beta1 support to CyclicLR momentum (#113548)
Fixes #73910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113548
Approved by: https://github.com/janeyx99
2024-01-23 01:16:58 +00:00
d01ba4e94e enable fp8 cast for inductor CPU (#117737)
Enable FP8 cast for this issue https://github.com/pytorch/pytorch/issues/117119.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117737
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-01-23 01:16:15 +00:00
d8420c0b0c [Nested Tensor]Add helper functions to set max_seqlen/min_seqlen directly (#117815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117815
Approved by: https://github.com/soulitzer
2024-01-23 01:00:45 +00:00
a27a6e8cf1 [ROCm] skip test_sparse_csr test_triton_bsr_softmax_cuda (#118006)
The tests were taking too long and leading to CI timeouts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118006
Approved by: https://github.com/huydhn
2024-01-23 00:09:42 +00:00
c6be5d55a5 Migrate param_group testing to OptimizerInfo (#117675)
Today, our param_group testing does the equivalent of pitting weight and bias against different optimizer hyperparams and then checking that the overall result is going in the right direction based on maximize.

This PR introduces two tests to encompass coverage:
1. For every optimizer input (no differentiable), always force bias to have 0 weight_decay, and then check that the direction is expected. This is basically a replica of today's tests, but is more methodical as the test is a real use case.
2. To ensure that the different groups have distinct behavior, I added another test where lr is basically 0 in default group, and ensure that the param in the default group doesn't move while loss does.

Together, these tests do a better job of testing param groups than today's tests, **though we do lose some flavors**. For example, RMSProp also pits centered=True vs False across the param_groups, Adadelta has a variation on rho, and ASGD has a variation for t0. I don't think this is really a loss, as the previous test was just testing for direction and our new tests test stronger guarantees.

The leftover param group configs are used in conjunction with LRSchedulers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117675
Approved by: https://github.com/albanD
2024-01-22 23:48:46 +00:00
d280b6ae58 Ensure that deleter is called even for a no-data tensor. (#117418)
Summary:

When using a custom deleter, InefficientStdFunctionContext was using a
std::unique_ptr<> to store the pointer and call the deleter - but this failed to
call the deleter if the pointer was null. Since we have a separate holder class
anyway, take out the std::unique_ptr<> and call the deleter directly.

Fixes #117273

Test Plan:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117418
Approved by: https://github.com/wjakob, https://github.com/yanboliang
2024-01-22 23:27:27 +00:00
cef5b93f28 [ez] Serial when NUM_PROCS is 1 (#117977)
Makes it easier to understand whats going on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117977
Approved by: https://github.com/huydhn
2024-01-22 23:11:41 +00:00
f9fca33baf [codemod][highrisk] Fix shadowed variable in caffe2/caffe2/onnx/onnx_exporter.cc (#117996)
Summary:
Our upcoming compiler upgrade will require us not to have shadowed variables. Such variables have a _high_ bug rate and reduce readability, so we would like to avoid them even if the compiler was not forcing us to do so.

This codemod attempts to fix an instance of a shadowed variable. Please review with care: if it's failed the result will be a silent bug.

**What's a shadowed variable?**

Shadowed variables are variables in an inner scope with the same name as another variable in an outer scope. Having the same name for both variables might be semantically correct, but it can make the code confusing to read! It can also hide subtle bugs.

This diff fixes such an issue by renaming the variable.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: igorsugak

Differential Revision: D52582853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117996
Approved by: https://github.com/PaliC, https://github.com/kit1980, https://github.com/malfet
2024-01-22 22:57:06 +00:00
b901999350 [inductor] For View.create(x, sizes) call realize_input() instead of realize() when handling unbacked symints (#117013)
# Context
Let's say we do `View.create(x, sizes)` where `x` is a `SliceView` and `sizes` contains unbacked symints, e.g. `sizes = [i14, 256]`. Then we'll run ([this code](7e37f63e5e/torch/_inductor/ir.py (L2058-L2071))) where we:
1. Call `x.realize()` -- SliceView(Pointwise) -> SliceView(ComputedBuffer).
2. Retrieve storage & layout via `as_storage_and_layout(x)`
3. Calculate `new_layout` based off layout & `new_sizes`
3. `return ReinterpretView(storage, new_layout)`
However, (2) will raise `NotImplementedError` ([see](7e37f63e5e/torch/_inductor/ir.py (L1704-L1731))) since `x` is a `SliceView` and that isn't supported.

Thus, I tried adding support for `SliceView` in `as_storage_and_layout`. This worked for my case, but if instead `sizes` had backed symints e.g. `sizes=[s0, 256]` then some benchmarked models lost accuracy.
```
    if isinstance(x, SliceView):
        return as_storage_and_layout(
            x.data,
            freeze=freeze,
            want_contiguous=want_contiguous,
            stride_order=stride_order,
        )
```

So instead of the above, I tried unwrapping the `SliceView` via `x = x.unwrap_view()`. This works for my usecase and passes CI but I'm not entirely sure why. If we unwrap our `SliceView` and create a `ReinterpretView`, I'd assume we'd lose the reindexer from `SliceView`. ~~But maybe we can re-create the same indexing from the `ReinterpretView`'s strides?~~ edit: we do lose vital information (like offset) when you release your `SliceView` and create a `ReinterpretView` so that's a no-go.

Moving onto the final version of this PR. We call `ExternKernel.realize_input()` (feels a bit weird to use `ExternKernel` but it's exactly what I need). It will go ahead and handle our `SliceView` case ([see](a468b9fbdf/torch/_inductor/ir.py (L3733-L3739))) by converting it to a `ReinterpretView` with the correct offset.

# Test
```
$ python test/inductor/test_unbacked_symints.py
..
----------------------------------------------------------------------
Ran 10 tests in 20.813s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117013
Approved by: https://github.com/jansel, https://github.com/ezyang
2024-01-22 22:34:10 +00:00
f96b7d06d7 [export] skip export tests when test with dynamo in ci (#117988)
Fixes https://github.com/pytorch/pytorch/issues/117947.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117988
Approved by: https://github.com/suo, https://github.com/zou3519
2024-01-22 22:14:36 +00:00
c14751b6cf Remove extraneous [[fallthrough]] in ivalue.cpp (#117985)
Test Plan: Sandcastle

Differential Revision: D52963965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117985
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-01-22 21:54:39 +00:00
b5799d9977 Revert "[c10d] Barrier uses stream sync instead of device sync (#117804)"
This reverts commit 0f6bbb1c070c3a9713893659377e20e147c125f6.

Reverted https://github.com/pytorch/pytorch/pull/117804 on behalf of https://github.com/clee2000 due to sorry the docs test failure is real, I think it wants the lines after the .. note to be indented https://github.com/pytorch/pytorch/actions/runs/7616827874/job/20745016788.  Marking as nosignal due to bad Dr. CI categorization ([comment](https://github.com/pytorch/pytorch/pull/117804#issuecomment-1904882487))
2024-01-22 21:54:03 +00:00
792dfa7e16 Allow dynamic shapes of tuple type for inputs of dataclass type (#117917)
Summary:
In `torch.export.export(f, args, kwargs, ..., dynamic_shapes=None, ...)`, `dataclass` is an acceptable type of input (for args and kwargs). The `dynamic_shapes` for `dataclass` inputs needs to be the same `dataclass` type, which replaces each tensor attribute with the `dynamic_shapes` of the corresponding tensor. (https://github.com/pytorch/pytorch/blob/main/torch/export/dynamic_shapes.py#L375)

However, some `dataclass` may have limitations on the types of attributes (e.g., having to be tensors) such that the same `dataclass` cannot be constructed for dynamic shapes.

For an input of `dataclass` type, this change enables `dynamic_shapes` of tuple type, which specifies the dynamic shape specification for each tensor of the input in the same order as the input dataclass type's flatten_fn (https://github.com/pytorch/pytorch/blob/main/torch/utils/_pytree.py#L103)

Test Plan: buck test //caffe2/test:test_export

Differential Revision: D52932856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117917
Approved by: https://github.com/avikchaudhuri
2024-01-22 21:50:28 +00:00
4df65bf51b Optimize recursive_add_node in fx splitter (#117969)
Summary: The `FxNetAccFusionsFinder.recursive_add_node` function can run into exponential complexity when applied to an fx graph with multiple densely connected layers of nodes. Here we add a `visited` set, which reduces the worst-case complexity to linear.

In the internal MRS models with the densely connected layer structure, this fix reduces the fx split time from forever to < 100ms, hence unblocking the internal enablement.
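A self-contained sketch of the idea (illustrative only, not the actual `recursive_add_node` code): a shared `visited` set makes each node processed at most once, turning the worst case from exponential to linear.

```python
def add_upstream(graph, node, group, visited):
    """Collect `node` and everything it (transitively) depends on."""
    if node in visited:
        return
    visited.add(node)
    group.add(node)
    for inp in graph[node]:
        add_upstream(graph, inp, group, visited)

# Densely connected toy graph: every node feeds into the next two.
graph = {0: [], 1: [0], 2: [0, 1], 3: [1, 2], 4: [2, 3]}
group, visited = set(), set()
add_upstream(graph, 4, group, visited)
print(sorted(group))  # [0, 1, 2, 3, 4]
```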

P.S. As much as I want to add a unit test, I can't find any existing tests for the `_SplitterBase` infra. Happy to add one if pointed to where. Thanks!

Test Plan: CI

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D52951321](https://our.internmc.facebook.com/intern/diff/D52951321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117969
Approved by: https://github.com/oulgen, https://github.com/khabinov
2024-01-22 21:49:36 +00:00
86e8551446 [dtensor] switch softmax forward ops to OpStrategy (#117723)
**Summary**
This PR switches the softmax and log_softmax ops to use OpStrategy instead of rules. This PR also adds support when the softmax dimension is sharded -- a replication is performed before computation.

**Test**
`python test/distributed/_tensor/test_math_ops.py -k test_softmax_fwd`
`python test/distributed/_tensor/test_math_ops.py -k test_softmax_with_bwd`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117723
Approved by: https://github.com/XilunWu
2024-01-22 21:26:48 +00:00
fdac55c35d Added example regarding weight_decay distinction with per-parameter API (#117436)
Added new example and description regarding per-parameter `weight_decay` distinction for bias and non-bias terms.
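For reference, a minimal sketch of the pattern the example documents (my own version, not the exact snippet added to the docs):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

# Apply weight decay to the weights but not to the biases via per-parameter groups.
optimizer = torch.optim.SGD(
    [
        {"params": decay, "weight_decay": 1e-4},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=0.1,
)
```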

Fixes #115935

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117436
Approved by: https://github.com/janeyx99
2024-01-22 21:26:02 +00:00
b14d57ceda Replace constraints with dynamic_shapes in scripts/sijiac/prototypes and test/inductor (#117915)
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `scripts/sijiac/prototypes` and `test/inductor`.

Test Plan: buck test mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor

Differential Revision: D52931743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117915
Approved by: https://github.com/angelayi
2024-01-22 21:24:03 +00:00
95a6866220 Migrate fused optim load_state_dict to OptimizerInfo (#117890)
The new tests look like:

```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (29f899ef)]$ python test/test_optim.py -v -k test_cpu_load_state_dict
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
test_cpu_load_state_dict_impl_capturable_AdamW_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_capturable_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_capturable_SGD_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_fused_AdamW_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_fused_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_fused_SGD_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_capturable_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_cpu_load_state_dict_impl_capturable_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_cpu_load_state_dict_impl_capturable_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... skipped 'SGD does not currently support capturable'
test_cpu_load_state_dict_impl_fused_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_cpu_load_state_dict_impl_fused_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_cpu_load_state_dict_impl_fused_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 12 tests in 12.865s

OK (skipped=6)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117890
Approved by: https://github.com/albanD
2024-01-22 21:14:38 +00:00
9a2c8f644b Mark DynamicShapesExportTests::test_retracibility_dynamic_shapes as slow (#117896)
Mark `dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_retracibility_dynamic_shapes` explicitly as slow

I cannot figure out what the correct way to do this is

Tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117896
Approved by: https://github.com/huydhn
2024-01-22 21:12:03 +00:00
903e1913ff Rename unbacked SymInt prefix to u (#117859)
Currently, it conflicts with Inductor's naming convention for index
variables

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117859
Approved by: https://github.com/lezcano, https://github.com/jansel, https://github.com/avikchaudhuri
2024-01-22 20:53:47 +00:00
0f6bbb1c07 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-22 20:14:51 +00:00
c170fbd309 [dtensor] refactor redistribute and fix uneven sharding redistribution (#115525)
This PR:
- refactors the redistribute implementation logic to make it more
sound, by figuring out the transform information first and then applying
the transformations step by step; we also cache the decisions so that they
can be reused
- for uneven sharding, refactors the uneven sharding logic and uses a logical
  shape concept for each transform information to fix the uneven-sharding
  multi-mesh redistribute bug

fixes https://github.com/pytorch/pytorch/issues/115310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115525
Approved by: https://github.com/XilunWu
2024-01-22 18:57:44 +00:00
2bb2cc0b71 [tp] add clarification to doc and improve TP examples (#117618)
This PR adds a clarification about the evenly-sharded assumption in the main
TP doc and improves the TP examples by adding device mesh construction.

fixes https://github.com/pytorch/pytorch/issues/100044

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117618
Approved by: https://github.com/wconstab, https://github.com/awgu
2024-01-22 18:56:50 +00:00
01abb5af21 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10, https://github.com/malfet
2024-01-22 18:33:41 +00:00
56ef5afdee [dynamo] Add more dynamo call_methods and getattr support for Placement (#117733)
Summary:
Explained by title.
This fix is part of: https://github.com/pytorch/pytorch/issues/117670

Test Plan:
Unit test and CI
- Unit test: `buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:dtensor_compile -- test_placement_compile`

Differential Revision: D52863073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117733
Approved by: https://github.com/yanboliang
2024-01-22 18:22:54 +00:00
suo
f612e96180 [export] set proper fqn in lift constant tensor pass (#115222)
See comments: previously we were populating the lifted constant in the buffer list without an FQN, which messed up unflattening.

Differential Revision: [D50568062](https://our.internmc.facebook.com/intern/diff/D50568062/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115222
Approved by: https://github.com/tugsbayasgalan
2024-01-22 18:13:49 +00:00
80cf0ce153 Enhance torch.vmap support from inside torch.compile (#116050)
This work rewrites vmap support in torch.compile by inlining most of
the frames into the existing FX graph. It also unlocks PyTorch support
for features that were previously missing, such as keyword args.
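
A hedged sketch of what this enables (toy function, not taken from the PR's tests):

```python
import torch

def f(x):
    return x.sin() + x.cos()

# vmap is now traced and inlined into the same FX graph that torch.compile builds.
compiled = torch.compile(torch.vmap(f))
print(compiled(torch.randn(8, 4)).shape)  # torch.Size([8, 4])
```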

Fixes: https://github.com/pytorch/pytorch/issues/114306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116050
Approved by: https://github.com/zou3519
2024-01-22 17:53:45 +00:00
b2a3d6ba0d [exportdb] Remove torch/fb/exportdb (#117866)
Summary: This has already been moved to torch/_export/db

Test Plan: no tests? I think?

Reviewed By: avikchaudhuri

Differential Revision: D52875607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117866
Approved by: https://github.com/ydwu4
2024-01-22 17:41:33 +00:00
a359afbc3f Make and/or on uint8 tensors properly return 0x00 or 0x01 (#117827)
Fixes https://github.com/pytorch/pytorch/issues/117215

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117827
Approved by: https://github.com/albanD
2024-01-22 17:30:22 +00:00
c6c54df81b Fix incorrect type hints of Module.to (#117937)
Fixes #117936

While #113647 fixed the issue that `device` did not accept strings, it did not get the type hints fully correct. This PR removes the `str` variants from the type hints for the `dtype` parameter(s) in all overloads.
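
A hedged illustration of the distinction the hints now encode:

```python
import torch

m = torch.nn.Linear(2, 2)
m.to("cpu")                 # ok: device may be given as a str
m.to(dtype=torch.float64)   # ok: dtype must be a torch.dtype
# m.to(dtype="float64")     # flagged by type checkers (and a TypeError at runtime)
```
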
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117937
Approved by: https://github.com/albanD
2024-01-22 16:47:30 +00:00
60519fa3b7 change master to main in datapipes readme (#117919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117919
Approved by: https://github.com/albanD
2024-01-22 16:29:41 +00:00
86b4b27e26 [docs] start a new FSDP notes doc (#117323)
As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion.

I hope I did the RST right, I haven't done RST in a while.

- The first section is Andrew's words verbatim + formatting
- The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better.

tagging @albanD as requested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323
Approved by: https://github.com/awgu
2024-01-22 15:46:35 +00:00
8dc421a6b4 Revert "accelerate binary_cross_entropy_with_logits by using log_sigmoid operator (#115539)"
This reverts commit 03b12e56c758431df6f95075ce3a0113ccaeb3f9.

Reverted https://github.com/pytorch/pytorch/pull/115539 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/115539#issuecomment-1904157729))
2024-01-22 14:48:35 +00:00
cyy
c3780010a5 Remove calls of c10::guts::void_t (#117942)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117942
Approved by: https://github.com/Skylion007
2024-01-22 06:12:37 +00:00
3580e5d407 [executorch hash update] update the pinned executorch hash (#117953)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117953
Approved by: https://github.com/pytorchbot
2024-01-22 04:34:44 +00:00
cyy
39df084001 [Clang-tidy header][16/N] Enable clang-tidy on headers in torch/csrc/autograd (#117821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117821
Approved by: https://github.com/Skylion007
2024-01-22 00:52:56 +00:00
cyy
3baade4425 Remove calls of c10::guts::conjunction,c10::guts::disjunction,c10::guts::negation (#117926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117926
Approved by: https://github.com/Skylion007
2024-01-22 00:35:42 +00:00
02209b5880 Revert "[docs] start a new FSDP notes doc (#117323)"
This reverts commit 7f474da6bcc735cde5ef1417dc28472769307f5d.

Reverted https://github.com/pytorch/pytorch/pull/117323 on behalf of https://github.com/awgu due to broke docs ([comment](https://github.com/pytorch/pytorch/pull/117323#issuecomment-1902740900))
2024-01-21 19:47:27 +00:00
suo
c393b2f1ee [export] require Module to be passed to export (#117528)
This PR changes torch.export to require an nn.Module as input, rather than taking an arbitrary callable.

The rationale for this is that we have several invariants on the ExportedProgram that are ambiguous if the top-level object being traced is a function:
1. We "guarantee" that every call_function node has an `nn_module_stack` populated.
2. We offer ways to access the state_dict/parameters/buffers of the exported program.

We'd like torch.export to offer strong invariants—the value proposition of export is that you can trade flexibility for stronger guarantees about your model.

An alternative design would be to implicitly convert the top-level function into a module, rather than require that the user provide a module. I think that's reasonable (it's what we did in TorchScript), but in the spirit of being explicit (another design tenet of export) I avoid that here.
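
For illustration, a hedged sketch of the new requirement (the wrapper name is made up):

```python
import torch
from torch.export import export

def f(x):
    return x.sin() + 1

# export(f, (torch.randn(3),)) is rejected after this change; wrap the callable instead.
class FWrapper(torch.nn.Module):
    def forward(self, x):
        return f(x)

ep = export(FWrapper(), (torch.randn(3),))
```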

Differential Revision: [D52789321](https://our.internmc.facebook.com/intern/diff/D52789321/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117528
Approved by: https://github.com/thiagocrepaldi, https://github.com/zhxchen17, https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
2024-01-21 19:36:13 +00:00
3ee092f75b VSX: Fix overflow in complex division (#116972)
For large complex values the division produces inf or NaN values, which causes other functions,
e.g. `torch._refs.sgn` used in a test, to produce them as well.
Example:
```
$ python -c 'import torch; print(torch._refs.sgn(torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32))))'
tensor([-0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj])

$ python -c 'import torch; t = torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32)); print(t / t.abs())'
tensor([-0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj])
```
Implement the same algorithm as used in numpy and x86 (#93277)

The reason is that for a tensor with a component of `1e20`, the abs-squared value used in the division contains a term `1e20 * 1e20`, which overflows the dynamic range of float32 (~3e38) and yields an "inf", so the division yields "nan".
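
A hedged Python sketch of the scaled division (the actual fix lives in the VSX vectorized C++ code): divide through by the larger-magnitude component of the denominator so no intermediate needs |denominator|^2.

```python
def safe_complex_div(a_re, a_im, b_re, b_im):
    # (a_re + i*a_im) / (b_re + i*b_im) without forming b_re**2 + b_im**2.
    if abs(b_re) >= abs(b_im):
        r = b_im / b_re            # |r| <= 1, cannot overflow
        denom = b_re + b_im * r
        return (a_re + a_im * r) / denom, (a_im - a_re * r) / denom
    else:
        r = b_re / b_im
        denom = b_im + b_re * r
        return (a_re * r + a_im) / denom, (a_im * r - a_re) / denom

# t = -501 - 1e20j divided by |t| ~= 1e20 recovers sgn(t) ~= -5.01e-18 - 1j.
print(safe_complex_div(-501.0, -1e20, 1e20, 0.0))
```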

Output after change:
```
$ python -c 'import torch; t = torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32)); print(torch._refs.sgn(t), t.sgn(), t / t.abs())'
tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j]) tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j]) tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j])
```

CC @quickwritereader who wrote the initial code and @VitalyFedyunin who was involved in the initial review and @lezcano who reviewed #93277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116972
Approved by: https://github.com/lezcano
2024-01-21 19:21:13 +00:00
afabed6ae6 [inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298)
fixes #116715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117298
Approved by: https://github.com/eellison
2024-01-21 18:47:01 +00:00
41556324a9 [cpp_wrapper] Change CppWrapperCodeCache to use faster python binding (#117693)
Summary: Using faster binding following https://github.com/pytorch/pytorch/pull/117500. torch.utils.cpp_extension.load_inline builds a lot of things and is very slow. With this change, later we can further reduce the included header files using the ABI-compatible mode and thus further speed up the compilation.

Result:
```
python test/inductor/test_cuda_cpp_wrapper.py -k test_relu_cuda_cuda_wrapper

Before: Ran 1 test in 32.843s
After: Ran 1 test in 26.229s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117693
Approved by: https://github.com/jansel
2024-01-21 16:07:52 +00:00
7f474da6bc [docs] start a new FSDP notes doc (#117323)
As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion.

I hope I did the RST right, I haven't done RST in a while.

- The first section is Andrew's words verbatim + formatting
- The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better.

tagging @albanD as requested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323
Approved by: https://github.com/albanD, https://github.com/awgu
2024-01-21 15:11:24 +00:00
b50ccad86e [BE]: Add type alias typing annotation to prims_common (#117928)
Explicitly mark unions assignments as type aliases to make it easier for static type checkers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117928
Approved by: https://github.com/ezyang
2024-01-21 14:26:59 +00:00
df4e3d9d08 Document OpsHandler protocol (#117790)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117790
Approved by: https://github.com/jansel
2024-01-21 07:20:53 +00:00
eqy
8f7caaee67 [cuDNN] Fix cuDNN version parsing against future versions of cuDNN (#117908)
Remove the unnecessary dependence on assuming a fixed number of digits per field

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117908
Approved by: https://github.com/cpuhrsch
2024-01-21 05:00:01 +00:00
fbd1d567ed [inductor] Fix CPP wrapper codegen for ExternKernel args (#117931)
Summary: We see IR nodes `repr`-ed directly in the CPP wrapper codegen. Recently, this issue has been fixed for the Python wrapper codegen in D52899373 (https://github.com/pytorch/pytorch/pull/117838). Here we extend the fix to CPP wrapper codegen / AOTInductor.

Test Plan:
New unit tests. In OSS:

```
python test/inductor/test_aot_inductor.py -k test_triton_kernel_multi_output_arg
```

```
python test/inductor/test_aot_inductor.py -k test_triton_kernel_extern_kernel_arg
```

Differential Revision: D52936248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117931
Approved by: https://github.com/oulgen, https://github.com/chenyang78, https://github.com/desertfire
2024-01-21 04:58:56 +00:00
fa1e89b337 Ban mutation on dropout outputs in export (#117879)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117879
Approved by: https://github.com/ezyang
ghstack dependencies: #117811
2024-01-21 04:53:40 +00:00
949a76a7f0 [executorch hash update] update the pinned executorch hash (#117899)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117899
Approved by: https://github.com/pytorchbot
2024-01-21 04:19:27 +00:00
suo
2ae66ddba0 [export] fix test ownership (#117886)
as title

Differential Revision: [D52924188](https://our.internmc.facebook.com/intern/diff/D52924188/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117886
Approved by: https://github.com/ydwu4
2024-01-21 01:18:16 +00:00
bad5e1e0bb [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op hardswish (#117489)
**Summary**
Enable lowering of the fusion pattern `QConv2d -> hardswish` to `hardswish` as a `QConv2d` post operator.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardswish_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_hardswish
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117489
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #117487, #117488
2024-01-21 00:01:32 +00:00
05ef2030ea [c10d] Add logs for NCCL Comm Abort call (#117868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117868
Approved by: https://github.com/kwen2501
2024-01-20 21:34:13 +00:00
2de3474711 Simplify kwargs propagation in __call__. (#117880)
In case no keyword arguments are passed, `**kwargs` would expand just fine without the need for extra overhead of `or {}`. In addition to reducing boilerplate, this also comes with a small perf improvement:
```
In [1]: def null(*args, **kwargs):
   ...:     pass
   ...:

In [2]: def call1(*args, **kwargs):
   ...:     return null(*args, **(kwargs or {}))
   ...:

In [3]: def call2(*args, **kwargs):
   ...:     return null(*args, **kwargs)
   ...:

In [4]: %timeit call1()
145 ns ± 2.07 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: %timeit call2()
118 ns ± 2.14 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [6]: %timeit call1()
147 ns ± 6.19 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [7]: %timeit call2()
117 ns ± 0.846 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117880
Approved by: https://github.com/Skylion007
2024-01-20 19:29:35 +00:00
50633620b2 sympy.Symbol is a subclass of sympy.Expr (#117857)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117857
Approved by: https://github.com/peterbell10
2024-01-20 18:09:44 +00:00
af831415a8 fix cpp backend relu codegen with inf input (#117622)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/117544.
For the CPP backend, `ReLU` currently code-gens to `f"{x} * ({x}>0)"` in `CppOverrides`. The result mismatches eager when the input contains `inf`, since `inf * 0` results in `nan` per [IEEE_754](https://en.wikipedia.org/wiki/IEEE_754). Change the codegen to `f"std::max({x}, decltype({x})(0))"` to align with the eager implementation in 1deb75b584/aten/src/ATen/native/cpu/TensorCompareKernel.cpp (L392)
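
A hedged scalar illustration of the mismatch, mirroring the two codegen forms with plain Python floats:

```python
import math

x = -math.inf
old = x * (1.0 if x > 0 else 0.0)  # mirrors "x * (x > 0)": -inf * 0 -> nan
new = max(x, 0.0)                  # mirrors "std::max(x, 0)": 0.0, matching eager relu
print(old, new)                    # nan 0.0
```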

**TestPlan**
```
python -u -m pytest test_cpu_repro.py -k test_relu_with_inf_value
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117622
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-01-20 13:28:03 +00:00
4bf481fb1b Fix inductor pattern match error for qlinear with bmm (#117633)
Summary:

PR https://github.com/pytorch/pytorch/pull/116599 converts `bmm` to `qlinear` when the input dim exceeds 2 and the input is not contiguous. However, there is an error when checking the weight size because the permute op is not considered.

Test Plan:
python test_mkldnn_pattern_matcher.py -k test_qlinear_input_dim_exceeds_2_and_not_contiguous

Fixes: -

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117633
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-01-20 12:26:26 +00:00
0ae952db76 enable mkldnn bf32 matmul (#116015)
### Testing
FP32 matmul vs. mkldnn BF32 matmul  on SPR

single core:

Input | BF32   / ms | FP32  /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 32.842 | 38.279 | 1.165
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 38.590 | 73.967 | 1.917
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 18456.267 | 74588.002 | 4.041

56 cores:
Input | BF32   / ms | FP32 /   ms | Speed up
-- | -- | -- | --
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 1199.400 | 1715.548 | 1.430
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True |1129.204 | 1708.912 |  1.513
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 3655.915  | 7992.877 | 2.186
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 3707.993 |  8026.191 | 2.165
Batch: 768, M: 128, N: 64, K: 128  | 1296.419 | 1308.411 | 1.009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116015
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-20 09:30:23 +00:00
aaae2d8bb6 Add compilable and capturable foreach adamax with tests (#117835)
Based off of https://github.com/pytorch/pytorch/pull/110345

Fixes https://github.com/pytorch/pytorch/issues/117812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117835
Approved by: https://github.com/janeyx99
2024-01-20 05:29:05 +00:00
suo
e732adf0a7 [pytree] add access api (#117771)
This PR introduces an API to use KeyPaths to actually access values on pytrees.

Differential Revision: [D52881260](https://our.internmc.facebook.com/intern/diff/D52881260/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117771
Approved by: https://github.com/zou3519, https://github.com/XuehaiPan
2024-01-20 04:03:26 +00:00
a1b3b5748f [Pytoch][Vulkan] Create context for conv1d (#117780)
Summary:
`conv1d` has two arguments, `weight` and `bias`, which are stored as constant tensors on the CPU and transferred to the GPU at every inference call. We create a context for this operator to avoid the repeated transfer. Specifically, we
- created `Conv1dPackedContext`, `create_conv1d_context` and `run_conv1d_context` in `Convolution.h` and `Convolution.cpp`
- registered them in `Register.cpp`
- rewrote the graph representation of the op in `vulkan_rewrite.cpp`

Test Plan:
## Numerical test
```
[luwei@82308.od /data/sandcastle/boxes/fbsource (8a8d911dc)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
Buck UI: https://www.internalfb.com/buck2/7760800b-fd75-479a-9368-be5fcd5a7fef
Network: Up: 0B  Down: 0B
Jobs completed: 4. Time elapsed: 0.6s.
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.conv1d_simple
[       OK ] VulkanAPITest.conv1d_simple (159 ms)
[ RUN      ] VulkanAPITest.conv1d
[       OK ] VulkanAPITest.conv1d (57 ms)
[----------] 2 tests from VulkanAPITest (217 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (217 ms total)
[  PASSED  ] 2 tests.
```

Full test result in P1053644934, summary as below
```
[----------] 419 tests from VulkanAPITest (28080 ms total)
[----------] Global test environment tear-down
[==========] 419 tests from 1 test suite ran. (28080 ms total)
[  PASSED  ] 418 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
## Graph representation comparison
We created a model using `conv1d` and traced it as below
```
# Define a simple model that uses conv1d
import torch
import torch.nn as nn

class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1d = nn.Conv1d(16, 33, 3)

    def forward(self, x):
        return self.conv1d(x)

# Create an instance of the model
model = MyModel()

# Create a dummy input tensor for tracing
input_tensor = torch.randn(20, 16, 50)

# Use torch.jit.trace to trace the model and generate a graph
traced_model = torch.jit.trace(model, input_tensor)
```
Then we converted the traced model to Vulkan backend using `optimize_for_mobile`
```
from torch.utils import mobile_optimizer

vulkan_model = mobile_optimizer.optimize_for_mobile(
    traced_model, backend="vulkan", preserved_methods=to_preserve
)
```
Next we can print the graph of the `vulkan_model` via `print(vulkan_model.graph)`
- before this diff: `conv1d` was used
```
graph(%self.1 : __torch__.___torch_mangle_16.MyModel,
      %x : Tensor):
  %60 : Device = prim::Constant[value="cpu"]()
  %self.conv1d.bias : Float(33, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
  %37 : bool = prim::Constant[value=0]()
  %36 : NoneType = prim::Constant()
  %59 : Device = prim::Constant[value="vulkan"]()
  %self.conv1d.weight : Float(33, 16, 3, strides=[48, 3, 1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
  %7 : int = prim::Constant[value=1](), scope: __module.conv1d # /mnt/xarfuse/uid-23453/243f3953-seed-nspid4026532834_cgpid7972545-ns-4026532831/torch/nn/modules/conv.py:306:0
  %18 : int[] = prim::Constant[value=[1]]()
  %19 : int[] = prim::Constant[value=[0]]()
  %39 : Tensor = aten::to(%x, %59, %36, %37, %37)
  %20 : Tensor = aten::conv1d(%39, %self.conv1d.weight, %self.conv1d.bias, %18, %19, %18, %7)
  %58 : Tensor = aten::to(%20, %60, %36, %37, %37)
  return (%58)
```
- after this diff: `conv1d` was replaced with `run_conv1d_context`
```
graph(%self.1 : __torch__.___torch_mangle_6.MyModel,
      %x : Tensor):
  %85 : Device = prim::Constant[value="cpu"]()
  %51 : bool = prim::Constant[value=0]()
  %50 : NoneType = prim::Constant()
  %84 : Device = prim::Constant[value="vulkan"]()
  %53 : Tensor = aten::to(%x, %84, %50, %51, %51)
  %prepack_folding_forward._jit_pass_packed_weight_0 : __torch__.torch.classes.vulkan.Conv1dPackedContext = prim::GetAttr[name="prepack_folding_forward._jit_pass_packed_weight_0"](%self.1)
  %22 : Tensor = vulkan_prepack::run_conv1d_context(%53, %prepack_folding_forward._jit_pass_packed_weight_0)
  %83 : Tensor = aten::to(%22, %85, %50, %51, %51)
  return (%83)
```

Differential Revision: D52865379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117780
Approved by: https://github.com/yipjustin
2024-01-20 02:35:32 +00:00
10923f8720 Revert "[inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298)"
This reverts commit 1967394690f144a7ba1717eccec977286cafe2da.

Reverted https://github.com/pytorch/pytorch/pull/117298 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it is failing in MacOS 1967394690, may be due to a landrace ([comment](https://github.com/pytorch/pytorch/pull/117298#issuecomment-1901594120))
2024-01-20 02:14:58 +00:00
94f0472579 [Quant] [PT2] Add Hardswish into X86InductorQuantizer Conv2d Unary Annotation (#117488)
**Summary**
Add `hardswish`  into X86InductorQuantizer Conv2d Unary Annotation

**TestPlan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_unary
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_unary
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117488
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #117487
2024-01-20 01:37:33 +00:00
1967394690 [inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298)
fixes #116715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117298
Approved by: https://github.com/eellison
2024-01-20 01:37:28 +00:00
181e6dafd0 [MPS] Fix linear for 5D tensors (#117837)
torch.nn.Linear crashes with internal assert if invoked with 5D tensors,
due to the bug in MPS framework, i.e. invoking
```swift
import MetalPerformanceShadersGraph

let graph = MPSGraph()
let x = graph.constant(1, shape: [2, 1, 2, 1, 2], dataType: .float32)
let y = graph.constant(1, shape: [2, 3], dataType: .float32)
let z = graph.matrixMultiplication(primary: x, secondary: y, name: nil)
let device = MTLCreateSystemDefaultDevice()!
let buf = device.makeBuffer(length: 48)!
let td = MPSGraphTensorData(buf, shape: [2, 1, 2, 1, 3], dataType: .int32)
let cmdBuf = MPSCommandBuffer(from: device.makeCommandQueue()!)
graph.encode(to: cmdBuf, feeds: [:], targetOperations: nil, resultsDictionary: [z:td], executionDescriptor: nil)
cmdBuf.commit()
```
crashes with
```
AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayIdentity.mm:813: failed assertion `New volume: 4 should match old volume: 8 [reshapeWithCommandBuffer] MPSNDArrayIdentity.'
zsh: abort      ./build/matmul
```

Work around the issue by flattening the forward and backward tensors if the number of dimensions is greater than 4.
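
A hedged sketch of the flattening idea in plain PyTorch (the actual workaround lives in the MPS backend):

```python
import torch

def linear_flattened(x: torch.Tensor, weight: torch.Tensor, bias=None) -> torch.Tensor:
    lead = x.shape[:-1]  # e.g. (2, 1, 2, 1) for a 5D input
    out = torch.nn.functional.linear(x.reshape(-1, x.shape[-1]), weight, bias)
    return out.reshape(*lead, weight.shape[0])
```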

Add regression tests to Linear opinfo samples

Fixes https://github.com/pytorch/pytorch/issues/114942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117837
Approved by: https://github.com/janeyx99
2024-01-20 01:19:19 +00:00
d4cc1c5bff Add new pattern matchers for SDPA (#113004)
Add two new pattern matchers to enable SDPA in more models.

- Pattern 14: `BertLarge`
- Pattern 15: `DistilBert`

Perf on SPR:

<img width="1007" alt="image" src="https://github.com/pytorch/pytorch/assets/23010269/f0813343-c9e8-4fd4-9fa0-d0e67e1d57af">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113004
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
2024-01-20 00:46:46 +00:00
8f91a53e9a Add environment for close-nonexistent-disable-issues (#117885)
Made a new environment called rockset-read-only that has a read only api key for rockset
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117885
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-01-19 23:45:46 +00:00
3c1498d117 [ONNX] Add bfloat16 support for scaled_dot_product_attention (#117878)
Using ONNX opset 14, the aten scaled_dot_product_attention operator can be implemented with bfloat16 support because Add-14 supports bfloat16.

This PR simply adds bfloat16 to the list of supported types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117878
Approved by: https://github.com/BowenBao
2024-01-19 23:24:44 +00:00
f684e44fd6 Revert "Reduce pytest prints (#117069)"
This reverts commit 40dbd567e04483c671f9c897171bf9d1e7162b68.

Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to need to handle timeout expired better ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1901270953))
2024-01-19 23:07:51 +00:00
5538b37a06 [ez] Provide a slightly better error message if process times out (#117865)
Just a slightly clearer error message
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117865
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-19 22:58:00 +00:00
29f899ef87 [pytorch][vulkan] cumsum dim <= 1 (#117580)
Summary:
Following the implementation of Softmax, striding over the texture differently based on the desired dimension.

Softmax performs a similar operation to cumsum (generally called a "scan"), iterating over all items in a dimension, but cumsum only needs to iterate once to collate the sum, whereas softmax needs to iterate multiple times to collect the max and the denominator for the final calculation.

Similar to the softmax implementation there are likely opportunities to optimize, but this gets all dims < 4 functional first.

Test Plan:
`LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*cumsum*"`:
```
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *cumsum*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.cumsum_1d
[       OK ] VulkanAPITest.cumsum_1d (93 ms)
[ RUN      ] VulkanAPITest.cumsum_2d
[       OK ] VulkanAPITest.cumsum_2d (74 ms)
[ RUN      ] VulkanAPITest.cumsum_3d
[       OK ] VulkanAPITest.cumsum_3d (105 ms)
[ RUN      ] VulkanAPITest.cumsum_4d
[       OK ] VulkanAPITest.cumsum_4d (73 ms)
[----------] 4 tests from VulkanAPITest (346 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (346 ms total)
[  PASSED  ] 4 tests.
```

Differential Revision: D52814000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117580
Approved by: https://github.com/yipjustin
2024-01-19 21:52:48 +00:00
dd6c0f6844 Trim Dynamo shards 7->3 (#117869)
We added all of the tests we wanted for now. These fit comfortably in 3
shards (the total test time previously was 0.5 hours on each shard).
Going to decrease the number of shards to 3 so that it's less unwieldy
to work with.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117869
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-01-19 21:48:35 +00:00
365c7a292f Log stack trace of mutated idx (#117720)
Log stack trace of mutated tensor that prevents cudagraphs. Will do some subsequent refactors when all of the checks are moved to this fashion.

Differential Revision: [D52896588](https://our.internmc.facebook.com/intern/diff/D52896588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117720
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117823
2024-01-19 21:38:44 +00:00
6c99bf0766 move disable_cudagraph_reason disabling after codecache is accessed (#117823)
Disabling cudagraphs has to happen after the codecache is loaded, or it won't properly be disabled on a cache hit.

Differential Revision: [D52896590](https://our.internmc.facebook.com/intern/diff/D52896590)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117823
Approved by: https://github.com/bdhirsh, https://github.com/masnesral
2024-01-19 21:33:25 +00:00
c4eab49ded [MacOS] Embed libomp.dylib/omp.h into MacOS wheel (#114816)
To keep them on par with what we do on x86, and `omp.h` is included because it is needed for `torch.compile` on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114816
Approved by: https://github.com/atalman
2024-01-19 21:21:33 +00:00
414a1fd29f [PyTorch] Add IValue::IValue(std::vector<T>&&) ctors (#117769)
There are two IValue constructors that take `const std::vector<T>&`. Add moving variants to allow callers to save on reference counting.

Differential Revision: [D52879065](https://our.internmc.facebook.com/intern/diff/D52879065/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117769
Approved by: https://github.com/suo, https://github.com/Skylion007
2024-01-19 21:11:11 +00:00
d45fd68012 OIDC for update_pytorch_labels (#117876)
Companion: https://github.com/pytorch-labs/pytorch-gha-infra/pull/339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117876
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-01-19 21:08:28 +00:00
ad3d41692e [PyTorch] return decltype(auto) from getItem (#117569)
This allows getItem to take advantage of the nicer (sometimes-const-reference) return type from `List::get() const` added in the previous diff.

Differential Revision: [D52809097](https://our.internmc.facebook.com/intern/diff/D52809097/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117569
Approved by: https://github.com/iseeyuan, https://github.com/malfet
ghstack dependencies: #117568
2024-01-19 21:04:53 +00:00
632fcc4831 [PyTorch] Make List::get() const match List::operator[]() const (#117568)
As far as I can tell, `get()` is supposed (and documented) to be the same as a const `operator[]`. We have an efficient implementation for `operator[]`. Let's use it for `get()`.

Differential Revision: [D52809098](https://our.internmc.facebook.com/intern/diff/D52809098/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117568
Approved by: https://github.com/suo, https://github.com/malfet
2024-01-19 21:04:53 +00:00
15d568d621 [Inductor] Use codegen reference for buffer to string (#117838)
Summary: The added test case ends up emitting an inductor IR node as the buffer string; let's properly emit the buffer name instead.

Test Plan: added new test

Differential Revision: D52899373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117838
Approved by: https://github.com/aakhundov
2024-01-19 20:18:53 +00:00
1f5c27eb18 cleanup code comments _compute_numerical_gradient (#117484)
cleanup code comments for ` _compute_numerical_gradient`:
- reference parameters passed
- indicate that central difference approximation is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117484
Approved by: https://github.com/soulitzer
2024-01-19 18:51:52 +00:00
ab216bbaeb cleanup code comments analytical Jacobian as vjp projection (#117483)
Cleanup code comments for `_compute_analytical_jacobian_rows` to make clear Jacobian is computed by standard basis vector projections using the vector-Jacobian-product operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117483
Approved by: https://github.com/soulitzer
2024-01-19 18:50:26 +00:00
40dbd567e0 Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes). Previously it would only run the entire test file 3 times, so if a test before yours segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-19 18:42:12 +00:00
2f4456a73e Remove xfail on test_make_weak_keyed_dict_from_weak_keyed_dict (#117848)
Based on the logs, this test has been consistently passing, so we remove
the xfail.

Fixes https://github.com/pytorch/pytorch/issues/116765
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117848
Approved by: https://github.com/Skylion007
ghstack dependencies: #117765
2024-01-19 18:05:30 +00:00
b637fdc8b3 Revert "additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)"
This reverts commit 74e13624998f2a4de29bce73a949d7f0339ec04e.

Reverted https://github.com/pytorch/pytorch/pull/115214 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/115214#issuecomment-1900815152))
2024-01-19 17:35:04 +00:00
f316c35a34 [export] Support preserving submodule callling convention in non-strict export (#117796)
Summary: Title

Test Plan: CI

Reviewed By: zhxchen17

Differential Revision: D52889236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117796
Approved by: https://github.com/angelayi
2024-01-19 17:16:45 +00:00
249a226113 [export] Error on not pytree-flattened nodes (#117598)
Attempts to make the input/output mismatch error better by first checking if the inputs/outputs are able to be pytree-flattened into supported types (tensors, symints, ...). So if a user passes in some data structure which does not have a pytree flatten registration, this will error with the message "It looks like one of the inputs is with type CustomType is not supported or pytree flatten-able.... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".

The check inside of produce_matching should now only error if something unexpected happens (dynamo accidentally adds an input or removes an output), and should be considered an internal error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri, https://github.com/BowenBao
2024-01-19 17:13:39 +00:00
6c5c2121b1 Run some OOMing tests serially (#117759)
They were disabled due to being flaky due to OOMs but got renamed.  Seeing if running serially helps

I kind of want to keep this test disabled since the rest of the file is probably fine...

Issues in question: #113132 #113136 #113140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117759
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-19 16:45:35 +00:00
de25718300 [release] Docker Release build trigger on rc for testing (#117849)
Enable triggering the Docker release builds on an RC. Use the test channel in this case. Hence the following logic is applied:
1. On RC trigger use test channel and upload to pytorch-test : https://github.com/orgs/pytorch/packages/container/package/pytorch-test
2. On Final RC use prod channel and upload to pytorch : https://github.com/orgs/pytorch/packages/container/package/pytorch
3. Nightly: https://github.com/orgs/pytorch/packages/container/package/pytorch-nightly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117849
Approved by: https://github.com/malfet
2024-01-19 15:01:46 +00:00
03b12e56c7 accelerate binary_cross_entropy_with_logits by using log_sigmoid operator (#115539)
When I was reimplementing BCEWithLogits, I found that the `log_sigmoid` operator could accelerate the function.

Simple benchmark on AMD 3600 CPU Ubuntu 22.04:
|avg time (ms)|with `pos_weight`|no `pos_weight`|
|-|-|-|
|original|1986|1658|
|this PR|1295|995|

35-40% faster. This likely benefits from the `log_sigmoid` vectorization code.

A CUDA benchmark was not obtained, but I believe CUDA can also benefit from reducing kernel launches, as https://github.com/pytorch/pytorch/pull/11054#issuecomment-442233714 and https://github.com/pytorch/pytorch/pull/78267#issue-1248398454 mention.
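
A hedged sketch of the identity being exploited (without `pos_weight`, using plain PyTorch ops):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1000)  # logits
z = torch.rand(1000)   # targets in [0, 1]

ref = F.binary_cross_entropy_with_logits(x, z)
via_log_sigmoid = -(z * F.logsigmoid(x) + (1 - z) * F.logsigmoid(-x)).mean()
print(torch.allclose(ref, via_log_sigmoid, atol=1e-5))  # True, up to float32 rounding
```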

The simple benchmark cpp file:
[demo.txt](https://github.com/pytorch/pytorch/files/13635355/demo.txt)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115539
Approved by: https://github.com/lezcano
2024-01-19 14:56:43 +00:00
98a044d33e [CI] Build M1 conda binaries on M1 runners (#117801)
As usual, almost no work on PyTorch side, all changes are on the builder end, namely:
- 8b67d32929 - depend on `blas * mkl` only on x86 machines
- eb78393f1e - install arm64 conda when running on Apple Silicon
- 0d3aea4ee0 - constrain llvmdev-9 to x86 machines only
- 6c6a33b271 - set correct DEVELOPER_DIR path

TODO:
 - We should auto-detect this `DEVELOPER_DIR` via `xcode-select`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117801
Approved by: https://github.com/atalman
2024-01-19 14:31:12 +00:00
17c5f69852 Run test_jit with PYTORCH_TEST_WITH_DYNAMO=1 in CI (#117765)
Gets rid of all the single test excludes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117765
Approved by: https://github.com/voznesenskym
2024-01-19 13:42:41 +00:00
f115f1cde1 [Quant] Enable QConv2d with hardswish post op (#117487)
**Summary**
Enable QConv2d implementation with post op `hardswish`

**Test Plan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_hardswish_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117487
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-01-19 13:24:06 +00:00
cyy
5756b7a08e Remove math_compat.h (#117828)
Follows #116167
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117828
Approved by: https://github.com/malfet
2024-01-19 12:56:17 +00:00
f2d6e99f8d Workaround a cusolver bug on CUDA < 12.1 in triangular_solve (#117636)
Fix https://github.com/pytorch/pytorch/issues/79191

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117636
Approved by: https://github.com/malfet
2024-01-19 12:42:37 +00:00
suo
4057d005ff Initial torchbind support in PT2 (#117697)
This PR adds the bare minimum functionality to get torchbind working in an e2e testable way on PT2.

It implements:
* ProxyTensor support
* Simple torch.export support (proxytensor-only path, e.g. non-strict).
* add some tests exercising the path.

Because all this is not fully baked, I hide the functionality behind a feature flag (`enable_torchbind_tracing()`) so it does not affect regular users for now.

Still on the agenda:
* Dynamo support
* Actual FakeMode support
* Mutability support

Hoping to get this first bit in as a standalone, as it will unblock some more extensive experimentation/testing going on internally.

Differential Revision: [D51825372](https://our.internmc.facebook.com/intern/diff/D51825372/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117697
Approved by: https://github.com/SherlockNoMad
2024-01-19 06:28:20 +00:00
c51a4e64c0 Add support for compiling SDPAParams (#117207)
Allows us to `allow_in_graph` this `torch._C` struct for supporting scaled dot product attention.
Helps unblock https://github.com/pytorch/pytorch/pull/116071

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117207
Approved by: https://github.com/voznesenskym
2024-01-19 05:51:15 +00:00
8524fa566c [executorch hash update] update the pinned executorch hash (#117593)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117593
Approved by: https://github.com/pytorchbot
2024-01-19 04:34:12 +00:00
f302a0d380 Re-enable SGD (#117434)
Re-enables the SGD optimizer now that compile times are more reasonable. [Benchmark run](https://github.com/pytorch/pytorch/actions/runs/7511073761)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117434
Approved by: https://github.com/anijain2305, https://github.com/janeyx99
2024-01-19 04:28:50 +00:00
924ed91612 Move getDurationFromFirstEvent to USE_C10D_NCCL ifdef (#117738)
Fixes #117517

Try to move nccl related function *getDurationFromFirstEvent* to USE_C10D_NCCL ifdef (Related to https://github.com/pytorch/pytorch/issues/114575)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117738
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-01-19 04:28:47 +00:00
cyy
38d9b3d937 Remove use of math_compat.h (#116167)
Because  ANDROID>=21 is assumed in CI tests, it is time to remove old workarounds. math_compat.h contains solely wrapper math functions for ANDROID, so we can remove its usage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116167
Approved by: https://github.com/ezyang
2024-01-19 03:37:55 +00:00
cyy
5c17f66a3d [Exception] [5/N] Remove torch::IndexError (#117713)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117713
Approved by: https://github.com/ezyang
2024-01-19 03:36:15 +00:00
3131e0460e Changed return type of randint64_cpu to int64_t to prevent codegen issues. (#117443)

Fixes #117435.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117443
Approved by: https://github.com/ezyang
2024-01-19 03:23:20 +00:00
1adf77ce5e Don't use functional tensor inside _unstack_pytree (#117811)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117811
Approved by: https://github.com/ydwu4
2024-01-19 03:15:06 +00:00
c16e6e4cf7 [ProcessGroup] Make watchdog check work queue more frequently (#117297)
Today watchdog's sleep interval is 1s. That's a bit long compared to modern GPU link's (or network link's) speed.

Take DDP and Ampere for example:

DDP's bucket size = 25 MB
Ampere's NVLink speed = 250 GB/s

25 MB / 250 GB/s = 0.1 ms.
So we are updating the interval to 100 ms for now.

Update:
The exact transfer time is 0.1 ms, so even 100 ms is conservative, but let's see how it goes before making the checking more aggressive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117297
Approved by: https://github.com/fduwjj
2024-01-19 02:33:31 +00:00
aadbaf8e2d [EZ][BE] Move build_android_gradle.sh (#117795)
From `.circleci/scripts` to `scripts`, next to another `build_android.sh`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117795
Approved by: https://github.com/huydhn
2024-01-19 02:14:28 +00:00
d618e86328 [ONNX] Bump transformers in CI test (#117703)
Fixes #117660

(1) skip dynamic tests for exported program in `test_fx_to_onnx_onnxruntime.py`, as they are not expected to pass anyway.
(2) Move dolly model to runtime, since it's working in exporting, but it is blocked by non-persistent buffers as well.
(3) openai whisper has changed/regressed due to modeling modifications.
(4) Replace OpenLlama with Llama, because OpenLlama is deprecated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117703
Approved by: https://github.com/thiagocrepaldi
2024-01-19 02:10:10 +00:00
74e1362499 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10
2024-01-19 00:50:18 +00:00
c317bf2c2b [HigherOrderOp][BE] factor out merge_graph_inputs (#116912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116912
Approved by: https://github.com/zou3519
ghstack dependencies: #116721, #116823
2024-01-19 00:35:26 +00:00
c6028f8f73 [HigherOrderOp] Add while_loop support (#116823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116823
Approved by: https://github.com/zou3519
ghstack dependencies: #116721
2024-01-19 00:35:26 +00:00
113f0749f5 [HigherOrderOp] move some common utils in cond to utils.py (#116721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116721
Approved by: https://github.com/zou3519
2024-01-19 00:35:26 +00:00
77cfacab55 Revert "Reduce pytest prints (#117069)"
This reverts commit 2f89ef23007626aca1a577a4a388e315253c834f.

Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to distributed tests are not printing items ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1899433816))
2024-01-19 00:27:03 +00:00
a468b9fbdf Update xla.txt to fix missing commit (#117708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117708
Approved by: https://github.com/masnesral, https://github.com/huydhn
2024-01-18 23:51:51 +00:00
2f84a9d37c Revert "[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)"
This reverts commit 5aa92b5090e3db4a053548a3f360dd06c16df2f7.

Reverted https://github.com/pytorch/pytorch/pull/115663 on behalf of https://github.com/PaliC due to Unfortunately, this pr breaks cuda builds internally ([comment](https://github.com/pytorch/pytorch/pull/115663#issuecomment-1899388813))
2024-01-18 23:40:30 +00:00
2f89ef2300 Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes). Previously it would only run the entire test file 3 times, so if a test before yours segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-18 23:30:59 +00:00
e432b2e607 [inductor] multi-kernel support (#103469)
For a persistent reduction, we generate 2 flavors of 'equivalent' kernels at the same time
- persistent reduction
- regular reduction

A MultiKernel wraps these 2 kernels and picks the one with better performance at runtime.

Here I talk more about implementation details:
- Inductor maintains state while generating kernels, e.g. the wrapper code. After we generate code for one kernel, we need to restore the inductor state before we can generate the counterpart.

***There is one thing I need some comments from others***:
There is one tricky thing about kernel arguments. In general, inductor removes a buffer from the argument list if it's only used inside the kernel. But somehow a buffer removed by the persistent reduction kernel may still be kept by the regular (non-persistent) reduction kernel because of some CSE invalidation rule. My current implementation avoids removing buffers if multi_kernel is enabled. This makes sure both flavors of reduction have a consistent argument list. Another idea is to generate the multi-kernel definition with the union of arguments from both sub-kernels and let each sub-kernel pick the subset of arguments it wants, but that would make the code-gen of multi-kernel much more complex.

I'm not sure if there is an easy and clean way to resolve this.
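
A hedged Python sketch of the runtime dispatch idea (not Inductor's actual MultiKernel code): benchmark both generated kernels on the first call, then always use the faster one.

```python
import time

class MultiKernelSketch:
    def __init__(self, persistent_kernel, regular_kernel):
        self.kernels = [persistent_kernel, regular_kernel]
        self.picked = None

    def __call__(self, *args):
        if self.picked is None:
            # One-off benchmark; the real implementation times GPU kernels properly.
            timings = []
            for kernel in self.kernels:
                start = time.perf_counter()
                kernel(*args)
                timings.append(time.perf_counter() - start)
            self.picked = self.kernels[timings.index(min(timings))]
        return self.picked(*args)
```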

Testing command:
```

TORCHINDUCTOR_MULTI_KERNEL=1 TORCH_LOGS=+torch._inductor.graph TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103469
Approved by: https://github.com/jansel
2024-01-18 23:16:31 +00:00
fee96adde7 [EZ] Update weekly.yml to use actions from test-infra (#117775)
It was deleted from `pytorch/pytorch` by https://github.com/pytorch/pytorch/pull/117506

Thanks [BowenBao](https://github.com/BowenBao) for alerting
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117775
Approved by: https://github.com/huydhn
2024-01-18 22:58:32 +00:00
6d9432c44c [ONNX][dynamo_export] Decomposition skips using custom operator (#117314)
A context manager that disables the decomposition of certain ops during dynamo tracing.

The approach is to temporarily hijack the operator callable with PT2 custom operator.
The custom operator will not be decomposed and will show up as a single node to be exported to ONNX.

For the time being the decomposition of these ops is otherwise unavoidable.

https://github.com/pytorch/pytorch/issues/116684
https://github.com/pytorch/pytorch/issues/115883

This solution will no longer be required once the issue is resolved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117314
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-01-18 22:19:28 +00:00
92d718aed1 [export] Add lifted constant obj to input (#116985)
Test Plan: wip

Differential Revision: D52556070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116985
Approved by: https://github.com/suo
2024-01-18 22:10:53 +00:00
eba5d5485d [dynamo] make ConstantSource propagate through built-in ops for TensorVariable (#117704)
Fixes #117685.

This PR only makes ConstantSource preserved for built-in ops when we find all the inputs are either constant tensors or python constants.

It doesn't fundamentally solve the problem of preserving ConstantSource information through all operators that can potentially be constant folded.

For the following code in the issue:
```
class Bob(torch.nn.Module):
    def __init__(self, p, val) -> None:
        super().__init__()
        self.p = p
        self.y = torch.nn.Parameter(torch.tensor(val))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # This only looks dynamic but it's actually a constant value
        if get_y(self.y) < self.p:
            return torch.cat([x,x])
        else:
            return x
```
The graph exported looks like following:
```python
class GraphModule(torch.nn.Module):
    def forward(self, x):
        arg0: "f32[s0, s1]";

        arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
        l_x_ = arg0

        # File: /home/yidi/local/pytorch/test/dynamo/test_export.py:1498 in forward, code: return torch.cat([x, x])
        cat = torch.cat([l_x_, l_x_]);  l_x_ = None
        return pytree.tree_unflatten([cat], self._out_spec)
```

Test Plan:
Added a new test for the given repro.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117704
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-01-18 20:18:34 +00:00
1462d72904 Speed up triu_tril_kernel (#115013)
1. Batch Processing: Enhance kernel efficiency by having each thread handle multiple elements, reducing the frequency of offset calculations.
2. Inplace Operation Optimization: For inplace functions, eliminate unnecessary copying to enhance performance.

Up to 5x speed up compared to torch 2.1.1

# Benchmark
Test on NVIDIA RTX 3080, WSL, CUDA 12.1. Peak performance is recorded.

| function | dtype | shape | k | torch 2.1.1 (ms) | this PR (ms) | speed up |
| -- | -- | -- | -- | -- | -- | -- |
| **various dtype** | | | | | | |
| triu_ | int8 | [1, 3072, 3072] | 0 | 0.107 | 0.028 | 3.76x |
| triu_ | float16 | [1, 3072, 3072] | 0 | 0.108 | 0.029 | 3.79x |
| triu_ | float32 | [1, 3072, 3072] | 0 | 0.114 | 0.045 | 2.52x |
| triu_ | float64 | [1, 3072, 3072] | 0 | 0.172 | 0.082 | 2.11x |
| triu | int8 | [1, 3072, 3072] | 0 | 0.111 | 0.056 | 2.00x |
| triu | float16 | [1, 3072, 3072] | 0 | 0.108 | 0.049 | 2.22x |
| triu | float32 | [1, 3072, 3072] | 0 | 0.116 | 0.091 | 1.27x |
| triu | float64 | [1, 3072, 3072] | 0 | 0.175 | 0.176 | 1.00x |
| **various shape** | | | | | | |
| triu_ | float32 | [1, 8192, 8192] | 0 | 0.798 | 0.311 | 2.56x |
| triu_ | float32 | [4, 1024, 1024] | 0 | 0.054 | 0.023 | 2.37x |
| triu_ | float32 | [4, 1021, 1021] | 0 | 0.054 | 0.023 | 2.33x |
| triu_ | float32 | [256, 128, 256] | 0 | 0.111 | 0.038 | 2.92x |
| triu_ | float32 | [128, 257, 125] | 0 | 0.051 | 0.029 | 1.77x |
| triu_ | float32 | [20480, 16, 16] | 0 | 0.072 | 0.036 | 1.97x |
| triu | float32 | [1, 8192, 8192] | 0 | 0.797 | 0.611 | 1.31x |
| triu | float32 | [4, 1024, 1024] | 0 | 0.056 | 0.042 | 1.32x |
| triu | float32 | [4, 1021, 1021] | 0 | 0.058 | 0.044 | 1.32x |
| triu | float32 | [256, 128, 256] | 0 | 0.114 | 0.093 | 1.22x |
| triu | float32 | [128, 257, 125] | 0 | 0.051 | 0.036 | 1.43x |
| triu | float32 | [20480, 16, 16] | 0 | 0.075 | 0.061 | 1.23x |
| **various dim** | | | | | | |
| triu_ | float32 | [3072, 3072] | 0 | 0.093 | 0.037 | 2.49x |
| triu_ | float32 | [1, 3072, 3072] | 0 | 0.114 | 0.045 | 2.52x |
| triu_ | float32 | [1, 1, 3072, 3072] | 0 | 0.138 | 0.053 | 2.60x |
| triu | float32 | [3072, 3072] | 0 | 0.097 | 0.091 | 1.07x |
| triu | float32 | [1, 3072, 3072] | 0 | 0.116 | 0.091 | 1.27x |
| triu | float32 | [1, 1, 3072, 3072] | 0 | 0.140 | 0.090 | 1.55x |
| **various k** | | | | | | |
| triu_ | float16 | [1, 3072, 3072] | 0 | 0.108 | 0.029 | 3.79x |
| triu_ | float16 | [1, 3072, 3072] | 1536 | 0.103 | 0.042 | 2.44x |
| triu_ | float16 | [1, 3072, 3072] | -1536 | 0.114 | 0.020 | 5.68x |
| triu | float16 | [1, 3072, 3072] | 0 | 0.108 | 0.049 | 2.22x |
| triu | float16 | [1, 3072, 3072] | 1536 | 0.104 | 0.039 | 2.65x |
| triu | float16 | [1, 3072, 3072] | -1536 | 0.115 | 0.058 | 2.00x |

# Benchmark Code

```python3
import time
import torch

torch.manual_seed(42)

def timeit(f, run_times=1000):
    torch.cuda.synchronize()
    t1 = time.time()
    for _ in range(run_times):
        f()
    torch.cuda.synchronize()
    t2 = time.time()
    return (t2 - t1) / run_times

for dtype in [torch.int8, torch.float16, torch.float32, torch.float64]:
    for shape in [
        [1, 8192, 8192],
        [3072, 3072],
        [1, 3072, 3072],
        [1, 1, 3072, 3072],
        [4, 1024, 1024],
        [4, 1021, 1021],
        [256, 128, 256],
        [128, 257, 125],
        [20480, 16, 16],
    ]:
        for k in [0, shape[-1] // 2, -shape[-1] // 2]:
            a = torch.empty(shape, dtype=dtype, device="cuda")
            for _ in range(4):
                t_triu = timeit(lambda: a.triu(k))
                t_triu_ = timeit(lambda: a.triu_(k))
                t_clone = timeit(lambda: a.clone())
                print(dtype, shape, f"{k=}", f"triu_ {t_triu_ * 1000:.6f} ({t_triu_ / t_clone:.2f}xMemcpy)", f"triu {t_triu * 1000:.6f} ({t_triu / t_clone:.2f}xMemcpy)")

            a = torch.rand(shape, device="cuda")
            a = (a * 10).to(dtype)
            assert (a.triu(k) == a.cpu().triu(k).cuda()).all()
            assert (a.tril(k) == a.cpu().tril(k).cuda()).all()
            assert (a.clone().triu_(k) == a.triu(k)).all()
            assert (a.clone().tril_(k) == a.tril(k)).all()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115013
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-01-18 19:58:00 +00:00
16ebfbbf07 All tests run with markDynamoStrictTest now (#117763)
Last test to remove from the denylist was dynamo/test_logging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117763
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117729, #117747, #117754, #117761
2024-01-18 19:42:41 +00:00
5278200507 Add some better docs for dynamo_test_failures.py (#117761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117761
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117729, #117747, #117754
2024-01-18 19:42:41 +00:00
07216721cf [codemod] markDynamoStrictTest batch 23 (#117754)
[codemod] markDynamoStrictTest test_custom_ops
[codemod] markDynamoStrictTest test_python_dispatch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117754
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117729, #117747
2024-01-18 19:37:04 +00:00
def4959662 Revert "[inductor] allow mm template to accumulate with float16 dtype (#117479)"
This reverts commit a7fbbc2a4a05fa4863f9d0e2adabcdc5e276c675.

Reverted https://github.com/pytorch/pytorch/pull/117479 on behalf of https://github.com/PaliC due to breaking tests internally ([comment](https://github.com/pytorch/pytorch/pull/117479#issuecomment-1899032973))
2024-01-18 18:53:37 +00:00
suo
23d53a4360 add test_public_bindings to internal CI (#117712)
Enable this test in Meta-internal CI, since it's mildly infuriating to not be able to test this locally when working inside Meta.

One change:
This test uses `pkgutil.walk_packages`, which ignores namespace packages. A quirk in Meta's internal python packaging system is that it adds `__init__.py` to each source directory. So this test picks up more files to check internally than in the GitHub CI.

So I changed this test from using raw `pkgutil` to a version that also looks into namespace packages, so we're checking the same thing across both CIs.

Differential Revision: [D52857631](https://our.internmc.facebook.com/intern/diff/D52857631/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117712
Approved by: https://github.com/ezyang
2024-01-18 18:20:43 +00:00
1b773df3c6 Place .lrodata later in the binary (#117575)
Summary:
By default, in LLD 16, .lrodata is placed immediately after .rodata.
However, .lrodata can be very large in our compiled models, which leads to
relocation out-of-range errors for relative relocations. So we place it
after the other sections that are referenced from .text using relative
relocations. This is the default behavior in GNU ld.
Reviewed By: muchulee8, desertfire, khabinov, chenyang78

Differential Revision: D52557846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117575
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2024-01-18 17:58:18 +00:00
7451dd0585 Revert "Add node meta value into UnflattenedModule (#117686)"
This reverts commit cbf24ba962f72175ec1c71a25f3379f7d9149ec1.

Reverted https://github.com/pytorch/pytorch/pull/117686 on behalf of https://github.com/PaliC due to breaks internal modeling tests ([comment](https://github.com/pytorch/pytorch/pull/117686#issuecomment-1898939899))
2024-01-18 17:46:38 +00:00
5aa895e53e Don't run inductor tests in Dynamo shard (#117747)
In theory we could, but these get really slow once we turn on strict
mode, so we're not going to for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117747
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117729
2024-01-18 17:43:30 +00:00
646229218f Revert "[export] Error on not pytree-flattened nodes (#117598)"
This reverts commit 560213de2d8f734987e25680e72d565501ab8318.

Reverted https://github.com/pytorch/pytorch/pull/117598 on behalf of https://github.com/PaliC due to breaking executorch tests internally ([comment](https://github.com/pytorch/pytorch/pull/117598#issuecomment-1898926720))
2024-01-18 17:37:59 +00:00
4720109d7f [dynamo] add common methods to DistributedVariable (#117590)
This PR refactors the distributed-related variables to use
DistributedVariable for common methods, so that things like
`python_type` work for all distributed variables.

Maybe we can add `as_python_constant` to the DistributedVariable too? I
didn't add it in this PR, but if that makes sense I can update.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117590
Approved by: https://github.com/voznesenskym
2024-01-18 17:32:31 +00:00
044b9012d5 Update PocketFFT (#117595)
This updates PocketFFT submodule to 9d3ab05a7f

Probably fixes https://github.com/pytorch/pytorch/issues/117589 (as it includes https://github.com/mreineck/pocketfft/issues/5 that should fix PocketFFT compilation on Windows)

Also adjusts the `#if __cplusplus >= 201703` replacement path in the Android scripts (the fix still needs to be submitted back to PocketFFT)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117595
Approved by: https://github.com/huydhn
2024-01-18 17:08:44 +00:00
db1a6eda9e [codemod] markDynamoStrictTest batch 22 (#117729)
[codemod] markDynamoStrictTest test_autograd
[codemod] markDynamoStrictTest test_ao_sparsity
[codemod] markDynamoStrictTest test_jit
[codemod] markDynamoStrictTest test_quantization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117729
Approved by: https://github.com/bdhirsh
2024-01-18 16:59:26 +00:00
fa86fa7a61 Fix MSVC 14.38 - VS 2022 Build (#117497)
Fixes #115922

This PR is prepared to separate existing https://github.com/pytorch/pytorch/pull/116926 and to apply suggestions in the review.

`scalar_t`, which is defined as `c10::impl::ScalarTypeToCPPType<ScalarType::Half>::t`, appears to be causing the issue with `Visual Studio 2022 17.8.4` (shipping with `MSVC 14.38.33130`)

Error message:
```
aten\src\ATen/cpu/vec/vec_base.h(150): fatal error C1001: Internal compiler error.
(compiler file 'D:\a_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\toinil.c', line 910)
```

---

A related line (the `scalar_t` definition) was previously added as a workaround for a similar issue: [Fix compile error for vs2022](https://github.com/pytorch/pytorch/pull/85958)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117497
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-01-18 16:53:27 +00:00
a669319450 [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
2024-01-18 16:20:12 +00:00
6e4e81a9ef [dynamo] Extend LazyVariableTracker to tuples (#117426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117426
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-01-18 15:51:28 +00:00
26956980c6 [AOTI] Add torch._export.aot_load (#117610)
Summary: Add a torch._export.aot_load API that can load an AOTInductor-compiled model.so into a Python process.
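
A hedged usage sketch (the example model, input shape, and CUDA device are placeholders, and the `aot_compile` call is assumed to be the pre-existing compile entry point):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2 + 1

x = torch.randn(8, device="cuda")
so_path = torch._export.aot_compile(M(), (x,))            # compile the model to a .so
runner = torch._export.aot_load(so_path, device="cuda")   # new API: load it back into Python
print(runner(x))
```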

Test Plan: CI

Differential Revision: D52825456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117610
Approved by: https://github.com/angelayi, https://github.com/khabinov, https://github.com/chenyang78
2024-01-18 15:02:16 +00:00
2fb9d8811f Don't try to directly compare symbols, it won't work (#117674)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117674
Approved by: https://github.com/lezcano
2024-01-18 12:18:45 +00:00
8bf788c390 [SAC][Dynamo] Add support for functools.partial in CheckpointHigherOrderVariable (#117657)
# Context

In some cases, we might want to build the `context_fn` with runtime-defined policies. One way of implementing this is to make `context_fn` be a partial, which holds the information that we want to pass. One concrete example is the [automatic policy selection from `xformers`](ad986981b1/xformers/checkpoint.py (L185)).
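
For example, a runtime-selected policy can be baked into `context_fn` roughly like this. It is a minimal sketch with a no-op policy; the real xformers policy logic is omitted and the names are illustrative.
```python
import contextlib
import functools
import torch
from torch.utils.checkpoint import checkpoint

def policy_context_fn(policy):
    # `context_fn` must return a pair of context managers: one for the original
    # forward and one for recomputation. A no-op pair stands in for a real
    # selective-checkpointing policy here; `policy` is unused in this sketch.
    return contextlib.nullcontext(), contextlib.nullcontext()

def f(x):
    return torch.relu(x @ x).sum()

x = torch.randn(8, 8, requires_grad=True)
ctx_fn = functools.partial(policy_context_fn, policy="recompute_everything")
out = checkpoint(f, x, use_reentrant=False, context_fn=ctx_fn)
out.backward()
```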

# The problem

The previous implementation wouldn't work with partials because `FunctoolsPartialVariable` doesn't have a `fn` attribute.

This PR addresses this case, but ideally we could get this solved in a more general fashion, as callable classes and `NestedUserFunctionVariable` are not supported by this PR.

# Tests

I've added a basic test that mimics the tests around it. The tests could probably be simplified, but I've decided to keep changes to a minimum.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117657
Approved by: https://github.com/yf225
2024-01-18 11:59:23 +00:00
b0084be114 Revert "Re-enable SGD (#117434)"
This reverts commit e7fac72be75a9fa7a31c6fc8062364fdfc4aaa3a.

Reverted https://github.com/pytorch/pytorch/pull/117434 on behalf of https://github.com/lezcano due to breaks test_profiler.py when run with dynamo ([comment](https://github.com/pytorch/pytorch/pull/117434#issuecomment-1898311961))
2024-01-18 11:37:36 +00:00
0d1e7053ac [easy] Log guard failure (#117639)
Greatly facilitates debugging guard creation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117639
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #112252, #117630, #110524, #108420
2024-01-18 09:37:33 +00:00
4ba5318d3f [dynamo] Add DictView variable tracker (#108420)
This also starts a comparison pattern where we don't ask variables
what their type is, but what their capabilities are.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108420
Approved by: https://github.com/jansel
ghstack dependencies: #112252, #117630, #110524
2024-01-18 09:37:33 +00:00
f4df0f061c Implement set in terms of dict (#110524)
This allows us to heavily simplify the implementation of set, which was
"quite unique". Now we represent a set as a dict where all its values
are None.
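
A tiny illustration of the representation (plain Python, not the actual VariableTracker code):
```python
# A set {1, 2, 3} is modeled as a dict whose values are all None.
items = {1: None, 2: None, 3: None}

items[4] = None         # set.add(4)
items.pop(2, None)      # set.discard(2)
contains = 3 in items   # membership check

assert set(items) == {1, 3, 4}
```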

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110524
Approved by: https://github.com/jansel
ghstack dependencies: #112252, #117630
2024-01-18 09:36:41 +00:00
bc85eb948f Break on unsupported keys for dicts / elements for sets (#117630)
As per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117630
Approved by: https://github.com/jansel
ghstack dependencies: #112252
2024-01-18 09:35:46 +00:00
4512a95371 [easy]Remove specialized value (#112252)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112252
Approved by: https://github.com/jansel
2024-01-18 09:34:50 +00:00
2dd4a254a0 add Half support for interpolate operators on CPU (#105648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105648
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-18 09:07:16 +00:00
c9528a11dd Add Half support for masked_softmax on CPU (#117028)
Add Half support for `masked_softmax` on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117028
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-18 08:59:20 +00:00
e60bc502b4 [Inductor Intel GPU backend Upstream] Generalize part of Inductor test case (#117513)
Following the RFC https://github.com/pytorch/pytorch/issues/114856, before upstreaming the Intel XPU Inductor backend we need to prepare the corresponding Inductor test cases. This PR aims to generalize part of the Inductor test cases so that a new GPU backend can reuse the existing test cases with minimal code changes.

This Pull Request preferentially generalizes the test cases that cover Inductor's base functionality, as follows:
- test/inductor/test_codecache.py
- test/inductor/test_codegen_triton.py
- test/inductor/test_kernel_benchmark.py
- test/inductor/test_torchinductor.py
- test/inductor/test_torchinductor_codegen_dynamic_shapes.py
- test/inductor/test_torchinductor_dynamic_shapes.py
- test/inductor/test_torchinductor_opinfo.py
- test/inductor/test_triton_heuristics.py
- test/inductor/test_triton_wrapper.py

Feature request: https://github.com/pytorch/pytorch/issues/114856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117513
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-01-18 08:26:21 +00:00
cyy
b72ddbab60 [Clang-tidy header][15/N] Enable clang-tidy on headers in c10/cuda and c10/mobile (#116602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116602
Approved by: https://github.com/ezyang
2024-01-18 08:15:50 +00:00
57ca455471 [dynamo] Add hasattr support for TupleVariable (#117694)
Summary:
This change adds hasattr support for TupleVariable in dynamo.

This fix is part of: https://github.com/pytorch/pytorch/issues/117670

Test Plan: Unit test and CI

Differential Revision: D52850665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117694
Approved by: https://github.com/yanboliang
2024-01-18 07:47:43 +00:00
bc9cb04822 Replaced CHECK with TORCH_CHECK in order to not abort, but throw a Ru… (#117653)
…ntimeError instead.

Fixes #117499.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117653
Approved by: https://github.com/antoniojkim, https://github.com/JackCaoG, https://github.com/alanwaketan
2024-01-18 07:47:22 +00:00
e7fac72be7 Re-enable SGD (#117434)
Re-enables the SGD optimizer now that compile times are more reasonable. [Benchmark run](https://github.com/pytorch/pytorch/actions/runs/7511073761)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117434
Approved by: https://github.com/anijain2305, https://github.com/janeyx99
2024-01-18 06:47:15 +00:00
79811e765c [2/4] Intel GPU Runtime Upstreaming for Device (#116833)
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR  covers the changes under `aten`.

# Design
We will compile the code for XPU separately into a library named `libtorch_xpu.so`. Currently, it primarily offers device-related APIs, including
- `getCurrentDeviceProperties`
- `getDeviceProperties`
- `getGlobalIdxFromDevice`
- `getDeviceFromPtr`

# Additional Context
`XPUHooks` is an indispensable part of the runtime. We upstream `XPUHooks` in this PR since there is some code related to `Device` in it, and we also refine some logic and code to avoid a forward declaration in `DLPack`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116833
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-01-18 05:02:42 +00:00
61ea3036bc Allow explicit shutdown of the compile-worker pools (#117664)
Summary: Allow the trainer to explicitly shut down the compile-worker pools to save CPU resources, thereby avoiding QPS degradation.

Test Plan: See the test plan in D52839313
Differential Revision: D52839313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117664
Approved by: https://github.com/yanboliang
2024-01-18 04:56:11 +00:00
1859895ffa Docs: fix docstring errors in model_averaging (#117038)
pydocstyle check

averagers.py

Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:20 in public method `__init__`:
        D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:27 in public method `average_parameters`:
        D102: Missing docstring in public method
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:84 in public method `__init__`:
        D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:106 in public method `average_parameters`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:106 in public method `average_parameters`:
        D400: First line should end with a period (not '`')
6

Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:20 in public method `__init__`:
        D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:27 in public method `average_parameters`:
        D102: Missing docstring in public method
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:84 in public method `__init__`:
        D107: Missing docstring in __init__
4

utils.py

Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:17 in public function `average_parameters`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:45 in public function `get_params_to_average`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:45 in public function `get_params_to_average`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:68 in public function `average_parameters_or_parameter_groups`:
        D200: One-line docstring should fit on one line with quotes (found 3)
5

Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:1 at module level:
        D100: Missing docstring in public module
1

hierarchical_model_averager.py

Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:16 in public class `HierarchicalModelAverager`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:98 in public method `__init__`:
        D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
        D400: First line should end with a period (not ',')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:151 in public method `average_parameters`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:151 in public method `average_parameters`:
        D400: First line should end with a period (not '`')
8

Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:99 in public method `__init__`:
        D107: Missing docstring in __init__
2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117038
Approved by: https://github.com/H-Huang
2024-01-18 04:12:51 +00:00
4f2620ce56 [PT2][split_cat] fix a bug in merge_splits (#117707)
Summary: Recently, we found that merge_splits (D45204109) is not working for the AFOC model, so we patch a fix.

Test Plan:
The error log: P1046934021
# Flows used to local reproduce
### non-first:
f522317780
after the fix: P1047603217
### first:
f522253163
after the fix: P1047764917

Differential Revision: D52856359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117707
Approved by: https://github.com/jackiexu1992
2024-01-18 04:04:32 +00:00
suo
02c96f6949 [export] modify torch.export tests to pass a Module in (#117572)
We have a lot of tests that pass a function to torch.export.

We are planning to disallow this, so fix up the tests to pass a module in.

Differential Revision: [D52791309](https://our.internmc.facebook.com/intern/diff/D52791309/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117572
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #117570, #117571
2024-01-18 03:40:40 +00:00
suo
ccc8440609 [export] introduce WrapperModule (#117571)
Simple module to wrap a callable. This is a useful utility for when we start requiring that torch.export take an nn.Module.
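
A minimal sketch of the idea (the actual class added to torch.export may differ in details):
```python
import torch

class WrapperModule(torch.nn.Module):
    """Wrap an arbitrary callable so it can be passed where an nn.Module is required."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, *args, **kwargs):
        return self.fn(*args, **kwargs)

# Usage sketch: export a bare function by wrapping it first.
ep = torch.export.export(WrapperModule(torch.sin), (torch.randn(3),))
```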

Differential Revision: [D52791310](https://our.internmc.facebook.com/intern/diff/D52791310/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117571
Approved by: https://github.com/tugsbayasgalan, https://github.com/avikchaudhuri
ghstack dependencies: #117570
2024-01-18 03:40:34 +00:00
suo
5697986482 [export] change exportdb to require torch.nn.Module (#117570)
Part of the effort to make torch.export require nn.Module.

Differential Revision: [D52631366](https://our.internmc.facebook.com/intern/diff/D52631366/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117570
Approved by: https://github.com/tugsbayasgalan
2024-01-18 03:40:10 +00:00
41153542ae Use wait stream instead of synchronize() in cudagraph warmup (#117578)
Fix for https://github.com/pytorch/pytorch/issues/113895

There are three phases to cudagraph trees: warmup, recording, and execution. On recording and execution we execute under the current stream. In warmup we execute under a side stream that we also use for cudagraph recording so as to reuse memory.

After we execute on the side stream we need to sync the current stream to the side stream. Previously there was a `torch.cuda.synchronize` but not a `torch.cuda.current_stream().wait_stream(stream)`. This PR removes the global sync and adds a wait_stream. I have confirmed that it fixes https://github.com/pytorch/pytorch/issues/113895.

It's not entirely clear to me why torch.cuda.synchronize would be insufficient - I would have thought the global sync would encompass the stream-to-stream sync. However, we do have a number of [instances](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/compile_fx.py#L748-L749) throughout the code base where we do a stream->stream sync after the global sync, so clearly I am missing something here. In any case the stream->stream sync is better for perf than a global synchronize.
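
The stream-to-stream sync pattern looks roughly like this (a hedged sketch assuming a CUDA device; the tensor and stream names are illustrative):
```python
import torch

side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(side_stream):
    out = torch.ones(1 << 20, device="cuda") * 2  # warmup work on the side stream

# Instead of a global torch.cuda.synchronize(), only make the current stream
# wait for the side stream before the output is consumed.
torch.cuda.current_stream().wait_stream(side_stream)
out.record_stream(torch.cuda.current_stream())
```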

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117578
Approved by: https://github.com/zdevito
2024-01-18 03:33:44 +00:00
560213de2d [export] Error on not pytree-flattened nodes (#117598)
Attempts to make the input/output mismatch error better by first checking whether the inputs/outputs can be pytree-flattened into supported types (tensors, symints, ...). So if a user passes in some data structure which does not have a pytree flatten registration, this will error with the message "It looks like one of the inputs is with type CustomType is not supported or pytree flatten-able.... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".

The check inside of produce_matching should now only error if something unexpected happens (dynamo accidentally adds an input or removes an output), and should be considered an internal error.
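
For reference, a hedged sketch of registering a custom container; the `Point` class and its flatten/unflatten helpers are illustrative, not from this PR.
```python
from dataclasses import dataclass

import torch
import torch.utils._pytree as pytree

@dataclass
class Point:
    x: torch.Tensor
    y: torch.Tensor

pytree.register_pytree_node(
    Point,
    lambda p: ((p.x, p.y), None),            # flatten: leaves and context
    lambda leaves, context: Point(*leaves),  # unflatten: rebuild from leaves
)

leaves, spec = pytree.tree_flatten(Point(torch.ones(2), torch.zeros(2)))
```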

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri, https://github.com/BowenBao
2024-01-18 03:06:42 +00:00
634ce3c913 Document and type torch._inductor.virtualized (#117658)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117658
Approved by: https://github.com/eellison, https://github.com/peterbell10
ghstack dependencies: #117650
2024-01-18 03:03:20 +00:00
16ff6cd340 Catch some missing unbacked symbol dependencies (#117650)
Whenever an IR node has reference to an unbacked SymInt, we must
register it as a use of the unbacked SymInt.

This fix isn't complete but the rest of the fix is fairly difficult, so
putting this in to start.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117650
Approved by: https://github.com/lezcano
2024-01-18 03:03:20 +00:00
cb2b98ad6b [codemod] markDynamoStrictTest batch 21 (#117609)
[codemod] markDynamoStrictTest test_torch
[codemod] markDynamoStrictTest test_ops_gradients
[codemod] markDynamoStrictTest test_ops
[codemod] markDynamoStrictTest test_modules
[codemod] markDynamoStrictTest test_ops_jit
[codemod] markDynamoStrictTest test_ops_fwd_gradients
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117609
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117700, #117701, #117702
2024-01-18 02:49:26 +00:00
bbf65bc451 Revert "[Dynamo] Remove the workaround since it has been fixed (#117615)"
This reverts commit b3e2571e83eff4a5ce45a7ad037c2fa2df87da9d.

Reverted https://github.com/pytorch/pytorch/pull/117615 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it seems to start failing some dynamo tests in trunk b3e2571e83.  I try to disable the failed test but yet another one shows up ([comment](https://github.com/pytorch/pytorch/pull/117615#issuecomment-1897683076))
2024-01-18 02:48:34 +00:00
cbf24ba962 Add node meta value into UnflattenedModule (#117686)
Fixes #116670
Following the lead of #116720, added node.meta['val'] back to newly created subgraphs.

node.meta['val'] is essential to ONNX in terms of the shape and type information.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117686
Approved by: https://github.com/angelayi
2024-01-18 02:37:15 +00:00
6d96beb6be [c10d] Remove health check (#117699)
https://github.com/pytorch/pytorch/pull/114916 and https://github.com/pytorch/pytorch/pull/116222 added support for eager NCCL comm init (performed as soon as `init_process_group` is called).

If any user cares about the time difference and wants to see NCCL init errors early, they can use eager init now.
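
For example, eager init can be requested roughly like this (a sketch; the assumption here is that passing `device_id` is what triggers the eager NCCL comm init, and the device index is illustrative):
```python
import torch
import torch.distributed as dist

# Passing device_id asks the NCCL backend to create the communicator eagerly,
# so init errors surface at init_process_group time rather than at the first collective.
dist.init_process_group(
    backend="nccl",
    device_id=torch.device("cuda", 0),
)
```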

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117699
Approved by: https://github.com/wconstab
2024-01-18 02:14:49 +00:00
21ddca4225 Enable HIP build for //sigrid/predictor:pytorch_disagg_gpu_task (#117616)
Summary: Tweak some header includes, and explicitly ignore the hipEventDestroy return value.

Test Plan: CI

Reviewed By: jiaqizhai

Differential Revision: D52722234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117616
Approved by: https://github.com/xw285cornell
2024-01-18 01:37:50 +00:00
3882714168 Fix check-labels.yml for ghstack PRs (#117680)
Otherwise check-labels doesn't run on ghstack PRs, see https://github.com/pytorch/pytorch/pull/117609 for example: no Check Labels workflow run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117680
Approved by: https://github.com/izaitsevfb
2024-01-18 01:33:55 +00:00
f7143b79bd Stricter pull_request_target in labeler.yml (#117677)
Copied from https://github.com/pytorch/pytorch/blob/main/.github/workflows/check-labels.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117677
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2024-01-18 01:33:49 +00:00
58c4bc62bb [c10d] Deprecate Work.result() (#117565)
Work.result() returns a vector of tensors. This signature is problematic as some collectives may return just one tensor (e.g. all-reduce), while others may return multiple tensors (e.g. all-gather).

It would be clearer/easier for users to directly access the result via the tensor/tensorlist passed to the collective APIs.

Deprecating work.result() would also allow us to remove the `outputs_` field in the Work class, avoiding an "artificial" reference to the tensor, which could potentially hold up the tensor's memory.
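
The recommended pattern then reads the result from the tensor passed to the collective, as in the sketch below (the gloo backend and tensor shape are just for illustration; run under torchrun so the rendezvous variables are set):
```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")

t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)
work.wait()

# `t` now holds the all-reduced values in place; no work.result() is needed.
print(t)
```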

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117565
Approved by: https://github.com/wconstab
2024-01-18 01:22:37 +00:00
5aa92b5090 [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)
#113713

Going to clean up some of the checks and will remove draft status after.
Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`.

CC @drisspg @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663
Approved by: https://github.com/drisspg
2024-01-18 01:20:36 +00:00
a60b566d37 [TorchElastic] Support for overprovisioning in C10 based rendezvous (#117066)
Summary:
Allow TorchElastic to manage more nodes than the maximum nnodes specified for a job. They will be used as spare capacity/warm nodes for schedulers that support elasticity.

RFC: https://github.com/pytorch/pytorch/issues/114097

Test Plan: Integration tests

Differential Revision: D52343874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117066
Approved by: https://github.com/zdevito
2024-01-18 01:16:55 +00:00
a1afd1b195 Revert "[inductor] Faster C++ kernel python bindings (#117500)"
It should never have been landed, but was landed again thanks to
ghstack grafting/ungrafting; see the discussion on https://github.com/pytorch/pytorch/pull/116910

This reverts commit e457b6fb18782425661e8a09d0222d0b29518ad1.
2024-01-17 17:06:32 -08:00
410515241d [c01d] Remove CoalescedWorkNCCL (#117696)
`CoalescedWorkNCCL` is dead code now; it is not used anywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117696
Approved by: https://github.com/wconstab
2024-01-18 01:00:43 +00:00
387ea260af [c10d] Enable watchdog for coalesced work (#117682)
Fixes https://github.com/pytorch/pytorch/issues/114301

Previously, coalesced work (created by `end_coalescing`) was not watched by the watchdog, which resulted in silent timeouts.

The culprit is that we reset `coalescing_state_` to 0 before checking it to see if we should enqueue a work.

Example:
```
import torch
import torch.distributed as dist
from datetime import timedelta

dist.init_process_group(backend="nccl", timeout=timedelta(seconds=10))
rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.device(f"cuda:{rank}")

# Create tensors of different sizes to create hang
s = 100 * 1024 * 1024 * (world_size - rank)
with dist._coalescing_manager(device=device):
    dist.all_reduce(torch.ones(s, device=device))
    dist.broadcast(torch.ones(s, device=device), src=0)

torch.cuda.synchronize()
print(f"{dist.get_rank()} done")

```

Watchdog fires:
```
$ torchrun --nproc-per-node 2 example.py
...
[rank1]:[E ProcessGroupNCCL.cpp:545] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=10000) ran for 10000 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:545] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=10000) ran for 10567 milliseconds before timing out.
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117682
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-01-18 00:42:36 +00:00
cyy
396a5c3091 [Exception] [4/N] Replace torch::IndexError and torch::ValueError with C10 counterparts (#117317)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117317
Approved by: https://github.com/ezyang
2024-01-18 00:35:29 +00:00
c64fd8b89c [codemod] markDynamoStrictTest batch 20 (#117702)
[codemod] markDynamoStrictTest test_tensorexpr_pybind
[codemod] markDynamoStrictTest test_tensorexpr
[codemod] markDynamoStrictTest test_jit_llga_fuser
[codemod] markDynamoStrictTest test_jit_fuser_te

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117702
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117700, #117701
2024-01-18 00:30:22 +00:00
3770311093 [codemod] markDynamoStrictTest batch 19 (#117701)
[codemod] markDynamoStrictTest export/test_verifier
[codemod] markDynamoStrictTest export/test_upgrade
[codemod] markDynamoStrictTest export/test_unflatten
[codemod] markDynamoStrictTest export/test_serialize
[codemod] markDynamoStrictTest export/test_serdes
[codemod] markDynamoStrictTest export/test_retraceability
[codemod] markDynamoStrictTest export/test_passes
[codemod] markDynamoStrictTest export/test_pass_infra
[codemod] markDynamoStrictTest export/test_functionalized_assertions
[codemod] markDynamoStrictTest export/test_export_nonstrict
[codemod] markDynamoStrictTest export/test_export
[codemod] markDynamoStrictTest export/test_experimental
[codemod] markDynamoStrictTest export/test_db

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117701
Approved by: https://github.com/bdhirsh, https://github.com/malfet
ghstack dependencies: #117700
2024-01-18 00:30:22 +00:00
82c0083819 Fix trition wheels build (take 2) (#117706)
Sorry, I should have been more thorough in reviewing https://github.com/pytorch/pytorch/pull/117648. Triton wheels are built off the `main` branch, rather than `nightly`; see
2db53a01e5/.github/workflows/build-triton-wheel.yml (L1-L6)

Test plan: merge and hope for the best :P

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117706
Approved by: https://github.com/huydhn, https://github.com/atalman
2024-01-18 00:26:36 +00:00
898f6a48a9 [codemod] markDynamoStrictTest batch 18 (#117700)
[codemod] markDynamoStrictTest functorch/test_vmap
[codemod] markDynamoStrictTest profiler/test_profiler_tree
[codemod] markDynamoStrictTest profiler/test_profiler
[codemod] markDynamoStrictTest profiler/test_memory_profiler
[codemod] markDynamoStrictTest functorch/test_ops
[codemod] markDynamoStrictTest functorch/test_aotdispatch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117700
Approved by: https://github.com/bdhirsh
2024-01-18 00:25:38 +00:00
b3e2571e83 [Dynamo] Remove the workaround since it has been fixed (#117615)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117615
Approved by: https://github.com/angelayi
2024-01-18 00:21:22 +00:00
3114813314 Replace constraints with dynamic_shapes in deeplearning/aot_inductor test (#117573)
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `deeplearning/aot_inductor/test/test_custom_ops.py`.

Test Plan: buck test mode/dev-nosan fbcode//deeplearning/aot_inductor/test:test_custom_ops -- test_export_extern_fallback_nodes_dynamic_shape

Differential Revision: D52790332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117573
Approved by: https://github.com/angelayi
2024-01-17 23:50:08 +00:00
2db53a01e5 propagate torch stack trace metadata to copy_() nodes during input mutations (#117587)
Tested by running the below script:
```
import torch
@torch.compile(backend="aot_eager", fullgraph=True)
def f(x):
    y = x.view(-1)
    y.mul_(2)
    return

x = torch.ones(4)
f(x)
```

Which gives me this ATen graph (notice that the copy_() node is bundled under the stacktrace for `mul_(2)`):
```
 ===== Forward graph 0 =====
 <eval_with_key>.2 from /data/users/hirsheybar/e/pytorch/torch/fx/experimental/proxy_tensor.py:521 in wrapped class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[4]"):
        # File: /data/users/hirsheybar/e/pytorch/tmp5.py:8, code: y = x.view(-1)
        view: "f32[4]" = torch.ops.aten.view.default(arg0_1, [-1])

        # File: /data/users/hirsheybar/e/pytorch/tmp5.py:9, code: y.mul_(2)
        mul: "f32[4]" = torch.ops.aten.mul.Tensor(view, 2);  view = None
        view_1: "f32[4]" = torch.ops.aten.view.default(mul, [4]);  mul = None
        copy_: "f32[4]" = torch.ops.aten.copy_.default(arg0_1, view_1);  arg0_1 = view_1 = None
        return ()

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117587
Approved by: https://github.com/eellison
2024-01-17 23:07:45 +00:00
26a63907ba Ordering placeholder and get_attr nodes in unflattened module (#116910)
Prior to this PR, the generated unflattened module could mix up the order of `placeholder` and newly created `get_attr` nodes. As `placeholder` nodes are the inputs of a function, they should be placed ahead of `get_attr` nodes.

Before:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
get_attr       bias         bias                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
placeholder    l_x_         l_x_                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

After:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
placeholder    l_x_         l_x_                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
get_attr       bias         bias                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116910
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #117409, #116667, #117591, #117500
2024-01-17 23:03:15 +00:00
e457b6fb18 [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
ghstack dependencies: #117409, #116667, #117591
2024-01-17 23:03:15 +00:00
763ddb396d Revert "[codemod] markDynamoStrictTest batch 18 (#117604)"
This reverts commit 24f288114a696a27771c075b8e8df556c13eced6.

Reverted https://github.com/pytorch/pytorch/pull/117604 on behalf of https://github.com/zou3519 due to probably a crossed merge? ([comment](https://github.com/pytorch/pytorch/pull/117604#issuecomment-1897082562))
2024-01-17 22:16:27 +00:00
01c0c67937 Revert "[codemod] markDynamoStrictTest batch 19 (#117605)"
This reverts commit 0cda1e0b218895ce6121531991348b8bcbce9b94.

Reverted https://github.com/pytorch/pytorch/pull/117605 on behalf of https://github.com/zou3519 due to probably a crossed merge? ([comment](https://github.com/pytorch/pytorch/pull/117605#issuecomment-1897065994))
2024-01-17 22:12:59 +00:00
87c2427173 Revert "[codemod] markDynamoStrictTest batch 20 (#117606)"
This reverts commit 308e154af5fd6388f49eabe631e7b78ca3ac9c39.

Reverted https://github.com/pytorch/pytorch/pull/117606 on behalf of https://github.com/zou3519 due to probably a crossed merge? ([comment](https://github.com/pytorch/pytorch/pull/117606#issuecomment-1897042843))
2024-01-17 22:08:20 +00:00
84cfe6d8b2 Drop all gather stats to debug not warning (#117669)
The logger's default level results in these all-gather stats being spammed into every run, which is very annoying.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117669
Approved by: https://github.com/Skylion007, https://github.com/awgu
2024-01-17 21:44:59 +00:00
8841d26046 [dynamo] LazyVariable - redirect __str__ to the realized variable __str__ (#117583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117583
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-01-17 21:12:12 +00:00
a7fbbc2a4a [inductor] allow mm template to accumulate with float16 dtype (#117479)
Fixes #108621

replace #108637 and #108982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117479
Approved by: https://github.com/jansel
2024-01-17 21:01:14 +00:00
208e64a9ba Initial implementation of FakeTensor caching (#113873)
Summary: Cache the result of FakeTensor dispatch and skip re-evaluation on cache hits.

Test Plan: New unit tests. Caching is enabled in this diff, so all existing tests exercise the cache as well.

Differential Revision: [D52841637](https://our.internmc.facebook.com/intern/diff/D52841637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113873
Approved by: https://github.com/eellison
2024-01-17 20:38:54 +00:00
c0940d2e93 [pytree] reuse flatten_fn in flatten_with_keys_fn to ensure consistency (#117656)
Reuse `flatten_fn` in `flatten_with_keys_fn` to ensure `flatten_fn` and `flatten_with_keys_fn` get the same `leaves` and `context`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117656
Approved by: https://github.com/suo
2024-01-17 20:38:49 +00:00
bffc8ecfb0 [codemod] Fix shadows in PyTorch (#117562)
Test Plan: Sandcastle

Differential Revision: D52802592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117562
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-01-17 20:33:50 +00:00
da6abaeeac Revert "[inductor] Faster C++ kernel python bindings (#117500)"
This reverts commit bb0fd1bd3ca145b77159427bc5bacf5f98ec3896.

Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896516512))
2024-01-17 19:34:26 +00:00
cb0bfcf590 Revert "Ordering placeholder and get_attr nodes in unflattened module (#116910)"
This reverts commit 12561bb5fed08283baf7a31e6678341a04e83adb.

Reverted https://github.com/pytorch/pytorch/pull/116910 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896516512))
2024-01-17 19:34:26 +00:00
89cf1ddb5c [AOTInductor] Allow user to explicitly specify Device to run on (#117413)
Summary:
AOTInductor currently infers the CUDA device index via `cudaGetDevice()`. This assumes the outer runtime calls `cudaSetDevice()` somewhere before invoking the AOTInductor run.

This diff adds an explicit argument for specifying the target device, e.g. compiled on "cuda:0", run on "cuda:1".

todo:
- Are the changes in interface.h BC-breaking, since they change the function signatures in the .so file? We might just need to introduce a new "Create" function.

Test Plan: CI

Differential Revision:
D52747132

Privacy Context Container: 368960445142440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117413
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/khabinov
2024-01-17 19:28:04 +00:00
308e154af5 [codemod] markDynamoStrictTest batch 20 (#117606)
[codemod] markDynamoStrictTest test_tensorexpr_pybind
[codemod] markDynamoStrictTest test_tensorexpr
[codemod] markDynamoStrictTest test_jit_llga_fuser
[codemod] markDynamoStrictTest test_jit_fuser_te
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117606
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219, #117604, #117605
2024-01-17 19:20:11 +00:00
0cda1e0b21 [codemod] markDynamoStrictTest batch 19 (#117605)
[codemod] markDynamoStrictTest export/test_verifier
[codemod] markDynamoStrictTest export/test_upgrade
[codemod] markDynamoStrictTest export/test_unflatten
[codemod] markDynamoStrictTest export/test_serialize
[codemod] markDynamoStrictTest export/test_serdes
[codemod] markDynamoStrictTest export/test_retraceability
[codemod] markDynamoStrictTest export/test_passes
[codemod] markDynamoStrictTest export/test_pass_infra
[codemod] markDynamoStrictTest export/test_functionalized_assertions
[codemod] markDynamoStrictTest export/test_export_nonstrict
[codemod] markDynamoStrictTest export/test_export
[codemod] markDynamoStrictTest export/test_experimental
[codemod] markDynamoStrictTest export/test_db
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117605
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219, #117604
2024-01-17 19:20:11 +00:00
24f288114a [codemod] markDynamoStrictTest batch 18 (#117604)
[codemod] markDynamoStrictTest functorch/test_vmap
[codemod] markDynamoStrictTest profiler/test_profiler_tree
[codemod] markDynamoStrictTest profiler/test_profiler
[codemod] markDynamoStrictTest profiler/test_memory_profiler
[codemod] markDynamoStrictTest functorch/test_ops
[codemod] markDynamoStrictTest functorch/test_aotdispatch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117604
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219
2024-01-17 19:20:01 +00:00
006d655956 [codemod] markDynamoStrictTest batch 17 (#117219)
[codemod] markDynamoStrictTest test_xnnpack_integration
[codemod] markDynamoStrictTest test_vulkan
[codemod] markDynamoStrictTest test_package
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117219
Approved by: https://github.com/bdhirsh
2024-01-17 19:19:50 +00:00
1967165d4d [codemod] markDynamoStrictTest batch 16 (#117218)
[codemod] markDynamoStrictTest test_public_bindings
[codemod] markDynamoStrictTest test_package
[codemod] markDynamoStrictTest test_legacy_vmap
[codemod] markDynamoStrictTest test_namedtensor
[codemod] markDynamoStrictTest test_fx
[codemod] markDynamoStrictTest test_dataloader
[codemod] markDynamoStrictTest test_content_store
[codemod] markDynamoStrictTest test_schema_check
[codemod] markDynamoStrictTest lazy/test_ts_opinfo
[codemod] markDynamoStrictTest functorch/test_vmap_registrations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117218
Approved by: https://github.com/bdhirsh, https://github.com/voznesenskym
ghstack dependencies: #117409, #116667, #117591, #117500, #116910, #117553
2024-01-17 19:12:41 +00:00
ca0abf8606 Add inductor-specific testing strict mode denylist (#117553)
We have one for Dynamo that currently applies to all "compile"
configurations (PYTORCH_TEST_WITH_DYNAMO, PYTORCH_TEST_WITH_INDUCTOR). I
don't want to figure out the inductor situation right now, so we're
going to add another denylist for inductor and work through it later.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117553
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117409, #116667, #117591, #117500, #116910
2024-01-17 19:12:41 +00:00
12561bb5fe Ordering placeholder and get_attr nodes in unflattened module (#116910)
Prior to this PR, the generated unflattened module could mix up the order of `placeholder` and newly created `get_attr` nodes. As `placeholder` is the input of the function, it should be placed ahead of `get_attr` nodes.

Before:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
get_attr       bias         bias                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
placeholder    l_x_         l_x_                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

After:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
placeholder    l_x_         l_x_                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
get_attr       bias         bias                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116910
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #117409, #116667, #117591, #117500
2024-01-17 19:12:33 +00:00
bb0fd1bd3c [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
ghstack dependencies: #117409, #116667, #117591
2024-01-17 19:12:24 +00:00
0c26565d5d Revert "Add pull request target to bc lint (#106065)"
This reverts commit d4136c90882337a0891f5216292e9e3d55c13262.

Reverted https://github.com/pytorch/pytorch/pull/106065 on behalf of https://github.com/izaitsevfb due to Tightening CI security ([comment](https://github.com/pytorch/pytorch/pull/106065#issuecomment-1896439167))
2024-01-17 18:51:46 +00:00
9da01affd3 Revert "[inductor] Faster C++ kernel python bindings (#117500)"
This reverts commit 3a52147cc59b240737602d3d046080bbf6f567f1.

Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))
2024-01-17 18:42:39 +00:00
8c7e3a18ff Revert "Ordering placeholder and get_attr nodes in unflattened module (#116910)"
This reverts commit 5e0e78585d9f662ecb957c327c8d3fa31bff4f9a.

Reverted https://github.com/pytorch/pytorch/pull/116910 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))
2024-01-17 18:42:39 +00:00
e877c2e6ff Revert "Add inductor-specific testing strict mode denylist (#117553)"
This reverts commit ab6207a34248fdf2d2766d0062f358b63380e151.

Reverted https://github.com/pytorch/pytorch/pull/117553 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))
2024-01-17 18:42:39 +00:00
7f3cac06b9 Revert "[codemod] markDynamoStrictTest batch 16 (#117218)"
This reverts commit 46a8408fa123da571dc1c13dba9479ba6d540249.

Reverted https://github.com/pytorch/pytorch/pull/117218 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))
2024-01-17 18:42:39 +00:00
29fa6fbc4e [Dynamo] Fix a corner case of reinplace_inplaceable_ops pass for triton kernels (#117612)
Summary:
We saw the following failure when compiling custom triton kernels:
```
RuntimeError: Argument 'getitem_22' of Node 'triton_kernel_wrapper_functional_proxy_3' was used before it has been defined! Please check that Nodes in the graph are topologically ordered
```
The root cause is that, when doing the replacement, the replacement itself can be replaced by another replacement. The fix keeps following the chain of replacements until it reaches one that is not replaced.
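A minimal sketch of the chain-following idea (names here are illustrative, not the pass's actual variables):

```python
# Follow a chain of replacements until reaching a node that is not itself
# replaced: with a -> b and b -> c, resolving a yields c.
def resolve(node, replacements):
    while node in replacements:
        node = replacements[node]
    return node
```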

Test Plan:

Added a test case


Pull Request resolved: https://github.com/pytorch/pytorch/pull/117612
Approved by: https://github.com/aakhundov
2024-01-17 18:41:42 +00:00
e94b79f627 Revert "[codemod] markDynamoStrictTest batch 17 (#117219)"
This reverts commit 5bb2298da769121421711504da47955d3129b54f.

Reverted https://github.com/pytorch/pytorch/pull/117219 on behalf of https://github.com/PaliC due to sadly I need to revert these in order to revert https://github.com/pytorch/pytorch/pull/117500 ([comment](https://github.com/pytorch/pytorch/pull/117219#issuecomment-1896407436))
2024-01-17 18:35:56 +00:00
8483f493af Revert "[codemod] markDynamoStrictTest batch 18 (#117604)"
This reverts commit 70b22be32a2e6a1a51cb70a1418d73bfba533cc0.

Reverted https://github.com/pytorch/pytorch/pull/117604 on behalf of https://github.com/PaliC due to sadly I need to revert these in order to revert https://github.com/pytorch/pytorch/pull/117500 ([comment](https://github.com/pytorch/pytorch/pull/117219#issuecomment-1896407436))
2024-01-17 18:35:56 +00:00
0bfd9653ef Revert "[codemod] markDynamoStrictTest batch 19 (#117605)"
This reverts commit 45d7859e751dff2096df8b346226b71cf6031424.

Reverted https://github.com/pytorch/pytorch/pull/117605 on behalf of https://github.com/PaliC due to sadly I need to revert these in order to revert https://github.com/pytorch/pytorch/pull/117500 ([comment](https://github.com/pytorch/pytorch/pull/117219#issuecomment-1896407436))
2024-01-17 18:35:56 +00:00
d51583b214 Revert "[codemod] markDynamoStrictTest batch 20 (#117606)"
This reverts commit ab847a2f5c903c629f4e2ab9bfea11f7edc1cf0e.

Reverted https://github.com/pytorch/pytorch/pull/117606 on behalf of https://github.com/PaliC due to sadly I need to revert these in order to revert https://github.com/pytorch/pytorch/pull/117500 ([comment](https://github.com/pytorch/pytorch/pull/117219#issuecomment-1896407436))
2024-01-17 18:35:56 +00:00
06dab05405 Revert "[export] Error on not pytree-flattened nodes (#117598)"
This reverts commit 35e847830511b2c700586d312177794be094d67e.

Reverted https://github.com/pytorch/pytorch/pull/117598 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it is failing ONNX test in trunk 35e8478305, probably a landrace as the PR signal looks fine ([comment](https://github.com/pytorch/pytorch/pull/117598#issuecomment-1896389009))
2024-01-17 18:29:04 +00:00
d0fc268918 Fixed issue in upsample_nearestnd lowering with scales (#117538)
Fixed #116848

Related to the bug introduced in my previous PR here: https://github.com/pytorch/pytorch/pull/113749/files#diff-a1b077971cddfabfa0071c5162265066e867bc07721816d95b9cbe58431c38e3R3264

Originally, the code was
```python
def upsample_nearestnd(
    x,
    output_size,
    scales_x: Tuple[Optional[float], ...],
    n: int = 2,
    exact: bool = False,
):
   # ...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales):
        if scale:
            scales[i] = scale
```
which is wrong, as `scales_x` is never used even though it can be provided by the user. The code worked in cases where the user-provided scale could be recomputed from the `input / output` sizes, e.g. scale=2.0. However, it would fail when the input scale is a float value such as 2.3: the recomputed scale is slightly different (e.g. 2.292682926829268, depending on the input and output sizes) and can lead to inconsistent output.
This problem was "fixed" to the following in my previous PR: https://github.com/pytorch/pytorch/pull/113749
```python
def upsample_nearestnd(
    x,
    output_size,
    scales_x: Tuple[Optional[float], ...],
    n: int = 2,
    exact: bool = False,
):
   # ...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales_x):
        if scale:
            scales[i] = scale
```
However, this leads to a wrong scale value, as the user-provided scale should be inverted (1 / scale) before being used.
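A hedged sketch of the corrected handling (assuming the lowering keeps scales as input/output ratios, so a user-provided output/input scale factor has to be inverted before it overrides the recomputed value):

```python
# scales are kept as input/output ratios; a user-supplied scale factor is
# output/input, so it must be inverted before overriding the recomputed value.
scales = [i / o for i, o in zip(i_sizes, o_sizes)]
for i, scale in enumerate(scales_x):
    if scale:
        scales[i] = 1.0 / scale
```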

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117538
Approved by: https://github.com/peterbell10
2024-01-17 18:14:35 +00:00
ab847a2f5c [codemod] markDynamoStrictTest batch 20 (#117606)
[codemod] markDynamoStrictTest test_tensorexpr_pybind
[codemod] markDynamoStrictTest test_tensorexpr
[codemod] markDynamoStrictTest test_jit_llga_fuser
[codemod] markDynamoStrictTest test_jit_fuser_te
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117606
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219, #117604, #117605
2024-01-17 17:43:27 +00:00
45d7859e75 [codemod] markDynamoStrictTest batch 19 (#117605)
[codemod] markDynamoStrictTest export/test_verifier
[codemod] markDynamoStrictTest export/test_upgrade
[codemod] markDynamoStrictTest export/test_unflatten
[codemod] markDynamoStrictTest export/test_serialize
[codemod] markDynamoStrictTest export/test_serdes
[codemod] markDynamoStrictTest export/test_retraceability
[codemod] markDynamoStrictTest export/test_passes
[codemod] markDynamoStrictTest export/test_pass_infra
[codemod] markDynamoStrictTest export/test_functionalized_assertions
[codemod] markDynamoStrictTest export/test_export_nonstrict
[codemod] markDynamoStrictTest export/test_export
[codemod] markDynamoStrictTest export/test_experimental
[codemod] markDynamoStrictTest export/test_db
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117605
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219, #117604
2024-01-17 17:43:27 +00:00
70b22be32a [codemod] markDynamoStrictTest batch 18 (#117604)
[codemod] markDynamoStrictTest functorch/test_vmap
[codemod] markDynamoStrictTest profiler/test_profiler_tree
[codemod] markDynamoStrictTest profiler/test_profiler
[codemod] markDynamoStrictTest profiler/test_memory_profiler
[codemod] markDynamoStrictTest functorch/test_ops
[codemod] markDynamoStrictTest functorch/test_aotdispatch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117604
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219
2024-01-17 17:43:17 +00:00
6d1406d177 [oidc] Migrate Triton wheel upload to oidc (#117648)
Fix for triton upload job that is currently failing:
https://github.com/pytorch/pytorch/actions/runs/7555471235/job/20574022304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117648
Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt, https://github.com/malfet
2024-01-17 17:04:36 +00:00
35e8478305 [export] Error on not pytree-flattened nodes (#117598)
Attempts to make the input/output mismatch error better by first checking whether the inputs/outputs can be pytree-flattened into supported types (tensors, symints, ...). So if a user passes in some data structure which does not have a pytree flatten registration, this will error with the message "It looks like one of the inputs with type CustomType is not supported or pytree flatten-able... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".

The check inside of produce_matching should now only error if something unexpected happens (dynamo accidentally adds an input or removes an output), and should be considered an internal error.
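For illustration, a minimal sketch of registering a pytree flatten/unflatten for a custom container, assuming the `torch.utils._pytree.register_pytree_node` API the error message points to (the `Pair` container is hypothetical):

```python
import torch
import torch.utils._pytree as pytree
from dataclasses import dataclass

@dataclass
class Pair:  # hypothetical custom container passed to export
    a: torch.Tensor
    b: torch.Tensor

pytree.register_pytree_node(
    Pair,
    lambda p: ((p.a, p.b), None),            # flatten: children + static context
    lambda children, _ctx: Pair(*children),  # unflatten
)
```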

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri
2024-01-17 16:33:57 +00:00
40a6710ad3 Mark set_ as an inplace view op (#115769)
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching them.

Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`

Differential Revision: [D52814561](https://our.internmc.facebook.com/intern/diff/D52814561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
2024-01-17 15:32:18 +00:00
5bb2298da7 [codemod] markDynamoStrictTest batch 17 (#117219)
[codemod] markDynamoStrictTest test_xnnpack_integration
[codemod] markDynamoStrictTest test_vulkan
[codemod] markDynamoStrictTest test_package
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117219
Approved by: https://github.com/bdhirsh
2024-01-17 14:41:07 +00:00
3bb8d2b905 Update triton ROCm version to 6.0 (#117433)
Related to PyTorch nightly wheels upgrade to ROCm6.0: https://github.com/pytorch/pytorch/pull/116983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117433
Approved by: https://github.com/malfet, https://github.com/jeffdaily
2024-01-17 12:09:45 +00:00
e2830e6328 [PyTorch] SDPA decomp: actually use attn_mask (#117579)
Summary: Need to pass this along

Test Plan:
```
cd ~/fbsource/fbcode/executorch/backends/xnnpack/test
buck test fbcode//mode/dev-nosan :test_xnnpack_ops -- test_fp32_sdpa
buck run fbcode//mode/dev-nosan :test_xnnpack_models -- executorch.backends.xnnpack.test.models.llama2_et_example.TestLlama2ETExample.test_fp32
```

Reviewed By: larryliu0820

Differential Revision: D52812369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117579
Approved by: https://github.com/larryliu0820
2024-01-17 10:26:43 +00:00
1deb75b584 [c10d] Move the timeout dump check from watchdog to monitoring thread (#117168)
To avoid a potential hang in the watchdog thread, which would prevent us from dumping timeout debugging info, we move the check of global collective timeout signals and the dumping of debugging info to the monitoring thread. We also need to ensure that we don't wait too long to check the timeout signal from the store; otherwise, we will miss the signal and won't get the debugging info dumped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117168
Approved by: https://github.com/wconstab
2024-01-17 08:05:40 +00:00
ed6006ee5d [Reland][ONNX] Guard xfail tests with error messages (#117592)
Reland #117425

Prior to this PR, xfail tests didn't guarantee (1) the error message/reason (which could be outdated) or (2) execution of the test (xfail_if_model_type_is_not_exportedprogram). Tests labeled xfail were therefore less robust, as we couldn't be sure they were still failing, let alone for the same reason. This PR fixes the issue with try/except plus error-message matching to consolidate the xfail truth and reason.
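A minimal sketch of the guarded-xfail pattern described above (the helper name and structure are illustrative, not the test suite's actual decorator):

```python
def run_expecting_failure(test_fn, expected_msg):
    try:
        test_fn()
    except Exception as e:
        # Only accept the failure if it still fails for the recorded reason.
        assert expected_msg in str(e), f"xfail reason changed: {e}"
    else:
        raise AssertionError("test unexpectedly passed; remove the xfail")
```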
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117592
Approved by: https://github.com/BowenBao
2024-01-17 08:05:35 +00:00
9448065061 [pytree] add key path api (#116786)
This PR introduces a key path API to pytrees, drawing direct inspiration from JAX's [key path API](https://jax.readthedocs.io/en/latest/jax-101/05.1-pytrees.html#key-paths).

I added the 3 APIs described there, and a registry of `flatten_with_keys` fns for each node type, which is a version of `flatten` that also returns `KeyEntry`s describing how to access values from the original pytree.
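As a rough sketch of what the key path API enables (the function names below follow the JAX-style naming and are an assumption, not necessarily the exact API added here):

```python
import torch
import torch.utils._pytree as pytree

tree = {"weight": torch.randn(2, 2), "layers": [torch.randn(3), torch.randn(4)]}
# Flatten while recording how each leaf is reached from the root.
leaves_with_paths, spec = pytree.tree_flatten_with_path(tree)
for keypath, leaf in leaves_with_paths:
    # keypath is a tuple of KeyEntry objects, e.g. (MappingKey('weight'),)
    print(pytree.keystr(keypath), tuple(leaf.shape))
```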

Current use cases for this API:
- Folks would like to do argument traversal over input pytrees to do verification and compatibility enforcement. Keypaths are useful for this—https://fburl.com/code/06p7zrvr is a handrolled pass doing basically the same thing but probably more fragilely.
- In export non-strict mode, we need to figure out a way to track sources for pytree inputs. In strict mode, dynamo handles this for us, but we'd like a decoupled component to handle this when we're not using dynamo.

I'm sure there are places it would be useful.

Some design notes:
- I only implemented the API for  the Python pytree impl. optree has some differences in how their keypath APIs are designed (see https://github.com/pytorch/pytorch/issues/113378 for discussion). I have some issues with the proposed typed_path solution in that discussion and prefer JAX's API, but we can hash that out separately.
- The way folks register a `flatten_with_keys` fn is through a new kwarg to `register_pytree_node`. This follows how we do serialization fns, although the list of additional arguments is getting unwieldy.
- My impl handles pytrees with an undefined `flatten_with_keys` fn differently from JAX: I raise an error, while JAX creates a fallback keyentry.

Differential Revision: [D52547850](https://our.internmc.facebook.com/intern/diff/D52547850/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116786
Approved by: https://github.com/voznesenskym
2024-01-17 07:24:35 +00:00
5667a990fd Chore: improve log message about cache size limit exceeded (#116557)
Fixes #114527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116557
Approved by: https://github.com/ezyang
2024-01-17 06:07:18 +00:00
3cd2c68fbe Fix syntax highlighting in android (#117439)
Hi, I found that code blocks are not highlighted properly.

This PR aims to fix that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117439
Approved by: https://github.com/ezyang
2024-01-17 05:17:13 +00:00
735715e6d3 [Dynamo] Make profiler function will be ignored warn only once (#117585)
Fix #111632

#111622 accidentally reverted #111921, we should bring it back.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117585
Approved by: https://github.com/williamwen42, https://github.com/mlazos, https://github.com/msaroufim
2024-01-17 04:05:45 +00:00
2c5488d719 Match all_gather_into_tensor args names in remapping (#117224)
Fixes #114179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117224
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2024-01-17 03:50:29 +00:00
8f1bc876b2 [quant] Support custom qmin/qmax for activation and weight for xnnpack quantizer (#117305)
Summary:
As titled, this allows us to experiment with 4-bit quant in xnnpack.

Test Plan:
python test/test_quantization.py -k test_dynamic_linear_int4_weight

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117305
Approved by: https://github.com/digantdesai
2024-01-17 03:22:49 +00:00
e4c2dfb35b [Dynamo, ONNX] Run llama attention with onnxrt and dynamic shapes (#117009)
As title. This PR enables dynamic shapes for running llama with ORT. Both forward and backward are captured as a single graph with this PR.

Summary of changes:
- Test llama attention, llama decoder, llama model to ensure (1) no graph breaks (2) models exported with dynamic shapes with onnxrt dynamo backend
- Reshape SymInt to tensor with shape (1,) to align with the cast done for int in fx_onnx_interpreter.py
- Create an util function to map Python types (e.g., float) to ONNX tensor element type (e.g., onnx.TensorProto.FLOAT).
- Return `hint` for torch.Sym* in type promotion pass.
- Remove _replace_to_copy_with_to since the exporter supports aten::_to_copy now.
- Modify _get_onnx_devices to return CPU device for torch.Sym*.
- Introduce _adjust_scalar_from_fx_to_onnx (e.g., change 0 to tensor(0)) and _adjust_scalar_from_onnx_to_fx (e.g., change tensor(0) to 0) for adjusting scalars when passing values to and receiving values from ORT.
- Now, the ValueInfoProto of graph inputs (i.e., input_value_infos) is stored and used as the `ORT-expected type` when calling `_adjust_scalar_from_fx_to_onnx`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117009
Approved by: https://github.com/titaiwangms
2024-01-17 03:02:41 +00:00
fb06ed36d1 Change dynamo_test_failures.py to silently run skipped tests (#117401)
- We silently run skipped tests and then raise a skip message with the
  error message (if any)
- Instead of raising expectedFailure, we raise a skip message with the
  error message (if any)

We log the skip messages in CI, so this will let us read the logs and do
some basic triaging of the failure messages.

Test Plan:
- existing tests. I hope that there are no tests that cause each other
  to fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117401
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117391, #117400
2024-01-17 02:48:19 +00:00
9056c7d941 use getPinnedMemoryAllocator for privateuseone (#117530)
Fixes #117482

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117530
Approved by: https://github.com/ezyang
2024-01-17 02:33:02 +00:00
8852bb561c More efficient multi-threading in Softmax & LogSoftmax CPU kernels (#116367)
### Summary
In #85398, while fixing a bug (which was _not caused by, but was exposed by_ AVX512 implementation) in `_vec_logsoftmax_lastdim`, I had made some revisions to use more threads in some cases, but was asked to roll back [those changes](https://github.com/pytorch/pytorch/pull/85398#discussion_r1087680237) during the PR's review.
At the time, landing that PR asap seemed essential, so I agreed to roll back that change.

In some cases, more threads can be used than are being used with the current approach.
<strike>In this PR, I'm reintroducing those changes, which are geared towards more efficient multi-threading.</strike>
On second thought, even for softmax kernels other than `_vec_log_softmax_lastdim` and `_vec_softmax_lastdim`, we could simply use a `grain_size` of 0 or 1 instead of complicating the code, because `CHUNK_SIZE` for each thread is already computed per some heuristic. With a `grain_size` of `0`, work among the OpenMP threads (which, by the way, stay constant in number unless explicitly changed, since we don't use the OpenMP `num_threads` clause in PyTorch) would be distributed equitably, yielding a similar speedup to the approach in the first commit of this PR.
I've also added op-level benchmarks pertaining to example input shapes in this PR.

### Benchmarks

Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen, formerly codenamed Sapphire Rapids)
One socket of 48 physical cores was used, with & without HyperThreading.
Intel OpenMP & tcmalloc were preloaded.

Softmax benchmarks can be run with the following command, but the relevant benchmarks are the last dim ones -
`KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 python -m pt.softmax_test --tag-filter all`

#### Already existing benchmarks
|Benchmark name (dim is 1, by default) | Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup Percentage = (old-new)*100/old | Speedup ratio (old/new)|
|-------------|--------|-------|----------------------------|----------|
|Softmax_N1_C3_H256_W256_cpu|31.364|11.594|63.03%  |2.705|
|Softmax_N4_C3_H256_W256_cpu|34.475|24.966| 27.58%|1.380|
|Softmax_N8_C3_H512_W256_cpu|94.044|78.372|16.66%|1.199|
|Softmax2d_N8_C3_H512_W256_cpu|100.195|79.529|20.62%|1.259|

#### Some of the following benchmarks are being added in this PR
|Benchmark name| Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup percentage = (old-new)*100/old| Speedup ratio  (old/new) |
|-------------|--------|-------|----------------------------|--------------------|
|LogSoftmax_M128_N128_dim1_cpu|7.629|6.475|15.12%| 1.178|
|LogSoftmax_M48_N128_dim1_cpu|6.848|5.969|12.83%| 1.147|
|LogSoftmax_M16_N1024_dim1_cpu|7.004|6.322|9.73%| 1.107|
|LogSoftmax_M32_N1024_dim1_cpu|7.037|6.558|6.80%| 1.073|
|LogSoftmax_M48_N1024_dim1_cpu|7.155|6.773|5.33%|1.056|
|LogSoftmax_M16_N512_dim1_cpu|6.797|5.862|13.75%|1.159|
|LogSoftmax_M32_N512_dim1_cpu|7.223|6.202|14.13%|1.164|
|LogSoftmax_M48_N512_dim1_cpu|7.159|6.301|11.98%|1.136|
|LogSoftmax_M16_N256_dim1_cpu|6.842|5.682|16.95%|1.204|
|LogSoftmax_M32_N256_dim1_cpu|6.840|6.086|11.02%|1.123|
|LogSoftmax_M48_N256_dim1_cpu|7.005|6.031|13.94%|1.161|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116367
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-17 02:26:29 +00:00
4a54ab328c Removed an internal assertion for the optional stable value and instead defaulted to the standard (=false) (#117414)

Fixes #117255.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117414
Approved by: https://github.com/ezyang
2024-01-17 02:25:21 +00:00
1872834247 [MPS] Fix torch.mm correctness for large matrices (#117549)
Currently `matrixMultiplicationWithPrimaryTensor:secondaryTensor:` returns incorrect results if one of the matrix dimensions is greater than 32K
Solve it by providing a very naive matrix multiplication Metal shader and calling it if the stride size is greater than 32768 elements. Slicing inside the MPSGraph doesn't work either, since `-sliceTensor:starts:ends:strides:` somehow affects matmul as well if tiling is done as follows:
```objc
  NSMutableArray<MPSGraphTensor*>* rows = [NSMutableArray new];
  for (int64_t i = 0; i < M; i += tile_size) {
    const auto i_end = std::min(i + tile_size, M);
    NSMutableArray<MPSGraphTensor*>* row_chunks = [NSMutableArray new];
    for (int64_t j = 0; j < K; j += tile_size) {
      const auto j_end = std::min(j + tile_size, K);
      MPSGraphTensor* tile = nil;
      for (int64_t k = 0; k < N; k += tile_size) {
        const auto k_end = std::min(k + tile_size, N);
        auto selfChunk = [graph sliceTensor:selfTensor
                                     starts:@[ @(i), @(k) ]
                                       ends:@[ @(i_end), @(k_end) ]
                                    strides:@[ @(1), @(1) ]
                                       name:nil];
        auto otherChunk = [graph sliceTensor:otherTensor
                                      starts:@[ @(k), @(j) ]
                                        ends:@[ @(k_end), @(j_end) ]
                                     strides:@[ @(1), @(1) ]
                                        name:nil];
        auto chunkMM = [graph matrixMultiplicationWithPrimaryTensor:selfChunk secondaryTensor:otherChunk name:nil];

        tile = tile ? [graph additionWithPrimaryTensor:tile secondaryTensor:chunkMM name:nil] : chunkMM;
      }
      [row_chunks addObject:tile];
    }
    auto row = row_chunks.count > 1 ? [graph concatTensors:row_chunks dimension:1 name:nil] : row_chunks.firstObject;
    [rows addObject:row];
  }
  return rows.count > 1 ? [graph concatTensors:rows dimension:0 name:nil] : rows.firstObject;
```

One can always use metal MM by defining `PYTORCH_MPS_PREFER_METAL` environment variable
Fixes https://github.com/pytorch/pytorch/issues/116769
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117549
Approved by: https://github.com/kulinseth
2024-01-17 01:33:08 +00:00
f518cf811d [DCP] Adds support for meta tensor loading for DCP.load_state_dict() (#113319)
Currently, DCP requires the `model.state_dict()` to be materialized before passing it to DCP to load, since DCP uses the pre-allocated storage from the initialized model state_dict. Therefore, even for fine-tuning and distributed inference, users would need to explicitly materialize the model on GPU before `DCP.load_state_dict()`.

Today's flow:
```
with torch.device("meta"):
    model2 = parallelize_module(
        MLPModule("meta"), tp_mesh, parallelize_plan=parallelize_plan
    )

model2.to_empty(device='cuda')
state_dict_to_load = model2.state_dict()
DCP.load_state_dict(
    state_dict=state_dict_to_load,
    storage_reader=DCP.FileSystemReader(CHECKPOINT_DIR),
)
model2.load_state_dict(state_dict_to_load)
```

This PR adds support for meta tensor loading. In DCP's planner, when encountering tensors/DTensors on the meta device, we initialize the tensor/DTensor on the current device on the fly and replace the meta-device tensor/DTensor in the state_dict. After the change, users no longer need to manually call `model.to_empty()` when loading existing checkpoints for fine-tuning and distributed inference.

Updated user flow:
```
with torch.device("meta"):
    model2 = parallelize_module(
        MLPModule("meta"), tp_mesh, parallelize_plan=parallelize_plan
    )
# no longer need to call model.to_empty(device='cuda')
state_dict_to_load = model2.state_dict()
DCP.load_state_dict(
    state_dict=state_dict_to_load,
    storage_reader=DCP.FileSystemReader(CHECKPOINT_DIR),
)
model2.load_state_dict(state_dict_to_load, assign=True)
```

Note that for distributed training, it's still the users' responsibility to reset the parameters (`model.reset_parameters()`), as the checkpoint might not exist.

Note that we need to loop through the state_dict to replace the meta tensors/DTensors instead of calling `model.to_empty()`, since `DCP.load()` only takes in the state_dict, not the model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113319
Approved by: https://github.com/fegin, https://github.com/LucasLLC
2024-01-17 00:23:29 +00:00
4a44a3c76d update kineto submodule (#114297)
Rework roctracer shutdown flushing

9365c1aa09

This fixes flaky unit tests that use kineto to verify certain kernels have executed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114297
Approved by: https://github.com/malfet, https://github.com/atalman
2024-01-17 00:17:03 +00:00
cf470e7b59 Migrate update-commit-hash to test-infra (#117506)
After https://github.com/pytorch/test-infra/pull/4885, the GHA is now reusable on `test-infra`.  This tests the change and we can also land it after https://github.com/pytorch/test-infra/pull/4885 lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117506
Approved by: https://github.com/malfet, https://github.com/atalman
2024-01-17 00:15:04 +00:00
1d14adfa66 [mta] Fused SGD (#116585)
depends on #116583

rel:
- #94791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116585
Approved by: https://github.com/janeyx99
2024-01-16 23:54:38 +00:00
5aac95c713 Introduce slice_inverse() op (#117041)
Introduces a new op `slice_inverse()`. This is used in the reverse view_func for slice and several other ops (e.g. `split_with_sizes`, `chunk`). It's implemented behind the scenes by a call to `as_strided()`, but it's easier for subclasses to implement the more limited `slice_inverse()` than the full `as_strided()`. This PR:
* Introduces the op itself
* Updates all relevant functional inverses to call `slice_inverse()` instead of `as_strided()` directly
* Makes codegen changes to allow `slice_scatter()` to be the copy variant for `slice_inverse()`
    * Need to avoid view_copy codegen (assumes if view name ends in inverse, we don't need to gen one, which is possibly a bad assumption)
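For context, a small illustration of `slice_scatter`, which this PR designates as the copy variant of `slice_inverse()` (this uses the existing `torch.slice_scatter` API, not the new op):

```python
import torch

base = torch.zeros(6)
src = torch.ones(3)  # pretend these values came from base[1:4]
out = torch.slice_scatter(base, src, dim=0, start=1, end=4)
print(out)  # tensor([0., 1., 1., 1., 0., 0.])
```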

@albanD / @soulitzer / @bdhirsh: I'm most interested in your thoughts on the codegen changes and whether this is the right way to go.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117041
Approved by: https://github.com/bdhirsh
2024-01-16 23:44:54 +00:00
f6767244cf Added meta function for _upsample_bicubic2d_aa (#117347)
This should fix remaining errors with Resize op in torchvision: https://github.com/pytorch/vision/actions/runs/7298953575?pr=8127
```
/opt/conda/envs/ci/lib/python3.8/site-packages/torch/nn/functional.py:4072: in interpolate
    return torch._C._nn._upsample_bicubic2d_aa(input, output_size, align_corners, scale_factors)
E   torch._dynamo.exc.TorchRuntimeError: Failed running call_function <function interpolate at 0x7f4443fe00d0>(*(FakeTensor(..., size=(1, s0, s1, s2)),), **{'size': [s4, floor(s3*s4/floor(s1*s3/s2))], 'mode': 'bicubic', 'align_corners': False, 'antialias': True}):
E   aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:5567: SymIntArrayRef expected to contain only concrete integers
E
E   from user code:
E      File "/pytorch/vision/torchvision/transforms/v2/functional/_geometry.py", line 260, in resize_image
E       image = interpolate(
E
E   Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
E
E
E   You can suppress this exception and fall back to eager by setting:
E       import torch._dynamo
E       torch._dynamo.config.suppress_errors = True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117347
Approved by: https://github.com/peterbell10
2024-01-16 23:33:55 +00:00
b1c3f9f1b9 Fix missing mkl-dnn include paths (#117492)
Fixes #91968 and #100960
This commit fixes missing include paths by linking `caffe2_pybind11_state_gpu` against `caffe2::mkldnn`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117492
Approved by: https://github.com/ezyang
2024-01-16 23:28:17 +00:00
46a8408fa1 [codemod] markDynamoStrictTest batch 16 (#117218)
[codemod] markDynamoStrictTest test_public_bindings
[codemod] markDynamoStrictTest test_package
[codemod] markDynamoStrictTest test_legacy_vmap
[codemod] markDynamoStrictTest test_namedtensor
[codemod] markDynamoStrictTest test_fx
[codemod] markDynamoStrictTest test_dataloader
[codemod] markDynamoStrictTest test_content_store
[codemod] markDynamoStrictTest test_schema_check
[codemod] markDynamoStrictTest lazy/test_ts_opinfo
[codemod] markDynamoStrictTest functorch/test_vmap_registrations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117218
Approved by: https://github.com/bdhirsh, https://github.com/voznesenskym
ghstack dependencies: #117553
2024-01-16 23:04:31 +00:00
ab6207a342 Add inductor-specific testing strict mode denylist (#117553)
We have one for Dynamo that currently applies to all "compile"
configurations (PYTORCH_TEST_WITH_DYNAMO, PYTORCH_TEST_WITH_INDUCTOR). I
don't want to figure out the inductor situation right now, so we're
going to add another denylist for inductor and work through it later.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117553
Approved by: https://github.com/voznesenskym
2024-01-16 23:04:31 +00:00
5e0e78585d Ordering placeholder and get_attr nodes in unflattened module (#116910)
Prior to this PR, the generated unflattened module could mix up the order of `placeholder` and newly created `get_attr` nodes. As `placeholder` is the input of the function, it should be placed ahead of `get_attr` nodes.

Before:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
get_attr       bias         bias                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
placeholder    l_x_         l_x_                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

After:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
placeholder    l_x_         l_x_                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
get_attr       bias         bias                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116910
Approved by: https://github.com/tugsbayasgalan
2024-01-16 22:58:37 +00:00
4ec667cc64 Revert "[ONNX] Guard xfail tests with error messages (#117425)"
This reverts commit 1993956da33376f34125306209930ed00c486abd.

Reverted https://github.com/pytorch/pytorch/pull/117425 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing in trunk 1993956da3 ([comment](https://github.com/pytorch/pytorch/pull/117425#issuecomment-1894650769))
2024-01-16 22:56:35 +00:00
3a52147cc5 [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
2024-01-16 22:30:04 +00:00
2a3fb7dbb6 [ROCm] Fix NHWC related tests in test_inductor_freezing (#117158)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117158
Approved by: https://github.com/eellison, https://github.com/pruthvistony
2024-01-16 20:48:49 +00:00
4712c7dac8 [inductor] add C-shim for index_put (#116667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116667
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2024-01-16 20:29:14 +00:00
3e8c8ce37b Update Reviewers for PT-D team (#117409)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117409
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/fduwjj
2024-01-16 19:40:41 +00:00
1993956da3 [ONNX] Guard xfail tests with error messages (#117425)
Prior to this PR, xfail tests didn't guarantee (1) the error message/reason (**which could be outdated**) or (2) execution of the test (`xfail_if_model_type_is_not_exportedprogram`). Tests labeled xfail were therefore less robust, as we couldn't be sure they were still failing, let alone for the same reason. This PR fixes the issue with try/except plus error-message matching to consolidate the xfail truth and reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117425
Approved by: https://github.com/thiagocrepaldi
2024-01-16 19:33:51 +00:00
28be47c267 [RELAND][export] Exempt autograd ops for predispatch export (#117448)
Summary: Reland of https://github.com/pytorch/pytorch/pull/116527/files

Test Plan: CI

Differential Revision: D52675324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117448
Approved by: https://github.com/ydwu4
2024-01-16 19:32:15 +00:00
99e54744f7 Fix ExecuTorch pinned commit update failure (#117518)
https://github.com/pytorch/pytorch/pull/117003 shows an interesting failure in which building the ExecuTorch runner fails because it needs the change from https://github.com/pytorch/pytorch/pull/117378. This reveals a chicken-and-egg bug in the job setup: building the ExecuTorch runner depends on PyTorch and thus couldn't be part of the Docker image build, where PyTorch is not yet available. The failure happens because an outdated version of PyTorch is on the Docker image.

So, like vision and audio, the step to build ExecuTorch runner needs to be done during test time.

I also fix the installation of vision and audio in ET job because they are now installed using PyTorch pinned commits as usual after https://github.com/pytorch/executorch/pull/1247
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117518
Approved by: https://github.com/larryliu0820, https://github.com/malfet
2024-01-16 18:25:15 +00:00
c30346db0e Check in some torch.compile helper scripts (#117400)
- passrate.py: compute the pass rate
- update_failures.py: update `dynamo_test_failures.py`

Both of these scripts require you to download the test results from CI
locally. Maybe we can automate this more in the future. Checking these
in for now, with no tests :P.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117400
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117391
2024-01-16 17:14:43 +00:00
a7a2773567 Check invariants for dynamo_test_failures.py (#117391)
Test that:
- the xfail list and the skip list don't intersect
- the test names look sane
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117391
Approved by: https://github.com/voznesenskym
2024-01-16 17:14:43 +00:00
29516bd2a0 add _amp_foreach_non_finite_check_and_unscale_cpu_ and _amp_update_scale_cpu_ kernels on CPU (#109281)
Step 1 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109281
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-16 15:25:08 +00:00
0fa6ee44d9 [CI] Skip lib for xpu binary unit test (#117514)
Skip .so and .a libraries under build/bin/ for test_xpu_bin in CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117514
Approved by: https://github.com/malfet
2024-01-16 12:07:15 +00:00
13473df0d7 [MPS] Make addmm support empty matmul (#117223)
Refactor the common part between `mm_out_mps` and `addmm_out_mps` into a `do_mm` static function.
Change the input placeholder initialization logic so that `addmm` can handle matrix multiplication with an empty dimension.
Add tests for `mm`+`addmm` with empty tensors to OpInfo, but skip addmm with empty matrices in the onnx tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117223
Approved by: https://github.com/albanD
2024-01-16 06:46:20 +00:00
28bb31e4a5 [Dynamo] Trace autograd.function in dynamo when inputs require grad (#116358) (#116897)
For training graphs (when inputs require grad), previously we would speculate the forward and backward graphs to determine if there are any graph breaks, side effects, etc., but would not actually use these speculated graphs. We would just insert a call_function node into the graph and later rely on autograd's tracing.

This approach does not work for more generalized graphs, like graphs that include user-defined triton kernels, because autograd is not able to do the higher-order function conversion.

This PR speculates the forward and backward functions and emits them in a HOF that later gets used via a templating mechanism.

While working on this PR, I exposed some bugs in the current tracing due to trampoline functions losing source information, resulting in incorrect graphs being produced. I have fixed these source-information bugs and killed the trampolines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116897
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/voznesenskym
2024-01-16 03:57:13 +00:00
f20eaadfef [vision hash update] update the pinned vision hash (#117509)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117509
Approved by: https://github.com/pytorchbot
2024-01-16 03:17:24 +00:00
ae3d7091cb [BE] Replace deprecated set_default_tensor_type (#117505)
Not sure what it was doing there, but replaced it with `set_default_dtype`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117505
Approved by: https://github.com/Skylion007
2024-01-16 02:32:49 +00:00
dd2cff1591 [Dynamo] Use isinstance rather than istype when check if python module type (#117022)
This fixes an issue from a Meta internal use case, where the third-party ```DictConfig``` has a bug in [```__eq__```](fd730509ef/omegaconf/dictconfig.py (L596)) and triggers a Dynamo error because we were using an ```obj in [x, y]``` check. I found that we can use ```isinstance``` to cover all cases and remove these special cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117022
Approved by: https://github.com/ckluk2, https://github.com/jansel
2024-01-15 23:25:30 +00:00
bac0878780 Error if compiled nondeterministic backward called in deterministic mode (#114780)
Part of #113707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114780
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-01-15 22:45:40 +00:00
c1ab2777c0 Update state_dict.py to propagate cpu offload (#117453)
Update state_dict.py to propagate cpu offload. It looks like this flag is accidentally ignored?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117453
Approved by: https://github.com/Skylion007
2024-01-15 22:13:37 +00:00
1a57c18760 Fixed cuda grads for interpolate::trilinear on non-contig grad output (#117373)
Fixes #113642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117373
Approved by: https://github.com/lezcano
2024-01-15 18:05:47 +00:00
001585f446 [fx][inductor] Add statically_known_true utility for SymBool (#117359)
This adds a function `statically_known_true` for `SymBool` that works
like inductor's `is_expr_static_and_true`. That is, it tries to simplify the
expression to a constant or returns `False` if it cannot be simplified.

This is useful in cases that can be optimized if the condition is met; otherwise it doesn't affect correctness, so we can avoid adding guards.
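A hedged usage sketch (the import path is an assumption about where the helper lives, and the pass logic is illustrative only):

```python
from torch.fx.experimental.symbolic_shapes import statically_known_true

def can_skip_pad(x, target_len):
    # True only when the last dimension is provably already target_len;
    # if the expression cannot be simplified, fall back to False rather
    # than installing a new guard.
    return statically_known_true(x.shape[-1] == target_len)
```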

I also use this new function in inductor for `FakeTensorUpdater` and
`remove_noop_pass` which both generated unexpected guards previously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117359
Approved by: https://github.com/lezcano
2024-01-15 18:01:10 +00:00
661747c727 XPU, move oidc to top level workflow and use gha_workflow_s3_and_ecr_read_only policy (#117498)
1. OIDC permissions need to be set on the top-level workflow
2. rename gha_workflow_s3_and_ecr_read_only to gha_workflow_s3_and_ecr_read_only policy which better reflects the policy usage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117498
Approved by: https://github.com/chuanqi129, https://github.com/huydhn
2024-01-15 17:46:20 +00:00
7a8013fbfa [inductor] Handle more edge cases in slice and slice_scatter (#117377)
Fixes #117110

When slicing, we can end up with a start and end which are out of bounds; Python
slicing handles this by clamping to the correct bounds. There is also the case
where end < start, which should result in an empty slice.

In the isoneutral_mixing failure we have the second case, with `start=2, end=0`,
which in `slice_scatter` became `src_size[dim] = -2`.

This PR improves slice's edge case handling and factors the start and end
normalization code out so it can be shared with slice_scatter.
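A minimal sketch of the normalization being described, mirroring Python slicing semantics (not the inductor code itself):

```python
def normalize_slice(start, end, dim_size):
    # Negative indices count from the end, out-of-range bounds are clamped,
    # and end < start collapses to an empty slice.
    if start < 0:
        start += dim_size
    if end < 0:
        end += dim_size
    start = min(max(start, 0), dim_size)
    end = min(max(end, 0), dim_size)
    end = max(end, start)
    return start, end

assert normalize_slice(2, 0, 4) == (2, 2)    # the isoneutral_mixing case
assert normalize_slice(0, 100, 4) == (0, 4)  # out-of-bounds end is clamped
```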

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117377
Approved by: https://github.com/lezcano
2024-01-15 17:05:48 +00:00
5c700f60a5 Properly preserve SymInt input invariant when splitting graphs (#117406)
Fixes https://github.com/pytorch/pytorch/issues/111636
Fixes https://github.com/pytorch/pytorch/issues/108877
Fixes https://github.com/pytorch/pytorch/issues/116956

Inductor has an invariant that every dynamic shape symbol s0, s1, etc. which is referenced by an input tensor must also be passed in explicitly as an argument. It has some capability of reverse engineering symbols if it's obvious how to get them (e.g., if you pass in `arg: f32[s0, 4]` it will know that it can retrieve `s0 = arg.size(0)`) but in full generality it is not always possible to derive this (e.g., if the only mention of s0 is in `arg2: f32[s0 + s1, 4]`).  However, the graph splitter used by optimize_ddp did not respect this invariant. This PR makes it respect it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117406
Approved by: https://github.com/wconstab
2024-01-15 15:04:57 +00:00
75818adcf7 Pyi doc inclusion + fix (#117267)
Reland of https://github.com/pytorch/pytorch/pull/114705 with extra fix to smoothly handle when the modules we're trying to load are not available (and thus the pyi won't contain the docs in this case).

Tested locally that it works properly in fbcode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117267
Approved by: https://github.com/ezyang
2024-01-15 13:06:53 +00:00
7a851fedc8 support torch.mm with conjugate transposed inputs (#117238)
Fix https://github.com/pytorch/pytorch/issues/116855.
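A small illustrative check (a sketch, not the PR's actual test) of the case this enables:

```python
import torch

a = torch.randn(3, 4, dtype=torch.complex64)
b = torch.randn(3, 5, dtype=torch.complex64)
out = torch.mm(a.conj().T, b)                 # lazily conjugate-transposed input
ref = torch.mm(a.conj().resolve_conj().T, b)  # materialized reference
torch.testing.assert_close(out, ref)
```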

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117238
Approved by: https://github.com/lezcano
2024-01-15 12:36:01 +00:00
41ffea2f99 Properly unwrap_storage tensors sent to DynamicScalar (#117444)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117444
Approved by: https://github.com/Skylion007
2024-01-15 12:15:04 +00:00
d9b265adaf modify the conditions as PythonModuleVariable (#116856)
## Motivation
The current code `value in [torch.backends.cudnn, torch.ops]` requires `value` to implement `__eq__`. If the value is a custom object that does not implement `__eq__`, dynamo will throw an error. For example, for ConvolutionOpContext, the custom 'torch._C.ScriptClass' object registered in IPEX, dynamo throws the following error:

**torch._dynamo.exc.InternalTorchDynamoError: '__eq__' is not implemented for __torch__.torch.classes.ipex_prepack.ConvolutionOpContext**

I think this is a common issue. To avoid it, this PR replaces the current code `value in [torch.backends.cudnn, torch.ops]` with `isinstance(value, (torch.backends.cudnn.CudnnModule, torch._ops._Ops))`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116856
Approved by: https://github.com/jansel
2024-01-15 11:10:57 +00:00
d089bb1b72 [xla hash update] update the pinned xla hash (#117485)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117485
Approved by: https://github.com/pytorchbot
2024-01-15 10:33:18 +00:00
2b56d80460 [inductor][cpp] apply simplify_index_in_vec_range to vector store and vector transpose (#117263)
As the title, this PR extends the `simplify_index_in_vec_range` to store and transpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117263
Approved by: https://github.com/jansel
ghstack dependencies: #117221, #117260
2024-01-15 08:41:28 +00:00
3b00dd5843 [inductor][cpp] apply simplify_index_in_vec_range in select_tiling_indices to enable more contiguous vec load (#117260)
For the one of the kernels in the UT `test_vec_contiguous_ModularIndexing`:
Before:
```c++
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(28L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()})
                        #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()})
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 =
                            [&]
                            {
                                __at_align__ std::array<float, 16> tmpbuf;
                                #pragma GCC unroll 16
                                for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                {
                                    tmpbuf[x1_inner] = in_ptr0[static_cast<long>((128L*(c10::div_floor_integer(x2, 256L))) + (256L*x1) + (256L*x1_inner) + (7168L*(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336L*x0) + (static_cast<long>(x2) % static_cast<long>(128L)))];
                                }
                                return at::vec::Vectorized<float>::loadu(tmpbuf.data());
                            }
                            ()
                            ;
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
                        }
                        tmp_acc0_vec.mean.store(out_ptr0 + static_cast<long>(x1 + (28L*x0)));
                        tmp_acc0_vec.m2.store(out_ptr1 + static_cast<long>(x1 + (28L*x0)));
                    }
                }
                #pragma omp simd simdlen(8)
                for(long x1=static_cast<long>(16L); x1<static_cast<long>(28L); x1+=static_cast<long>(1L))
                {
                    {
                        #pragma omp declare reduction(    welford:Welford<float>:    omp_out = welford_combine(omp_out, omp_in))     initializer(omp_priv={Welford<float>()})
                        Welford<float> tmp_acc0 = Welford<float>();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr0[static_cast<long>((128L*(c10::div_floor_integer(x2, 256L))) + (256L*x1) + (7168L*(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336L*x0) + (static_cast<long>(x2) % static_cast<long>(128L)))];
                            tmp_acc0 = welford_combine(tmp_acc0, tmp0);
                        }
                        out_ptr0[static_cast<long>(x1 + (28L*x0))] = tmp_acc0.mean;
                        out_ptr1[static_cast<long>(x1 + (28L*x0))] = tmp_acc0.m2;
                    }
                }
```

After:
```c++
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(28L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(28L); x1+=static_cast<long>(1L))
                {
                    {
                        #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()})
                        #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()})
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>((128L*(c10::div_floor_integer(x2, 256L))) + (256L*x1) + (7168L*(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336L*x0) + (static_cast<long>(x2) % static_cast<long>(128L))));
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (28L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (28L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
```

This PR also further speeds up the model `swin_base_patch4_window7_224` from 1.25x to 1.28x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117260
Approved by: https://github.com/jansel
ghstack dependencies: #117221
2024-01-15 06:57:25 +00:00
3a0bcd2c12 [audio hash update] update the pinned audio hash (#117423)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117423
Approved by: https://github.com/pytorchbot
2024-01-15 05:50:51 +00:00
19502ff6aa Fixed typo in build_activation_images.py (#117458)
In line 24 of build_activation_images.py, I changed "programmaticly" to "programmatically" to be grammatically correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117458
Approved by: https://github.com/malfet
2024-01-15 03:27:40 +00:00
03c6f79548 [vision hash update] update the pinned vision hash (#117311)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117311
Approved by: https://github.com/pytorchbot
2024-01-15 03:15:20 +00:00
2200118f59 Enable some uint{16,32,64} tests that are working (#116809)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116809
Approved by: https://github.com/albanD
2024-01-15 02:25:21 +00:00
a298fba146 [MPS] Increase metal language support to 2.3 (#117472)
Conda binaries are still built on macOS 12, which renders MPS unusable after https://github.com/pytorch/pytorch/pull/116942.

Test plan:
```
 % xcrun -sdk macosx metal --std=macos-metal2.3 -Wall -o Index Index.metal
 % xcrun -sdk macosx metal --std=macos-metal2.2 -Wall -o Index Index.metal
Index.metal:167:1: error: type 'const constant ulong3 *' is not valid for attribute 'buffer'
REGISTER_INDEX_OP_ALL_DTYPES(select);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Index.metal:159:5: note: expanded from macro 'REGISTER_INDEX_OP_ALL_DTYPES'
    REGISTER_INDEX_OP(8bit,  idx64, char,  INDEX_OP_TYPE, ulong3);    \
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
```

Fixes https://github.com/pytorch/pytorch/issues/117465

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117472
Approved by: https://github.com/xuzhao9
2024-01-15 01:16:52 +00:00
61a181e83c Report function name in stack trace annotations (#117459)
When working with internal flows, it can sometimes be ambiguous which
version of the code you are working with. In this case, having the
function name available in the stack trace can help identify what you
are looking at.

Example now looks like:

```
[DEBUG]         # File: /data/users/ezyang/a/pytorch/a.py:5 in f, code: return x + x
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117459
Approved by: https://github.com/Skylion007
2024-01-15 00:29:13 +00:00
a6d33614d6 add float8 types to dtypes table (#117375)
Summary:

As titled

Test Plan:

CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117375
Approved by: https://github.com/ezyang
2024-01-15 00:23:07 +00:00
c3e2b94827 Realize non-ReinterpretView Views in custom Triton kernel args (#117468)
Summary: If any of the `TensorBox` arguments of a custom (user-written) Triton kernel in the graph is wrapped into a `BaseView` subclass which is not `ReinterpretView`, this currently conflicts with the cloning (which preserves RVs) and downstream processing (which needs a layout to mark mutation) of the input.

This PR adds conversion of the non-RV views to `ReinterpretView`s by realizing the corresponding inputs to the Triton kernel. As realization happens anyway before the Triton kernel call, this should not affect the perf. But it covers currently missed patterns in the internal models (see the unit test for a repro).

Test Plan:

```
$ python test/dynamo/test_triton_kernels.py -k test_triton_kernel_slice_and_view_input
...
----------------------------------------------------------------------
Ran 1 test in 3.909s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117468
Approved by: https://github.com/oulgen
2024-01-14 23:31:38 +00:00
62496ffd0d [dynamo][easy]: Add support for operator.truth (#117463)
* This is an old builtin function equivalent to the bool constructor, so it is easy enough to add support for; see the short sketch below.
* I also realized the tests were in the wrong class (the one reserved for testing default args), so I moved them.
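
A tiny illustrative sketch (plain Python, not the dynamo implementation) of the equivalence the first bullet relies on:

```python
import operator

# operator.truth(x) returns the same result as bool(x)
print(operator.truth([]), bool([]))   # False False
print(operator.truth(3), bool(3))     # True True
```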

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117463
Approved by: https://github.com/jansel
2024-01-14 19:08:31 +00:00
2748f05056 Add torch.fx.interpreter to uninteresting_files (#117460)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117460
Approved by: https://github.com/Skylion007
2024-01-14 18:35:21 +00:00
a1155883d4 Clean up Docker config on ROCm runner (#117432)
This fixes the issues on trunk where logging in to ECR on the ROCm runner fails. During my test, it was also OK for the login part to fail with that `not implemented` error https://github.com/pytorch/pytorch/actions/runs/7516446579/job/20461801473, since pulling the image from ECR still works, so I set `continue-on-error: true` on the step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117432
Approved by: https://github.com/malfet
2024-01-14 18:27:09 +00:00
a76610e6fb [BE] Delete unused is_dynamo_compiling (#117455)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117455
Approved by: https://github.com/Skylion007, https://github.com/yanboliang
ghstack dependencies: #117451, #117452, #117454
2024-01-14 15:15:29 +00:00
347255809c Make c10::SymInt typecaster support scalar-like fake tensor (#117454)
Using `__index__` for this conversion would trigger a
guard on a data-dependent SymInt if the tensor is a fake tensor, but if
we fetch the item directly and put it in the Scalar, we may still be able to
make it work out.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117454
Approved by: https://github.com/yanboliang
ghstack dependencies: #117451, #117452
2024-01-14 15:15:29 +00:00
796fe40a96 [BE] Delete unnecessary variable fastpath (#117452)
This fastpath is unnecessary because in the logic below we
do the same thing:

```
        auto& var = THPVariable_Unpack(obj);
        if (var.numel() != 1 ||
            !at::isIntegralType(
                var.dtype().toScalarType(), /*include_bool*/ true)) {
          throw_intlist_exception(this, i, obj, idx);
        }
        auto scalar = var.item();
        TORCH_CHECK(scalar.isIntegral(/*include bool*/ false));
        res.push_back(scalar.toSymInt());
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117452
Approved by: https://github.com/yanboliang
ghstack dependencies: #117451
2024-01-14 14:39:46 +00:00
220cf46c2a Always accept 0-d scalar tensors as int, even if __index__ fails (#117451)
Fixes https://github.com/pytorch/pytorch/issues/117288

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117451
Approved by: https://github.com/yanboliang
2024-01-14 14:39:46 +00:00
38c18f3825 [c10d] Add a timeout check interval variable for timeout dump (#117093)
The current timeout check frequency relies on the monitoring thread's timeout, which can be too long (even if we set it to 2 minutes), so let's use a separate timeout variable that users can configure. We also only let the default PG check TCPStore, so an even more frequent check should be fine. (Our stress test is performed every half second.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117093
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-01-14 02:33:17 +00:00
003c900d5e Add _assert_scalar (#117378)
Peeled off from https://github.com/pytorch/pytorch/pull/114148, because that PR is going to take a while to actually land.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117378
Approved by: https://github.com/jansel
2024-01-14 00:50:36 +00:00
1a8545164a [export] Add unit test for SDPA export result (#117390)
Summary:

A follow-up for #117097. In that PR I didn't add
`_scaled_dot_product_attention_for_cpu` into the core_aten_decomposition
table. This PR does that and also adds a unit test.

Test Plan: python test/export/test_export.py -k
test_scaled_dot_product_attention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117390
Approved by: https://github.com/drisspg
2024-01-14 00:21:28 +00:00
bf27dd6df9 Add dynamo support for operator.abs (#117442)
Adds a test case for operator.abs and allows constant folding with it. Partially addresses #116396.
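
A hedged sketch of what the added support enables (the function and inputs are made up for illustration):

```python
import operator
import torch

@torch.compile
def scale(x):
    # operator.abs on a Python constant can be folded away at trace time
    return x * operator.abs(-2)

print(scale(torch.ones(3)))  # tensor([2., 2., 2.])
```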

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117442
Approved by: https://github.com/jansel, https://github.com/malfet
2024-01-13 21:38:55 +00:00
1a790f5a61 [RELAND] Error grad mode op in export API (#117420)
Summary: Title

Test Plan: CI

Differential Revision: D52706691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117420
Approved by: https://github.com/angelayi
2024-01-13 21:36:29 +00:00
d6847c5977 [CI] Set correct permissions for auto_request_review (#117408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117408
Approved by: https://github.com/izaitsevfb, https://github.com/atalman
2024-01-13 20:02:03 +00:00
53f3361319 [BE] Use nested namespaces for sparse (#117415)
C++17 is fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117415
Approved by: https://github.com/Skylion007
2024-01-13 19:51:28 +00:00
d8bdb50379 [reland] pass shape/stride during tensor unflatten (#117340)
Reland of https://github.com/pytorch/pytorch/pull/113547, as the previous
PR was reverted because of a torch.compile symbolic shape issue. Since we have now disabled tensor
unflatten with dynamo.disable, we should not hit this issue again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117340
Approved by: https://github.com/Skylion007
ghstack dependencies: #117336
2024-01-13 19:33:47 +00:00
eebf115686 [fsdp][2d] FSDP sync module states handle tensor subclass (#117336)
This PR adds the ability for the FSDP sync module states kwarg to handle
tensor subclasses. Because FSDP works on the "dp" mesh dimension, as long
as FSDP works on a different device mesh dimension, we can safely let
FSDP just broadcast the DTensor local shards.

fixes https://github.com/pytorch/pytorch/issues/117126

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117336
Approved by: https://github.com/awgu
2024-01-13 19:33:47 +00:00
fc044b5cdb [pt-vulkan] Add build time flag to control descriptor pool sizes (#117398)
Summary:
## Context

When running large models with a lot of operators, the default descriptor pool allocated by the Vulkan compute API may run out of descriptor sets. This changeset introduces the `VULKAN_DESCRIPTOR_POOL_SIZE` build variable (which defaults to `1024u`), which allows a larger descriptor pool to be allocated if necessary.

## Notes for Reviewers

This is a simple stopgap solution until we have bandwidth to implement the more general solution, which would be to modify the `DescriptorPool` class defined in `api/Descriptor.[h,cpp]` to automatically allocate a new descriptor pool when memory runs out. However, I would consider this change to be low priority since with a delegate/graph mode of execution, the descriptor pool can often be allocated to exactly fit a model's requirements.

Test Plan:
There should be no functional changes under default build settings. Run `vulkan_api_test` to make sure everything works as before; CI should test for that as well.

```
# On devserver
LD_LIBRARY_PATH=/home/ssjia/Github/swiftshader_prebuilt/swiftshader/build/bin/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*"
```

Reviewed By: yipjustin, jorgep31415

Differential Revision: D52742140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117398
Approved by: https://github.com/yipjustin
2024-01-13 13:11:00 +00:00
2c8975387d [Optimus] fix batch layernorm numerical issue (#117404)
Summary:
Fix the numerical issue with addcmul.

Found that torch.addcmul generates different values from torch.add + torch.mul under a 32-bit check. Mini repro: N4823658

Change addcmul to torch.add + torch.mul
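
A hedged repro sketch of the kind of check involved (random inputs chosen here for illustration, not the Optimus pass itself): it compares the fused and unfused formulations elementwise.

```python
import torch

a, b, c = (torch.randn(1024, dtype=torch.float32) for _ in range(3))
fused = torch.addcmul(a, b, c)             # a + b * c in one fused call
unfused = torch.add(a, torch.mul(b, c))    # the replacement formulation
print("max abs diff:", (fused - unfused).abs().max().item())
```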

Test Plan:
buck test

before change
```
the diff index is:  0
the diff index is:  1
the diff index is:  6
```

after the change, numerics are on par

Differential Revision: D52745671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117404
Approved by: https://github.com/mengluy0125
2024-01-13 10:04:12 +00:00
f008efa8e7 Reconstruct streams via global registration, temporary impl to unblock FSDP (#117386)
This is a placeholder implementation for reconstructing streams via global storage to unblock FSDP, pending proper stream support design

This PR does a few things:

1) Fixes registration for devices with indices. We were only supporting "cuda"; we now support "cuda:k" interfaces, where k is the GPU index.

2) Changes the stream objects in dynamo to take devices as device types, instead of strings, and updates the string based device APIs to gracefully take device types.

3) Introduces a reconstruct-by-global (using existing cleanup hook structures) to streams as a placeholder impl for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117386
Approved by: https://github.com/jansel
2024-01-13 07:03:33 +00:00
ef3217d9f7 [PyTorch] Mark USDT probes as noinline to avoid duplications in ThinLTO mode (#117381)
Differential Revision: D52710343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117381
Approved by: https://github.com/chaekit
2024-01-13 06:18:01 +00:00
302f931c25 Update Reviewers for PyTorch Distributed team (#116231)
Update merge rule approver list under 'Distributed' section based on current PyTorch distributed team composition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116231
Approved by: https://github.com/fduwjj, https://github.com/XilunWu
2024-01-13 05:07:13 +00:00
96163eb010 Switch nightly binaries to oidc. Remove aws keys (#117416)
This should fix all wheel nightly upload failures:
https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=upload
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117416
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-01-13 03:24:13 +00:00
22ddf91dbb [torch][fx] more strong typed codegen for partial specialized code on boolean (#117201)
Summary:
* In some fx partially specialized codegen via `concrete_args` on boolean arguments, we extend support so the GraphModule can further be used on a strongly typed runtime like TorchScript.
* This diff fixes the type annotation for booleans only and preserves the argument mapping for leaf pytree nodes.

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:fx -- --exact 'caffe2/test:fx - test_partial_trace (test_fx.TestFX)'

Differential Revision: D52667883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117201
Approved by: https://github.com/houseroad
2024-01-13 03:10:02 +00:00
2bc7da1ab7 [HigherOrderOp] change signature of map_impl (#117161)
Summary:
X-link: https://github.com/pytorch/executorch/pull/1580

This PR changes the schema of map_impl from map_impl(f, num_mapped, *operands) to map_impl(f, mapped_args: Tuple, moperands: Tuple). This is to prepare for turning on dynamo for eager mode map, where we want to get rid of the num_mapped scalar.

Test Plan: Existing tests.

Differential Revision: D52495413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117161
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2024-01-13 02:50:46 +00:00
f2f47c6848 [dynamo] realize LazyVT's in DICT_MERGE (#117282)
Fixes https://github.com/pytorch/pytorch/issues/115029.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117282
Approved by: https://github.com/jansel, https://github.com/mlazos
2024-01-13 01:50:39 +00:00
3e397cefc5 Add uint1 to uint7 dtypes (#117208)
Summary:
These dtypes are added since we see more demand for these sub-byte dtypes, especially with
the popularity of LLMs (https://pytorch.org/blog/accelerating-generative-ai-2/#step-4-reducing-the-size-of-the-weights-even-more-with-int4-quantization-and-gptq-2021-toks)

Note that these are just placeholders; operator support for these dtypes will be implemented with tensor subclasses.
e.g. torch.empty(..., dtype=torch.uint1) will return a tensor subclass of uint1 that supports different operations like bitwise ops, add, mul etc. (will be added later)

Also note that these are not quantized data types; we'll implement quantization logic with tensor subclasses backed by these dtypes as well.
e.g. `Int4GroupedQuantization(torch.Tensor)` will be implemented with torch.uint4 Tensors (see https://github.com/pytorch-labs/ao/pull/13 as an example)
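
A small sketch of what is and is not expected to work at this stage (hedged: it assumes a build that already includes these placeholder dtypes):

```python
import torch

# The dtype objects themselves exist as placeholders...
print(torch.uint1, torch.uint4, torch.uint7)

# ...but rich operator support is expected to come from tensor subclasses later,
# so most ops on plain tensors of these dtypes may not be implemented yet.
```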

Test Plan:
CIs
python test/test_quantization.py -k test_uint1_7_dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117208
Approved by: https://github.com/ezyang
2024-01-13 01:09:23 +00:00
52575eb1bb The permission id-token write needs to be set on rocm-test callers (#117422)
All these workflows lack the necessary permission to run `_rocm-test` job after https://github.com/pytorch/pytorch/pull/117160, for example https://github.com/pytorch/pytorch/actions/runs/7508520071

### Testing

Confirm that trunk is back https://github.com/pytorch/pytorch/actions/runs/7508830196.  Other workflows would be the same, i.e. rocm https://github.com/pytorch/pytorch/actions/runs/7508830137/job/20444989127.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117422
Approved by: https://github.com/atalman
2024-01-13 00:27:46 +00:00
9746f36e50 [export] Minor fixes to serialization (#117374)
* Checks that the input to torch.export.save is an ExportedProgram (https://github.com/pytorch/pytorch/issues/116952)
* Fixes naming for serialized state dict from `serialized_state_dict.json` to `serialized_state_dict.pt` (https://github.com/pytorch/pytorch/issues/116949)
* Moves some tests to be expectFailure rather than blocklisted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117374
Approved by: https://github.com/ydwu4
2024-01-13 00:23:06 +00:00
7f1f0b1135 [C10D] Add duration_ms to flight recorder (#114817)
Measures the duration of a collective operation using nccl start/end
events and includes this duration (in ms) in the flight recorder data.

duration_ms will be an optional field, since it only works when
timing is enabled.  Currently timing is enabled when flight recorder
is enabled, but this is not a strict requirement.  Duration is also
not available for collectives not in a completed state.

Note: computing duration can lead to a hang due to calling cudaEventDuration when
the cuda driver queue is full.

We don't ever want dump() api to hang, since we might want dump to help
debug a hang. Hence, we only query durations from the watchdog thread,
and it's possible during dump() call, some of the most recent
collectives durations won't have been computed yet at time of dump.  We
make this tradeoff to ensure that dump() itself will never hang.
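
For intuition, a hedged userland sketch of measuring a duration in milliseconds with CUDA events, the same primitive the recorded duration_ms is based on (this is not the flight-recorder code itself):

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
torch.ones(1 << 20, device="cuda").sum()   # stand-in for the collective's kernel work
end.record()

torch.cuda.synchronize()                   # elapsed_time requires both events to have completed
print("duration_ms:", start.elapsed_time(end))
```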

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114817
Approved by: https://github.com/fduwjj, https://github.com/zdevito
ghstack dependencies: #116905
2024-01-12 23:34:11 +00:00
7a7535283f Some basic support for uint{16,32,64} codegen in CPU inductor (#116810)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116810
Approved by: https://github.com/chenyang78, https://github.com/eellison, https://github.com/desertfire
2024-01-12 23:13:28 +00:00
4b25948ee6 Torchbench Dynamo Runner: Enable DDP for perf test and traces (#113332)
- Removes an outdated assert that prevented perf tests from running DDP; we now have single-node --multiprocess, and perf tests already wrap the model using `deepcopy_and_maybe_ddp`
- Appends the rank name to traces to avoid all ranks trying to create the same file
- Renames `deepcopy_and_maybe_ddp` to `deepcopy_and_maybe_parallelize` to include FSDP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113332
Approved by: https://github.com/H-Huang, https://github.com/wconstab
2024-01-12 22:41:09 +00:00
c329eddcb9 Migrate the rest of state_dict testing to OptimizerInfo (#117186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117186
Approved by: https://github.com/albanD
ghstack dependencies: #116509
2024-01-12 22:32:37 +00:00
bcf1f312a0 Migrate nontensor step and CUDA params state_dict tests to OptimizerInfo (#116509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116509
Approved by: https://github.com/albanD
2024-01-12 22:32:37 +00:00
7b753cc7b8 Skip some slow tests (under Dynamo) (#117389)
Otherwise these may cause timeouts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117389
Approved by: https://github.com/jerryzh168, https://github.com/voznesenskym
ghstack dependencies: #117318, #117320
2024-01-12 22:18:07 +00:00
d73846689d Rename test_legacy_vmap.py TestCase names (#117320)
The problem is that the dynamo_test_failures logic recognizes tests by
their TestClass.test_name. Unfortunately we have duplicate
TestClass.test_name in test_legacy_vmap and test_vmap. This PR
unduplicates them.

Something more robust would have been to include the test file name in
the dynamo_test_failures logic, but... it's a bit too late for that. We
can fix it if it becomes more of a problem in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117320
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117318
2024-01-12 22:18:07 +00:00
06576d859d Stop running ModuleInfo tests under Dynamo (#117318)
This is a policy decision, similar to the OpInfo one. The problem is
that they just take too long to run when we reset() before and after
each.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117318
Approved by: https://github.com/voznesenskym
2024-01-12 22:17:59 +00:00
fbd9bccb75 [C10D](reland) Add GIL checker to NCCL watchdog monitor (#117312)
Whenever the monitor thread kills the watchdog thread for being stuck, we do so to save cluster time and get a faster failure signal, but we want to know more about why it got stuck.

One possible reason for the watchdog getting stuck is GIL contention, which can be ruled out or observed by attempting to acquire the GIL at exit time.

If we cannot acquire the GIL within a short time window (1s) we abort the attempt and report GIL contention, otherwise we report that GIL was acquired successfully.

Reland: uses a function pointer to avoid destructor ordering issues on dlclose. (It looks like the destructor for the std::function was being run after the libtorchpython lib was unloaded, leading to a crash.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117312
Approved by: https://github.com/zdevito
2024-01-12 21:48:45 +00:00
7b0926cc3e Fix wrong class inheritance in pyi (#116404)
As the title states.

f6dfbffb3b/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L153)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116404
Approved by: https://github.com/ezyang, https://github.com/wconstab
2024-01-12 21:25:29 +00:00
c167c34396 Skip unsupported tests on arm (#117344)
Add skips to tests that involve record_context_cpp on ARM, as it is only supported on the Linux x86_64 arch. The error is reported as below:
```
Traceback (most recent call last):
  File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2674, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 3481, in test_direct_traceback
    c = gather_traceback(True, True, True)
RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117344
Approved by: https://github.com/malfet, https://github.com/drisspg
2024-01-12 21:12:11 +00:00
384c4885fa [ProcessGroup] Do not print NCCL_DEBUG before NCCL init (#117328)
In case /etc/nccl.conf is used, `NCCL_DEBUG` is not set to sys env until NCCL inits.
The deleted print point is before NCCL inits, hence may be inaccurate.
This PR removes it and relies on the other print point which is after NCCL comm creation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117328
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-01-12 20:46:50 +00:00
18bd5c05bc FFT: Handle noop fftn calls gracefully (#117368)
Fixes #117252
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117368
Approved by: https://github.com/malfet
2024-01-12 20:16:50 +00:00
5cf481d1ac [CI] Explicitly specify read-all permissions on the token (#117290)
Would be nice to have it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117290
Approved by: https://github.com/seemethere, https://github.com/osalpekar, https://github.com/huydhn, https://github.com/atalman
2024-01-12 19:15:54 +00:00
013a59acbd Update BCEWithLogitsLoss documentation regarding pos_weight (#117046)
Added clarification for the example provided for the pos_weight parameter in the BCEWithLogitsLoss class, particularly in the multi-label binary classification context. This enhancement addresses potential misunderstandings about the application of 'binary' classification, which typically implies two classes, to scenarios involving multiple classes.
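
An illustrative usage sketch (the shapes and weight values below are made up): `pos_weight` holds one weight per class along the last dimension and scales the positive term of the loss for that class.

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 3)                    # batch of 8 samples, 3 independent labels
targets = torch.randint(0, 2, (8, 3)).float()
pos_weight = torch.tensor([1.0, 2.0, 0.5])    # hypothetical per-class positive weights

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
print(criterion(logits, targets))
```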

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117046
Approved by: https://github.com/mikaylagawarecki
2024-01-12 18:26:25 +00:00
e54b40e5eb [dynamo] GetItemSource - restrict the supported index Source to be GlobalWeakRefSource (#117138)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117138
Approved by: https://github.com/jansel, https://github.com/mlazos
2024-01-12 18:21:14 +00:00
657545dbdd Migrate rocm test to using oidc (#117160)
Similar to Intel XPU, let's use OIDC for ROCm runners.

Refer to this PR: https://github.com/pytorch/pytorch/pull/116554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117160
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-01-12 17:57:26 +00:00
cb42bc705b Make auto_functionalized HOP fallback in inductor (#117084)
It looks like the inductor fallback previously worked with HOPs but no longer
does, so I fixed that:
- all HOPs are exposed under torch.ops.higher_order, so I changed how
  inductor looks them up
- the inductor fallback assumed that an operator's signature was (*args,
  **kwargs). This is true for all the OpOverloads but not HOPs. I
  rewrote the code to not rely on this.

Test Plan:
- existing tests
- new test for auto_functionalized HOP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117084
Approved by: https://github.com/williamwen42
2024-01-12 17:57:01 +00:00
a97d00cca5 [Nested Tensor]Support SDPA math fallback for jagged layout nested tensor (#116445)
Support this fallback by converting the jagged layout NT to a strided layout NT, and then converting the result back to a jagged layout NT.
This fallback might not be efficient since it uses unbind, contiguous and split.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116445
Approved by: https://github.com/soulitzer
2024-01-12 17:30:40 +00:00
21d370819b [CI] Set permissions for stale workflow (#117371)
Hopefully this fixes the failures one observes in HUD, as default permissions for the repo were changed to read-only.
<img width="232" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/4047472c-ca3c-4288-add7-97f0ce43106a">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117371
Approved by: https://github.com/clee2000
2024-01-12 16:44:15 +00:00
172dd13ecf [inductor][cpp] improve vector contiguous checks for FloorDiv and ModularIndexing (#117221)
Fix https://github.com/pytorch/pytorch/issues/114488

The PR tries to enable contiguous vector loads for cases where we can reduce `FloorDiv` and `ModularIndexing` in the vectorized loop.

Take the index expression in test case `test_vec_contiguous_ModularIndexing` for example.
`14336*x0 + 256*x1 + 128*((x2//256)) + ModularIndexing(x2, 1, 128) + 7168*ModularIndexing(x2, 128, 2)` can be reduced to `14336*x0 + 256*x1 + x2 + 128*x2_div_c0 + 7168*x2_mod_c0 + x2_mod_c1` where `x2` is a vectorized loop variable and the vector length is 16. This means we can do vectorized load for this index. Check the code comment for more details:
https://github.com/pytorch/pytorch/pull/117221/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R317-R329
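
A small numeric sketch of why the reduction holds inside the vectorized range (the constants are made up, and this is not the inductor code): when the vector length divides the modulus and the range start is aligned, FloorDiv is constant and ModularIndexing is linear over the lanes, so the access is contiguous.

```python
base, veclen, mod = 32, 16, 128        # hypothetical aligned start, vector length, modulus
for lane in range(veclen):
    x2 = base + lane
    assert x2 // mod == base // mod            # FloorDiv: constant over the vector range
    assert x2 % mod == base % mod + lane       # ModularIndexing: linear in the lane offset
print("index is affine in the lane offset, so a contiguous vector load is legal")
```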

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117221
Approved by: https://github.com/jansel
2024-01-12 15:20:36 +00:00
6c624aad37 [CPU] Disable floating-point contraction when compiling (#116318)
Fixes #100775.

For the CPU inductor path, disable -ffp-contract (e.g., FMA contraction) in the optimization flags to fix functional issues.

### Validation
Validation on 3 benchmark suites.

- [x] FP32: Negligible geomean change; No outlier models.

<img width="582" alt="image" src="https://github.com/pytorch/pytorch/assets/23010269/7c14a8b8-eb6c-4794-bff9-2e1ae3a22781">

- [x] BF16: Negligible geomean change; No outlier models.

<img width="589" alt="image" src="https://github.com/pytorch/pytorch/assets/23010269/cf558737-8cb2-411f-8761-27b9f8fc43af">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116318
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-01-12 14:09:05 +00:00
6ebb26d572 Fail Conv Binary Inplace check when act and accum are same tensor (#117331)
**Summary**
When a tensor is used both as the activation of the conv and as the extra input of the binary add node, we shouldn't do conv binary in-place fusion.
```
       a
      / \
  conv   |
      \  |
       add
```
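
A hedged eager-mode sketch of the pattern being rejected (the module and shapes are illustrative): `a` feeds both the conv and the add, so writing the fused sum in place into `a`'s buffer would clobber the conv's own input.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 3, kernel_size=1)
a = torch.randn(1, 3, 8, 8)

out = conv(a) + a   # must stay out-of-place: an in-place sum into `a` aliases the conv's input
```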

**TestPlan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_failed_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117331
Approved by: https://github.com/jgong5
ghstack dependencies: #117330
2024-01-12 10:34:11 +00:00
19a9fdbf3a Add more alias and mutation check for other input of Conv Binary Inplace fusion (#117330)
**Summary**
Fix the issue: https://github.com/pytorch/pytorch/issues/117108.
Use the out-of-place conv binary fusion when the other input has type `TensorBox(View(ReinterpretView()))`, since the other input is a view of some other tensor.

**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_failed_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117330
Approved by: https://github.com/jgong5
2024-01-12 10:29:33 +00:00
f7d9047864 [inductor] Iterative percolate tags (#117306)
Fixes https://github.com/pytorch/pytorch/issues/116581

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117306
Approved by: https://github.com/aorenste, https://github.com/eellison
2024-01-12 07:52:32 +00:00
47c9d12ffd Add super().setUp() to TestFFT1D (#117329)
One day I'll move the check to be somewhere else so we don't need to worry about this anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117329
Approved by: https://github.com/huydhn
2024-01-12 07:47:01 +00:00
50049cfaa0 [1/4] Intel GPU Runtime Upstreaming for Device (#116019)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the first runtime component we would like to upstream is `Device`, which contains the device management functions of the Intel GPU runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`.

# Design
An Intel GPU device is a wrapper of a SYCL device on which kernels can be executed. In our design, we maintain a SYCL device pool containing all the GPU devices of the current machine and let PyTorch manage the status of the device pool. Thread-local safety is considered in this design. The corresponding C++ files related to `Device` will be placed in the c10/xpu folder. And we provide the c10 device runtime APIs, like
  - `c10::xpu::device_count`
  - `c10::xpu::set_device`
  - ...

# Additional Context
In our plan, 4 PRs should be submitted to PyTorch for `Device`:
1. for c10
2. for aten
3. for python frontend
4. for lazy initialization shared with CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
2024-01-12 07:36:25 +00:00
7dac2f9f2d [export][ez] Fix getting meta["val"] (#117313)
Summary: Integer inputs do not have a meta["val"].

Test Plan: `buck run @//mode/dev-nosan  //executorch/examples/portable/scripts:export -- -m emformer_predict` passes the export step

Differential Revision: D52716419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117313
Approved by: https://github.com/kirklandsign, https://github.com/tugsbayasgalan
2024-01-12 06:17:38 +00:00
40f12cec93 Change predispatch tracing API (#117278)
Summary: Change the API used in export for aotinductor

Test Plan: buck2 run mode/opt mode/inplace caffe2/test/inductor/fb:test_group_batch_fusion_fb

Differential Revision: D52678653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117278
Approved by: https://github.com/angelayi, https://github.com/khabinov
2024-01-12 06:10:02 +00:00
ec443089c7 enable fp16 mkldnn fusion/prepack in inductor (#117206)
- Extend `linear/conv/rnn` packable with `float16`.
- Extend `Unary fusion` to support `float16`.

Test Case:
    Extend bfloat16 related test in `test_cpu_repro.py` and `test_mkldnn_pattern_matcher.py` to test both `fp16` and `bf16`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117206
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-01-12 06:08:42 +00:00
9d5954e2a9 ignore ill-formed solution of reduce_inequalities (#117310)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/117033

Sometimes the solution returned by `sympy.solvers.inequalities.reduce_inequalities` can contain sub-expressions of the form `CRootOf(...)`, denoting the complex root of some equation in `x`, where `x` is an arbitrary symbol. We will now gracefully fail when this happens, like we already do when the solver itself fails.
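
A hedged standalone illustration of the failure mode (the inequality is made up, and different sympy versions may print slightly different output): high-degree constraints can come back bounded by `CRootOf(...)` terms, which have no usable closed form.

```python
import sympy
from sympy.solvers.inequalities import reduce_inequalities

x = sympy.Symbol("x", real=True)
sol = reduce_inequalities([x**5 + x - 3 >= 0], [x])
print(sol)   # e.g. x >= CRootOf(x**5 + x - 3, 0)
```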

Test Plan: added a test

Differential Revision: D52715578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117310
Approved by: https://github.com/ezyang
2024-01-12 06:01:13 +00:00
638f85fd67 Add default parameters to rrelu_with_noise() (#117141)
Summary:
rrelu_with_noise() was listed as having default parameters in the schema but the
actual code definition didn't have them.

The failing example was calling rrelu() which DOES have default parameters and
it passes those defaulted values to C++. Under the covers the C code was calling
the python version of rrelu_with_noise().

Although the C++ code was passing all the values to the python version of
rrelu_with_noise(), the pytorch C++ -> Python dispatch code looks at the schema
and strips any parameters which match the schema's listed defaults, so if the
schema shows defaults that aren't in the code, it is a problem.

Test Plan:
I added a unit test for this specific case. It would probably be better to write
a more general one to validate all the ops against their schemas - but I haven't
learned enough about the test harness to do that yet.

Fixes #115811

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117141
Approved by: https://github.com/yanboliang, https://github.com/oulgen
2024-01-12 05:32:13 +00:00
d29bf0a37e Fix ONNXProgram.save to use torch.load(..., mmap=True) for large models (#117295)
During ONNXProgram.save, the implicit/explicit state_dict passed in must
be loaded in memory in order to read each initializer and create an
external tensor proto from it.

This PR ensures torch.load uses memory mapping to support large models that
cannot fit in memory.
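
A hedged sketch of the underlying mechanism (the checkpoint path is hypothetical): with `mmap=True`, torch.load memory-maps the file instead of materializing every tensor in RAM up front.

```python
import torch

# Hypothetical checkpoint path; mmap=True keeps tensor storages backed by the file on disk.
state_dict = torch.load("model_checkpoint.pt", mmap=True, map_location="cpu")
```
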
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117295
Approved by: https://github.com/BowenBao
ghstack dependencies: #117294
2024-01-12 04:38:27 +00:00
b62ba82cdc Update initializer path for ONNXProgram.save due to onnx.checker limitation (#117294)
According to https://github.com/onnx/onnx/blob/main/docs/ExternalData.md#large-models-2gb when initializers are larger than 2GB, `onnx.checker` requires the model to be in the same directory as the initializer.

Although not strictly necessary for the export and model save to succeed, it is desirable to have onnx.checker succeed when validating the resulting large model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117294
Approved by: https://github.com/BowenBao
2024-01-12 04:22:12 +00:00
b3b585af64 Revert "[codemod] markDynamoStrictTest batch 16 (#117218)"
This reverts commit 47119785acbfe20d9ef6cf5d90887a441402f5c7.

Reverted https://github.com/pytorch/pytorch/pull/117218 on behalf of https://github.com/zou3519 due to just felt like reverting this ([comment](https://github.com/pytorch/pytorch/pull/117218#issuecomment-1888360366))
2024-01-12 03:06:20 +00:00
ac0bed01df Revert "[dynamo] GetItemSource - restrict the supported index Source to be GlobalWeakRefSource (#117138)"
This reverts commit c278a1b39c8ae33feaa4a87b35b721fff7fdf19a.

Reverted https://github.com/pytorch/pytorch/pull/117138 on behalf of https://github.com/zou3519 due to Broke jobs on main, I'm not sure why ([comment](https://github.com/pytorch/pytorch/pull/117138#issuecomment-1888290068))
2024-01-12 01:55:49 +00:00
3214ada631 [MPS][BE] Better format nested ternary (#117198)
- Replace double ternary with if + ternary
- Replace deprecated `AT_ASSERT` with `TORCH_INTERNAL_ASSERT`
- Replace regular asserts with `TORCH_CHECK` or `TORCH_INTERNAL_ASSERT` depending on context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117198
Approved by: https://github.com/Skylion007
2024-01-12 01:29:17 +00:00
04604eea8a [inductor] check nan/inf for graph inputs (#117189)
This is split out from #103469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117189
Approved by: https://github.com/jansel
2024-01-12 00:59:32 +00:00
47119785ac [codemod] markDynamoStrictTest batch 16 (#117218)
[codemod] markDynamoStrictTest test_dataloader
[codemod] markDynamoStrictTest test_public_bindings
[codemod] markDynamoStrictTest test_namedtensor
[codemod] markDynamoStrictTest test_fx
[codemod] markDynamoStrictTest test_content_store
[codemod] markDynamoStrictTest test_schema_check
[codemod] markDynamoStrictTest lazy/test_ts_opinfo
[codemod] markDynamoStrictTest functorch/test_ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117218
Approved by: https://github.com/bdhirsh
2024-01-12 00:32:36 +00:00
c278a1b39c [dynamo] GetItemSource - restrict the supported index Source to be GlobalWeakRefSource (#117138)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117138
Approved by: https://github.com/jansel
2024-01-11 23:26:25 +00:00
5d2d21a7be [bfloat16][easy] kthvalue, median (#117279)
Fixes #109991
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117279
Approved by: https://github.com/Skylion007
2024-01-11 22:44:07 +00:00
5c6e7962f4 [c10d][EZ] Add more logs in the destructor of ProcessGroupNCCL for better root cause investigation (#117291)
Add logs to the place where we inspect whether a hang happens.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117291
Approved by: https://github.com/XilunWu, https://github.com/shuqiangzhang
2024-01-11 22:33:30 +00:00
53cba40651 [Distributed] Fix tests when CUDA not available (#117163)
NCCL tests failed after https://github.com/pytorch/pytorch/pull/116217 when PyTorch was not built with CUDA. This PR fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117163
Approved by: https://github.com/malfet, https://github.com/wanchaol
2024-01-11 22:27:43 +00:00
9f87760160 Revert "[Nested Tensor]Support SDPA math fallback for jagged layout nested tensor (#116445)"
This reverts commit e55a778cbb518e54c5afa5b8107b352746d7f41a.

Reverted https://github.com/pytorch/pytorch/pull/116445 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but i see it fails ROCm test in trunk due to an unsupported use case e55a778cbb ([comment](https://github.com/pytorch/pytorch/pull/116445#issuecomment-1888060036))
2024-01-11 22:21:45 +00:00
0a5aa5c2d1 [pt-vulkan][ez] Remove reference to c10::MemoryFormat from api/ folder (#117183)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of an Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset removes references to `c10::MemoryFormat` in `api/Tensor.[h,cpp]`; when constructing a `vTensor`, the `api::StorageType` (i.e. whether the tensor will be backed by buffer or texture storage) and `api::GPUMemoryLayout` (i.e. which dimension will be the fastest moving dimension) must be specified directly.

Differential Revision: [D52662234](https://our.internmc.facebook.com/intern/diff/D52662234/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117183
Approved by: https://github.com/liuk22, https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178, #117179, #117180, #117181
2024-01-11 22:08:29 +00:00
8b0bfb3aaa [FSDP] remove unused flat_param_part_view (#117082)
flat_param_part_view is unused in pytorch repo: https://fburl.com/ssaomd7x

It became unused after the refactoring in https://github.com/pytorch/pytorch/pull/115497.

Before that, the original code was as below. Since flat_param is 1D, we do
not need .view for reshaping:

```
self.flat_param.data = padded_unsharded_flat_param[
    : unsharded_size.numel()
].view(
    unsharded_size
)
```

unit test: pytest test/distributed/fsdp/test_fsdp_core.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117082
Approved by: https://github.com/awgu, https://github.com/wconstab, https://github.com/Skylion007
2024-01-11 21:59:51 +00:00
3c66c89057 [pt-vulkan] Replace c10::ScalarType with native equivalent (#117181)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of an Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset introduces `api::ScalarType` in `api/Types.h`, which is intended to function the same as `c10::ScalarType`; thus `api/Types.h` is the primary file of interest. The rest of the changes are straightforward replacements of `c10::ScalarType` with `api::ScalarType`.

Differential Revision: [D52662237](https://our.internmc.facebook.com/intern/diff/D52662237/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117181
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178, #117179, #117180
2024-01-11 21:43:33 +00:00
331ae7f89f [pt-vulkan][ez] Replace c10::overflows with native equivalent (#117180)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of an Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset is very straightforward, as it simply copies the required components of `c10::overflows` from [`c10/util/Half.h`](https://github.com/pytorch/pytorch/blob/main/c10/util/Half.h#L477) into `api/Utils.h`.

Differential Revision: [D52662236](https://our.internmc.facebook.com/intern/diff/D52662236/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117180
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178, #117179
2024-01-11 21:43:33 +00:00
4205892be6 [pt-vulkan][ez] Replace ArrayRef with std::vector<T>& (#117179)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of an Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset replaces all instances of `c10::ArrayRef<T>` with `std::vector<T>&` and all instances of `c10::IntArrayRef` with `std::vector<int64_t>&`. There are a lot of changes in this changeset but that is simply due to the large number of callsites. All the changes are straightforward replacements.

Differential Revision: [D52662235](https://our.internmc.facebook.com/intern/diff/D52662235/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117179
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178
2024-01-11 21:43:15 +00:00
b209de6699 [pt-vulkan] Replace TORCH_CHECK and similar macros with native equivalents (#117178)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of an Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset introduces `api::Error` class in `api/Exception.h`, which is a more barebones copy of the `c10::Error` class from [`c10/util/Exception.h`](https://github.com/pytorch/pytorch/blob/main/c10/util/Exception.h). The macros `VK_CHECK_COND` (equivalent to `TORCH_CHECK(cond, msg)`) and `VK_THROW` (equivalent to `TORCH_CHECK(false, msg)` are introduced as well to replace calls to `TORCH_CHECK()` and similar macros.

Although this is a large diff, the most meaningful changes are in the added files `api/Exception.[h,cpp]` and `api/StringUtil.[h,cpp]` (which is mostly adapted from [`c10/util/StringUtil.h`](https://github.com/pytorch/pytorch/blob/main/c10/util/StringUtil.h)) which implements `api::Error` and the new macros. The rest of the diff is replacing calls to `TORCH_CHECK()` and similar macros with `VK_CHECK_COND()` and `VK_THROW()`.

Differential Revision: [D52662233](https://our.internmc.facebook.com/intern/diff/D52662233/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117178
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177
2024-01-11 21:43:15 +00:00
fe298e901a [pt-vulkan][ez] Replace ska::flat_hash_map, c10::get_hash with std::unordered_map, std::hash (#117177)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of an Android GPU delegate for ExecuTorch.

## Notes for Reviewers

The majority of the changes in this changeset are:

* Replacing instances of `ska::flat_hash_map` with `std::unordered_map`
   * `ska::flat_hash_map` is an optimized hash map, but the optimizations shouldn't be too impactful so `std::unordered_map` should suffice. Performance regression testing will be done at the final change in this stack to verify this.
* Replacing `c10::get_hash` with `std::hash` where only one variable is getting hashed or the `utils::hash_combine()` function added to `api/Utils.h` (which was copied from `c10/util/hash.h`)

Differential Revision: [D52662231](https://our.internmc.facebook.com/intern/diff/D52662231/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117177
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176
2024-01-11 21:43:15 +00:00
57b76b970b [pt-vulkan][ez] Miscellaneous small c10 deprecations (c10::irange, C10_LIKELY, c10::SmallVector, etc.) (#117176)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of an Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset deprecates various easy-to-replace symbols from the `c10` library with either C++ STL equivalents or by using copying those `c10` symbols as native equivalents. The symbols that were impacted are:

* `c10::irange`
  * removed and replaced with standard for loops
* `C10_LIKELY` and `C10_UNLIKELY`
  * These macros allow for some branch re-ordering compiler optimizations when building with GCC. They aren't strictly necessary and their impact is likely minimal so these have simply been removed.
* `c10::SmallVector<T, N>`
  * My understanding is that `c10::SmallVector<T, N>` is essentially a wrapper around `std::vector<T>` that is optimized for array sizes up to `N`. I don't believe that this optimization is worth creating a native equivalent, so I replaced instances of this symbol with `std::vector<T>`
* `c10::multiply_integers`
  * This function is simply a convenient wrapper around `std::accumulate`, so I copied it as a native equivalent in `api/Utils.h`

This changeset comprises entirely of the replacements described above.

Differential Revision: [D52662232](https://our.internmc.facebook.com/intern/diff/D52662232/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117176
Approved by: https://github.com/yipjustin
2024-01-11 21:42:24 +00:00
24c39bb5e5 Upgrade nightly wheels to rocm6.0 (#116983)
Follow-up to https://github.com/pytorch/builder/pull/1647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116983
Approved by: https://github.com/jeffdaily, https://github.com/atalman
2024-01-11 20:36:00 +00:00
e55a778cbb [Nested Tensor]Support SDPA math fallback for jagged layout nested tensor (#116445)
Support this fallback by converting the jagged layout NT to a strided layout NT, and then converting the result back to a jagged layout NT.
This fallback might not be efficient since it uses unbind, contiguous and split.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116445
Approved by: https://github.com/soulitzer
2024-01-11 20:28:40 +00:00
92cc8ae172 [FSDP] Cloned unsharded tensor slice in optim state dict load (#117261)
This takes the fix from https://github.com/pytorch/pytorch/issues/116553. Cloning the slice allows the base (much larger) tensor to be freed.
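
A hedged illustration of why the clone matters (the sizes are arbitrary): a plain slice keeps the whole base storage alive, while `.clone()` copies only the slice so the much larger base tensor can be freed.

```python
import torch

base = torch.randn(1_000_000)
view = base[:10]           # still references base's full storage
owned = base[:10].clone()  # owns a small, independent storage

print(view.untyped_storage().size(), owned.untyped_storage().size())  # ~4,000,000 vs 40 bytes
```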

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117261
Approved by: https://github.com/wz337
2024-01-11 20:21:12 +00:00
88bf84f106 [benchmark] add --compile-autograd to dynamo benchmarks (#117196)
Adds `--compile-autograd` flag to benchmark suite to run accuracy and performance tests. Also adds autograd_captures and autograd_compiles to dynamo stats

e.g. accuracy_inductor.csv
```
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,BERT_pytorch,4,pass,2655,2,8,7,1,1
cuda,Background_Matting,4,pass_due_to_skip,0,0,0,0,0,0
cuda,DALLE2_pytorch,0,eager_fail_to_run,0,0,0,0,0,0
cuda,LearningToPaint,4,pass,639,2,8,7,1,1
...
```

e.g. speedup_inductor.csv
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,hf_T5,8,1.214311,136.236793,88.350570,0.751322,18.754706,24.962275,3298,2,8,8,1,1
cuda,hf_T5,8,1.226645,135.431856,52.461461,1.040973,18.754706,18.016508,795,1,7,7,0,0
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117196
Approved by: https://github.com/jansel
2024-01-11 20:12:58 +00:00
83c45a9931 Faster gc_count update for CUDACachingAllocator (and avoid nullptr de… (#117064)
…reference) (#109065)

Summary:

Modify the way we update gc_count in CUDACachingAllocator to make it faster.

Originally D48481557, but reverted due to a nullptr dereference in some cases (D49003756). This diff changes to use the correct constructor for the search key (avoiding the nullptr dereference). It also adds a nullptr check (returning 0 in that case) in the gc_count functions.

Differential Revision: D49068760

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117064
Approved by: https://github.com/zdevito
2024-01-11 19:47:05 +00:00
5bc896e5dc Dockerfile; Add cuda bin to PATH (#117105)
We need this to execute `nvidia-smi` in the officially released containers. We already have it in the Docker CI.

See
94db6578cc/.ci/docker/linter-cuda/Dockerfile (L35)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117105
Approved by: https://github.com/atalman
2024-01-11 18:10:19 +00:00
9e3580f793 Fix #117011: add the TORCH_CHECK(grad_output) of upsample_nearest::backward() (#117100)
Add a TORCH_CHECK on grad_output in upsample_nearest::backward().

Fixes #117011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117100
Approved by: https://github.com/lezcano
2024-01-11 18:06:22 +00:00
f89725fb41 [DCP][BC] Add the backward compatibility test (#116247)
This PR adds a test to ensure all metadata is backward compatible with the older definition.

Differential Revision: [D52357733](https://our.internmc.facebook.com/intern/diff/D52357733/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116247
Approved by: https://github.com/wz337
ghstack dependencies: #116245, #116246
2024-01-11 18:01:35 +00:00
7e9cbc6834 [CI] Catch more exception types when running eager in PT2 tests (#117120)
Summary: https://github.com/pytorch/pytorch/actions/runs/7467073391/job/20320251143#step:16:1332 shows a case where model loading fails with a KeyError but the error is not logged in the report csv file, which can cause an eager model failure to be silently ignored in the PT2 integration test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117120
Approved by: https://github.com/huydhn
2024-01-11 17:46:11 +00:00
5b24877663 Improve uint{16,32,64} dlpack/numpy compatibility (#116808)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116808
Approved by: https://github.com/malfet, https://github.com/albanD
2024-01-11 17:01:54 +00:00
623b7fedc4 [c10d] Add comments to the rest environment variable within NCCLPG (#117092)
Not every environment variable within NCCLPG has a comment; let's add comments to each of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117092
Approved by: https://github.com/kwen2501
ghstack dependencies: #116545
2024-01-11 16:47:25 +00:00
3d1869d0ae [DCP][BE] Improve the readability of filesystem and fsspec filesystem (#116246)
1. Better typing
2. Remove 1-liner function

Differential Revision: [D52357731](https://our.internmc.facebook.com/intern/diff/D52357731/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116246
Approved by: https://github.com/wz337
ghstack dependencies: #116245
2024-01-11 16:27:21 +00:00
4c7b602645 Add Support For Symbolic Shapes in Register_replacement, SDPA Pattern Matching (#115441)
Many of our pattern-matching replacements are specified as a `search_fn` and a `replacement_fn`. The search functions are traced out once with static shapes, converted to a pattern, and then matched against every graph compiled with inductor.

The static-shape patterns would not match graphs traced out with dynamic shapes because SymInts are added to the graph as `sym_size` fx nodes, which add additional uses and prevent matching. The previous PR partially addresses this by deduping SymInts that are resolvable to graph inputs, as is the calling convention in aot autograd.

This PR adjusts our matching of the `search_fn` by adding SymInts to the arguments we trace out the search_fn with so that their symint accesses are deduped. Later, if we have a match, we will trace out the replacement graph with the correct Tensors and corresponding symbolic shapes that will get added to the graph.

Note: the replacement patterns will insert sym_size uses which could potentially be removed, but I'll leave that for follow up.

Fix for https://github.com/pytorch/pytorch/issues/111190.
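
As a standalone illustration (not the pattern-matcher code itself) of why dynamic shapes complicate matching: tracing the same function with symbolic shapes records extra sym_size nodes that a statically traced pattern never contains.

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    return x.reshape(x.shape[0], -1) * 2

static_gm = make_fx(f)(torch.randn(4, 8))                            # static shapes
dynamic_gm = make_fx(f, tracing_mode="symbolic")(torch.randn(4, 8))  # symbolic shapes

print(static_gm.graph)   # no sym_size nodes
print(dynamic_gm.graph)  # contains sym_size nodes feeding the reshape
```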

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115441
Approved by: https://github.com/jansel
ghstack dependencies: #116158
2024-01-11 15:58:37 +00:00
bfc336308a Revert "Error grad mode op in export API (#117187)"
This reverts commit 89ef426ba0d87091303f6a3c21c38749f9af72a3.

Reverted https://github.com/pytorch/pytorch/pull/117187 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/117187#issuecomment-1887363580))
2024-01-11 15:01:36 +00:00
767e1b6349 Revert "Bring docstring to .pyi file (#114705)"
This reverts commit 0dd5deecedd136852c7ccc81630eaefbebe5be29.

Reverted https://github.com/pytorch/pytorch/pull/114705 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114705#issuecomment-1887165326))
2024-01-11 13:30:44 +00:00
7005a4bcb6 [dynamo] Added dyn shapes support for math trigo ops: sin(h), cos(h), tan(h) ... (#114866)
Description:
- Added dynamic shapes support for math trigo ops: sin(h), cos(h), tan(h) ...

```python
import math
import torch

def func(x, a, b):
    c = 0
    c = c + math.sqrt(a)
    c = c + math.cos(a)
    c = c + math.cosh(a)
    c = c + math.sin(a)
    c = c + math.sinh(a)
    c = c + math.tan(a)
    c = c + math.tanh(a)
    c = c + math.asin(b)
    c = c + math.acos(b)
    c = c + math.atan(a)
    y = x + c
    return y

cfunc = torch.compile(func, dynamic=True, fullgraph=True)

device = "cpu"  # or "cuda"
x = torch.tensor([0, 1, 2, 3], dtype=torch.float32, device=device)
a = 12
b = 1

out = cfunc(x, a, b)
expected = func(x, a, b)
torch.testing.assert_close(out, expected)
```

and the graph `TORCH_LOGS=+graph_code python check_math_ops.py`:

<details>
<summary>
graph code
</summary>

```
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] TRACED GRAPH
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  ===== __compiled_fn_0 =====
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]     def forward(self, L_a_ : torch.SymInt, s1 : torch.SymInt, L_x_ : torch.Tensor):
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_a_ = L_a_
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_x_ = L_x_
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:57, code: c = c + math.sqrt(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_sqrt = torch.sym_sqrt(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add = 0 + sym_sqrt;  sym_sqrt = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:58, code: c = c + math.cos(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_cos = torch.sym_cos(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_1 = add + sym_cos;  add = sym_cos = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:59, code: c = c + math.cosh(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_cosh = torch.sym_cosh(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_2 = add_1 + sym_cosh;  add_1 = sym_cosh = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:60, code: c = c + math.sin(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_sin = torch.sym_sin(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_3 = add_2 + sym_sin;  add_2 = sym_sin = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:61, code: c = c + math.sinh(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_sinh = torch.sym_sinh(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_4 = add_3 + sym_sinh;  add_3 = sym_sinh = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:62, code: c = c + math.tan(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_tan = torch.sym_tan(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_5 = add_4 + sym_tan;  add_4 = sym_tan = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:63, code: c = c + math.tanh(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_tanh = torch.sym_tanh(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_6 = add_5 + sym_tanh;  add_5 = sym_tanh = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:64, code: c = c + math.asin(b)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_7 = add_6 + 1.5707963267948966;  add_6 = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:65, code: c = c + math.acos(b)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_8 = add_7 + 0.0;  add_7 = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:66, code: c = c + math.atan(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_atan = torch.sym_atan(l_a_);  l_a_ = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_9 = add_8 + sym_atan;  add_8 = sym_atan = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:67, code: y = x + c
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         y = l_x_ + add_9;  l_x_ = add_9 = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         return (y,)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
```
</details>

Generated code with `TORCH_LOGS=+output_code python check_math_ops.py`:
<details>
<summary>
C++ code
</summary>

```
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] cpp_fused_add_0 = async_compile.cpp('''
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] #include "/tmp/torchinductor_root/2l/c2ljzlm4sosod7u6lyrroqdba6hmfcyijrric6p4t3fhbcmw6osp.h"
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] extern "C" void kernel(const float* in_ptr0,
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]                        float* out_ptr0,
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]                        const long ks0,
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]                        const long ks1)
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] {
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]     {
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]         #pragma GCC ivdep
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]         for(long x0=static_cast<long>(0L); x0<static_cast<long>(ks0); x0+=static_cast<long>(1L))
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]         {
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]             auto tmp0 = in_ptr0[static_cast<long>(x0)];
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]             auto tmp1 = c10::convert<float>(1.57079632679490 + (std::sqrt(ks1)) + (std::atan(ks1)) + (std::cos(ks1)) + (std::cosh(ks1)) + (std::sin(ks1)) + (std::sinh(ks1)) + (std::tan(ks1)) + (std::tanh(ks1)));
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]             auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]             out_ptr0[static_cast<long>(x0)] = tmp2;
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]         }
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]     }
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] }
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] ''')
```

</details>

<details>
<summary>
Triton code
</summary>

```
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] @pointwise(
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     size_hints=[4],
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     filename=__file__,
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32', 3: 'i32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1), equal_to_1=(), i
ds_of_folded_args=(), divisible_by_8=())]},
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     inductor_meta={'autotune_hints': set(), 'kernel_name': 'triton_poi_fused_add_0', 'mutated_arg_names': []},
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     min_elem_per_thread=0
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] )
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] @triton.jit
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] def triton_(in_ptr0, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     xoffset = tl.program_id(0) * XBLOCK
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     xindex = xoffset + tl.arange(0, XBLOCK)[:]
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     xmask = xindex < xnumel
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     x0 = xindex
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tmp0 = tl.load(in_ptr0 + (x0), xmask)
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tmp1 = 1.57079632679490 + (tl.math.sqrt(ks0.to(tl.float32))) + (tl.math.atan((ks0).to(tl.float32))) + (tl.math.cos((ks0).to(tl.float32))) + (tl.math.cosh((ks0).to(tl.float32))) + (tl.math.sin((ks0)
.to(tl.float32))) + (tl.math.sinh((ks0).to(tl.float32))) + (tl.math.tan((ks0).to(tl.float32))) + (tl.math.tanh((ks0).to(tl.float32)))
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tmp2 = tmp1.to(tl.float32)
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tmp3 = tmp0 + tmp2
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tl.store(out_ptr0 + (x0), tmp3, xmask)
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] ''')
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114866
Approved by: https://github.com/peterbell10
2024-01-11 11:52:28 +00:00
cyy
2b5a201aa6 [Exception] [3/N] Replace torch::NotImplementedError and torch::LinAlgError with C10 counterparts. (#116824)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116824
Approved by: https://github.com/albanD
2024-01-11 11:27:04 +00:00
89ef426ba0 Error grad mode op in export API (#117187)
Summary:
This is reland of https://github.com/pytorch/pytorch/pull/116339
Needed some internal adjustments to make it work properly. Original credit goes to andrewlee302

Test Plan: CI

Differential Revision: D52674706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117187
Approved by: https://github.com/suo
2024-01-11 09:06:59 +00:00
0e1f43c44d [inductor] don't access cluster_dims for too old version of triton (#117192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117192
Approved by: https://github.com/masnesral
2024-01-11 08:37:30 +00:00
3b2ddb6f71 Update TorchBench pinned commit (#117073)
~~To match their recent v4.36.2 release https://github.com/huggingface/transformers/commits/v4.36.2.  This is to fix the KeyError showing on release branch https://github.com/pytorch/pytorch/actions/runs/7451512288/job/20279117324#step:16:1336.  I think this can be updated in main too because the current pinned commit is already 4-month old.~~

Checked with @desertfire; trying to update the TorchBench pinned commit instead.

The test is also failing in main https://github.com/pytorch/pytorch/actions/runs/7467073391/job/20320251143#step:16:1120, but for some reason, it doesn't surface as a failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117073
Approved by: https://github.com/atalman, https://github.com/thiagocrepaldi, https://github.com/desertfire
2024-01-11 08:35:00 +00:00
1cefc58905 init tls grad_mode/local_dispatch_key set while fork new thread in (#113246)
TorchDynamo will guard grad_mode and the local dispatch key set.
3a429423fc/torch/csrc/dynamo/guards.cpp (L13-L16)

When using ThroughputBenchmark, those TLS states are not initialized to match the main thread's state.
3a429423fc/torch/csrc/utils/throughput_benchmark-inl.h (L64-L94)

Running the following script
```
import torch
linear = torch.nn.Linear(128, 128)
compiled = torch.compile(linear)
x = torch.rand(10, 128)
with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
    compiled(x)
    compiled(x)

from torch._dynamo import config
config.error_on_recompile = True
from torch.utils import ThroughputBenchmark
with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
    bench = ThroughputBenchmark(compiled)
    bench.add_input(x)
    stats = bench.benchmark(
        num_calling_threads=10,
        num_warmup_iters=100,
        num_iters=100,
    )
    print(stats)
```
will lead to two recompiles, with the following guard failure reasons:
```
triggered by the following guard failure(s): ___check_global_state()
triggered by the following guard failure(s): tensor 'x' dispatch key set mismatch.
```

This triggers a recompile in TorchDynamo. But since `ThroughputBenchmark` is used for sharing weights across threads, the model should not change while running the benchmark. So this PR initializes the worker threads' TLS state to match the main thread, allowing `ThroughputBenchmark` to run TorchDynamo-optimized models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113246
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-01-11 08:31:46 +00:00
9f57cf502f [inductor][cpu]disable pointwise_cat on CPU (#116313)
We observed a negative performance impact from the pointwise_cat optimization on CPU, so we disabled it. We will revisit this later after enabling vectorization on index_expr.

This PR fixes the following three regression issues:
https://github.com/pytorch/pytorch/issues/115827
https://github.com/pytorch/pytorch/issues/112139
https://github.com/pytorch/pytorch/issues/114495

and causes a performance regression for pytorch_unet again. Related issue: https://github.com/pytorch/pytorch/issues/115343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116313
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
2024-01-11 08:00:00 +00:00
e3d4f4d14b [ProxyTensor] dedupe symbolic shapes in tracing (#116158)
Dedupes symbolic shapes in proxy tensor tracing. Reusing the existing sym shape avoids inserting spurious sym_size calls, which can interfere with pattern matching and graph passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116158
Approved by: https://github.com/ezyang
2024-01-11 07:15:11 +00:00
6f9fcc79c2 [DCP][BE] Remove unused fields (#116245)
As title

Differential Revision: [D52357730](https://our.internmc.facebook.com/intern/diff/D52357730/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116245
Approved by: https://github.com/wz337
2024-01-11 06:03:09 +00:00
263cc12fab Add Dynamo Reset in PT2E Quantization testing (#117200)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/117012 by adding `torch._dynamo.reset()` in `PT2EQuantizationTestCase._quantize`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117200
Approved by: https://github.com/jerryzh168
2024-01-11 05:53:55 +00:00
5ae221a214 [ONNX] Refactor op consistency tests (#116319)
Fixes #105338

This PR changes the op consistency tests from manually adding ops to a testing list to automatically testing all ops in the registry. It also surfaces more complex-dtype bugs in the converter.

Overall, this PR provides:
(1) Full test coverage of the ONNX registry
(2) More complete complex-dtype support
(3) Testing only the same dtypes as torchlib
(4) Automatic xfail of unsupported nodes

Follow-up issue: https://github.com/pytorch/pytorch/issues/117118
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116319
Approved by: https://github.com/justinchuby
2024-01-11 05:17:40 +00:00
9b1fac694e [c10d] Add extra sleep in waitForDumpOrTimeout to ensure enough time for all ranks dump debug info (#116545)
We added an extra sleep and made it configurable so that users can set an extra wait to ensure all ranks have dumped the debug info.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116545
Approved by: https://github.com/wconstab
2024-01-11 04:39:57 +00:00
ca23c56efc [codemod] markDynamoStrictTest batch 15 (#117139)
[codemod] markDynamoStrictTest test_spectral_ops
[codemod] markDynamoStrictTest test_fx_experimental
[codemod] markDynamoStrictTest test_foreach
[codemod] markDynamoStrictTest test_decomp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117139
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114, #117127, #117128, #117129, #117133
2024-01-11 04:28:57 +00:00
9dbe4eae82 [codemod] markDynamoStrictTest batch 14 (#117133)
[codemod] markDynamoStrictTest test_utils
[codemod] markDynamoStrictTest test_unary_ufuncs
[codemod] markDynamoStrictTest test_sparse_semi_structured
[codemod] markDynamoStrictTest test_sparse_csr
[codemod] markDynamoStrictTest test_sparse
[codemod] markDynamoStrictTest test_reductions
[codemod] markDynamoStrictTest test_proxy_tensor
[codemod] markDynamoStrictTest test_prims
[codemod] markDynamoStrictTest test_maskedtensor
[codemod] markDynamoStrictTest test_masked
[codemod] markDynamoStrictTest test_legacy_vmap
[codemod] markDynamoStrictTest test_binary_ufuncs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117133
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114, #117127, #117128, #117129
2024-01-11 04:28:57 +00:00
a526d0a926 Skip all OpInfo-based test when running with PYTORCH_TEST_WITH_DYNAMO (#117129)
This is a policy decision. These tests:
- are flaky, and fixing the flakiness is unfeasible at the moment
- are highly redundant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117129
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114, #117127, #117128
2024-01-11 04:28:42 +00:00
dc43ad4286 add is_grad_enabled check in runtime_wrapper before running with torch.no_grad (#117089)
We observed that `with torch.no_grad()` in runtime_wrapper introduced a ~10% (0.06ms -> 0.066ms) inference performance regression for lennard_jones on CPU.
For inference tasks in the benchmark, grad is already disabled, but the current runtime_wrapper sets no_grad again and its overhead is counted in the running time.
Therefore, we add an `is_grad_enabled` check in runtime_wrapper before entering torch.no_grad; if grad is already disabled, there is no need to set no_grad.
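
A minimal sketch of the idea; the actual check lives in AOTAutograd's runtime_wrapper, and the helper name here is illustrative.

```python
import contextlib
import torch

def run_wrapped(fn, *args):
    # Entering no_grad has a small but measurable cost, so skip it when grad
    # is already disabled (the common case for inference benchmarks).
    ctx = torch.no_grad() if torch.is_grad_enabled() else contextlib.nullcontext()
    with ctx:
        return fn(*args)
```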

Before this pr:
1.043x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,lennard_jones,1,**1.043427**,**0.068366**,4.756151,0.941846,45.056819,47.838822,9,1,0,0

After this pr:
1.146x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,lennard_jones,1,**1.146190**,**0.061844**,4.468380,0.936456,44.427264,47.441920,9,1,0,0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117089
Approved by: https://github.com/jgong5, https://github.com/bdhirsh
2024-01-11 03:37:45 +00:00
203430a778 [dynamo] easy - better assert message for EQUALS_MATCH guard (#117006)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117006
Approved by: https://github.com/lezcano
ghstack dependencies: #116723
2024-01-11 03:14:43 +00:00
79de14546d [export] Add TORCH_LOGS=export (#116993)
Adds TORCH_LOGS=export, which currently includes dynamo/dynamic logs. In the future, if we add any logs under the torch/export directory, they will also show up under TORCH_LOGS=export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116993
Approved by: https://github.com/avikchaudhuri
2024-01-11 03:02:23 +00:00
6f0f4f12ca [BugFix] Prevent LSTM to run with wrong input shape (#115542)
Fixes #114874
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115542
Approved by: https://github.com/mikaylagawarecki
2024-01-11 02:57:09 +00:00
10509dac85 [C10D] Rename flightrecorder key vars to avoid confusion (#116905)
Key vars are strings used as dict keys (e.g. duration_s was the string
"duration_ms").

The _s suffix suggested a time in seconds, which was confusing since duration_s was a
key string while duration_ms is another variable holding a time value.

Now duration_key is "duration_ms".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116905
Approved by: https://github.com/zdevito
2024-01-11 02:57:04 +00:00
1174e82bde Revert "Add _assert_scalar and teach Inductor to codegen it (#114148)"
This reverts commit b6028acfa46363c1d3262a1522741a06c307843f.

Reverted https://github.com/pytorch/pytorch/pull/114148 on behalf of https://github.com/osalpekar due to Going to revert this given the broken torchrec PT2 tests internally: [D52648865](https://www.internalfb.com/diff/D52648865). Logs aren't too clear but @dstaay-fb can help debug as well ([comment](https://github.com/pytorch/pytorch/pull/114148#issuecomment-1886100368))
2024-01-11 02:30:22 +00:00
0f10a706f6 add a docblock for torch._scaled_mm (#117190)
Summary:

Describes the arguments in more detail. Not in user-facing docs for now, but a step towards getting there eventually.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117190
Approved by: https://github.com/drisspg
2024-01-11 02:22:44 +00:00
edec54b9de Add torch._lazy_clone to create COW tensors (#113397)
Part of #109833

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #113397
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397
Approved by: https://github.com/ezyang
2024-01-11 01:32:44 +00:00
71343507cd Add super().setup in test_numeric (#117148)
Call super().setUp() so that it checks the disabled-tests JSON (and also resets seeds, etc.).
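
A minimal sketch of the pattern this enforces (the class name and test body are illustrative): PyTorch's TestCase.setUp is what consults the disabled-tests JSON and reseeds RNGs, so a subclass overriding setUp must call super().

```python
from torch.testing._internal.common_utils import TestCase, run_tests

class TestNumericExample(TestCase):
    def setUp(self):
        super().setUp()   # checks the disabled-tests JSON, resets RNG seeds, etc.
        self.values = [1.0, 2.0, 3.0]

    def test_sum(self):
        self.assertEqual(sum(self.values), 6.0)

if __name__ == "__main__":
    run_tests()
```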

Test:
Check that test_all_any is skipped in dynamo shard - success
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117148
Approved by: https://github.com/huydhn
2024-01-11 01:03:46 +00:00
cyy
2f17a21b2b [Reland] [13/N] Enable clang-tidy on headers of torch/csrc (#117088)
Reland of #116560 and fixes the issued reported by #116695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117088
Approved by: https://github.com/albanD
2024-01-10 23:58:04 +00:00
8783fe9cf3 [export] Modify SDPA decomposition to decompose _scaled_dot_product_flash_attention_for_cpu (#117097)
Summary: As titled. #115913 added `_scaled_dot_product_flash_attention_for_cpu`, and the export result of `scaled_dot_product_attention` includes this op. Adding this decomposition so that it is decomposed the same way as `_scaled_dot_product_attention_math`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117097
Approved by: https://github.com/lezcano
2024-01-10 23:46:14 +00:00
f70aeb4ffd Fix backward for reshape() on jagged layout NT (#117137)
Provides symbolic C++-side `reshape_as()` / `reshape()` decomps for jagged layout NTs to make the backwards pass work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117137
Approved by: https://github.com/soulitzer
2024-01-10 23:35:07 +00:00
e10cfdd895 Update matmul requires_grad checks (#117067)
Fixes https://github.com/pytorch/pytorch/issues/116099
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117067
Approved by: https://github.com/lezcano, https://github.com/albanD
ghstack dependencies: #116523, #116710
2024-01-10 23:16:42 +00:00
7e6a04e542 Allow unMarkDynamoStrictTest to work on tests (instead of just classes) (#117128)
Tested locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117128
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114, #117127
2024-01-10 22:25:40 +00:00
1b8ebb6c42 [codemod] markDynamoStrictTest batch 13 (#117127)
[codemod] markDynamoStrictTest test_overrides
[codemod] markDynamoStrictTest test_namedtuple_return_api
[codemod] markDynamoStrictTest test_jiterator
[codemod] markDynamoStrictTest test_jit_disabled
[codemod] markDynamoStrictTest test_jit_autocast
[codemod] markDynamoStrictTest test_fx_reinplace_pass
[codemod] markDynamoStrictTest test_fx_passes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117127
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114
2024-01-10 22:25:40 +00:00
79e6d2ae9d Remove incorrect usages of skipIfTorchDynamo (#117114)
Using bare `@skipIfTorchDynamo` (without parentheses) is wrong; the correct usage is
`@skipIfTorchDynamo()` or `@skipIfTorchDynamo("msg")`. The bare form causes the
decorated tests to stop existing.
Added an assertion for this and fixed the incorrect call sites.
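
An illustrative sketch of the wrong and correct forms (the class and test names are made up for this example):

```python
from torch.testing._internal.common_utils import TestCase, skipIfTorchDynamo

class ExampleTests(TestCase):
    # Wrong (what this PR removes): the bare decorator receives the function as
    # its message argument, so the real test silently stops existing.
    #
    #   @skipIfTorchDynamo
    #   def test_something(self): ...

    # Correct: call the decorator factory, optionally with a reason string.
    @skipIfTorchDynamo("not supported under dynamo yet")
    def test_something(self):
        self.assertTrue(True)
```
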
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117114
Approved by: https://github.com/voznesenskym
2024-01-10 22:25:31 +00:00
d6540038c0 Fix 0-dim Index in Index Copy decomp (#117065)
Fix for https://github.com/pytorch/pytorch/issues/115931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117065
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-01-10 22:13:43 +00:00
b9293e74a2 [ROCm] Fixes for hipblasLt for mm use case. (#116537)
This PR fixes the accuracy issues with hipblasLt for the mm case on ROCm.
This PR is a follow up to the integration PR https://github.com/pytorch/pytorch/pull/114329 and https://github.com/pytorch/pytorch/pull/114890

The accuracy issue arises in the mm use case on ROCm when hipblasLt is enabled and a bias is passed even though none is required. This PR addresses that issue.
Added a unit test for this (bias=None) case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116537
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-01-10 22:13:18 +00:00
7e37f63e5e [Reference Cycle Detector] Ignore FakeTensor in cycle leak detection (#117116)
Summary: Skip FakeTensors since these tensors are not actually using GPU memory. Reference Cycle Detector does not need to generate plots for these tensors.
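
A minimal sketch of the filtering idea (the helper name is hypothetical): fake tensors only carry metadata, so they hold no real GPU memory.

```python
import torch
from torch._subclasses.fake_tensor import FakeTensor

def is_real_gpu_tensor(obj) -> bool:
    # Fake tensors only describe shape/dtype/device; they own no real CUDA memory,
    # so the cycle detector can safely ignore them when plotting.
    return (
        isinstance(obj, torch.Tensor)
        and not isinstance(obj, FakeTensor)
        and obj.is_cuda
    )
```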

Test Plan: CI and internal testing.

Differential Revision: D52637209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117116
Approved by: https://github.com/zdevito, https://github.com/tianfengfrank
2024-01-10 21:33:56 +00:00
3e9bb8d4de Run docker release build on final tag (#117131)
To be successful, the Docker release workflow needs to run on the final tag, after the releases to conda and PyPI are complete.

Please refer to: https://github.com/pytorch/pytorch/blob/main/Dockerfile#L76

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117131
Approved by: https://github.com/huydhn, https://github.com/seemethere, https://github.com/malfet
2024-01-10 21:00:45 +00:00
73990c37e6 [c10d] To make ProcessGroupNCCL to use globalStore for coordination (#117075)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117075
Approved by: https://github.com/wconstab
ghstack dependencies: #117074
2024-01-10 20:39:53 +00:00
180425df9b [c10d] Add a recursive method to get the inner most store (#117074)
In c10d PG initialization, we wrap the TCPStore with multiple layers of PrefixStore, each of which adds a layer of prefix.

One example is:
"default_pg/0//cuda//timeout_dump"
When initializing the default PG, because no store is passed, we first add the prefix "default_pg" to the TCPStore returned from rendezvous:

bdeaaad70c/torch/distributed/distributed_c10d.py (L1240)

We then add pg_name (aka 0) bdeaaad70c/torch/distributed/distributed_c10d.py (L1376) and device (aka cuda) bdeaaad70c/torch/distributed/distributed_c10d.py (L1387)

to the prefix. Then, when we call store_->set("timeout_dump"), the actual key used for writing into the TCPStore is "default_pg/0//cuda//timeout_dump".

For sub-PGs, things get even more interesting: we put the store wrapped with the default PG name into a cache:
bdeaaad70c/torch/distributed/distributed_c10d.py (L1517)

When creating each sub-PG, its PG name is appended right after the cached store's prefix. Example keys are:
'default_pg/0//10//cuda//timeout_dump', 'default_pg/0//12//cuda//timeout_dump', 'default_pg/0//38//cuda//timeout_dump', 'default_pg/0//39//cuda//timeout_dump'. (10, 12, 38 and 39 are all PG names of each subPG created)

The reason the number in the name gets so high is that, for each sub-PG creation, all ranks have to call the API together, and the global variable used for the PG name is bumped monotonically:

bdeaaad70c/torch/distributed/distributed_c10d.py (L3666)

Similar things happen when using hashing for PG names.

This has a potential issue: each sub-PG has its own instance of ProcessGroupNCCL, and the added prefix causes bugs if we want to set something global to notify all sub-PGs (and all ranks). For example, if on sub-PG 1 we set a value in the TCPStore with the key 'default_pg/0//1//cuda//timeout_dump', while the default PG instances check the TCPStore using the key 'default_pg/0//cuda//timeout_dump', the default PG instances will never get the notification signal. So in this PR, we added a new API in PrefixStore that retrieves the innermost non-PrefixStore for set and check. The next PR will make the corresponding changes in the NCCL watchdog.
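
A minimal single-process sketch of how the nested PrefixStore layers compose key prefixes (host, port, and prefix strings are illustrative):

```python
import torch.distributed as dist

base = dist.TCPStore("127.0.0.1", 29500, 1, is_master=True)
default_pg_store = dist.PrefixStore("default_pg/0/", base)  # default-PG layer
sub_pg_store = dist.PrefixStore("1/", default_pg_store)     # extra sub-PG name layer

sub_pg_store.set("timeout_dump", "1")
# The write lands in the underlying TCPStore with every prefix layer baked into
# the key (roughly "default_pg/0//1//timeout_dump"), so code that only holds
# default_pg_store looks up a differently prefixed key and never sees the signal.
```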

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117074
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-01-10 20:22:55 +00:00
6f8fc42dba [inductor] Add support for tl.make_block_ptr (#116079)
On A100 this is a small regression:
![image](https://github.com/pytorch/pytorch/assets/533820/b30eee9d-c0fe-4123-99da-d554fc5d0171)

So I will leave it disabled by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116079
Approved by: https://github.com/shunting314
2024-01-10 20:02:49 +00:00
9bf9586c6d Pytest do not rewrite assertions by default (#117060)
From https://pytest.org/en/7.4.x/how-to/assert.html#advanced-assertion-introspection
pytest only rewrites test modules directly discovered by its test collection process, so asserts in supporting modules which are not themselves test modules will not be rewritten.

In CI we usually call the test file (`python test_ops.py`), which then calls run_test which then calls pytest.main, so the test module is already imported as `__main__`, so pytest does not import the test module itself and relies on the already imported module.  (#95844)

However, calling `pytest test_ops.py` relies on pytest to import the module, resulting in asserts being rewritten, so I add --assert=plain to the default opts so we don't have to worry about this anymore. Another way to stop pytest's assertion rewriting for a file is to include `PYTEST_DONT_REWRITE` in the module's docstring.
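
An illustrative sketch of the docstring marker mentioned above (the module contents are made up):

```python
"""Shared test helpers (illustrative module).

PYTEST_DONT_REWRITE
"""
# With the marker in the module docstring, pytest skips assertion rewriting for
# this file even when it imports the module; --assert=plain disables rewriting
# globally instead.

def check_close(a, b, tol=1e-6):
    assert abs(a - b) <= tol, f"{a} != {b} (tol={tol})"
```
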
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117060
Approved by: https://github.com/zou3519
2024-01-10 20:02:45 +00:00
fad7734fa7 [AOTI] Remove caching for compiled model.so (#117087)
Summary: Oleg found that the model.so caching does not include the model weights in the hash key, which can cause incorrect model.so reuse. Since caching is not really necessary in AOT mode, let's just remove it.

Test Plan: CI

Differential Revision: D52647555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117087
Approved by: https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/khabinov
2024-01-10 19:53:27 +00:00
e4e80dc9b3 [FSDP] sharded grad scaler: copy found_inf after waiting on async reduce_all (#115710)
**Expected behavior**: when rank 0 has an inf grad, ranks 1...k should get `found_inf=1` after the `dist.reduce_all`.
**Bug addressed in this PR**: for CPU-offloaded param.grad, when rank 0 has inf, ranks 1...k would not get found_inf=1. This is because `found_inf` was copied before `future.wait` on the async `dist.reduce_all`.
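
A schematic sketch of the required ordering (not the FSDP code itself; it uses the standard `dist.all_reduce` collective and assumes an initialized process group):

```python
import torch
import torch.distributed as dist

def sync_found_inf(found_inf: torch.Tensor) -> torch.Tensor:
    # Launch the async collective, then wait on it BEFORE reading or copying the
    # result; copying first reads a stale (pre-reduction) value on other ranks.
    work = dist.all_reduce(found_inf, op=dist.ReduceOp.MAX, async_op=True)
    work.wait()
    return found_inf.clone()
```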

repro the bug using the newly added unit test: `pytest test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py -k test_sharded_grad_scaler_found_inf`

```
  File "/data/users/weif/pytorch/test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py", line 320, in _test_sharded_grad_scaler_found_inf
    self.assertEqual(
  File "/data/users/weif/pytorch/torch/testing/_internal/common_utils.py", line 3576, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Scalars are not close!

Expected 1.0 but got 2.0.
Absolute difference: 1.0 (up to 1e-05 allowed)
Relative difference: 1.0 (up to 1.3e-06 allowed)
rank: 0 iter: 0 expect origin scale 2.0 to be backed off by 0.5 but got 2.0
```

verify the bug is fixed: `pytest test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py -k test_sharded_grad_scaler_found_inf`

```
test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py dist init r=1, world=8
dist init r=3, world=8
dist init r=7, world=8
dist init r=4, world=8
dist init r=6, world=8
dist init r=2, world=8
dist init r=0, world=8
dist init r=5, world=8
NCCL version 2.19.3+cuda12.0
.                                                                                                                 [100%]

====================================================================== 1 passed, 19 deselected in 27.43s =========================

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115710
Approved by: https://github.com/awgu
2024-01-10 19:17:25 +00:00
9eb842cbd6 Compiled autograd: Lift autograd functions' backward and provide default key for custom autograd functions (#115573)
This PR adds support for torch.autograd.Function subclasses in compiled autograd. We do this by:
- Creating a uid for all torch.autograd.Function via its metaclass. This uid is used in the compiled autograd key, which is a subset of the cache key to the compiled graph
- "Lifting" the backward/saved_tensors, having them as input arguments in the compiled graph
  - Creating proxies to track the backward's inputs and outputs. Since the backward's outputs (grads) have to match the forward's inputs, we pass the node's `input_info` (forward's input sizes) to build the proxies tracking the backward's outputs.
  - Use a `FakeContext` class as a replacement for the autograd node's context object (`BackwardCFunction`) during tracing, only support passing saved_tensors from the forward to the backward
  - Index each backward, to support multiple torch.autograd.Functions in the same graph
  - Special case for `CompiledFunctionBackward`: lifting CompiledFunction would fail 4 tests and requires some skipfiles changes that I'd rather do in a separate PR

Example graph: test_custom_fn_saved_multiple_tensors (eager fw + compiled autograd)
```python
class MyFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)
        return torch.sin(x), torch.sin(y)

    @staticmethod
    def backward(ctx, gO_x, gO_y):
        (x, y) = ctx.saved_tensors
        return gO_x * torch.cos(x), gO_y * torch.cos(y)
```
The backwards is lifted via `getitem_5` and `call_backward`
```python
# Compiled autograd graph
 ===== Compiled autograd graph =====
 <eval_with_key>.0 class CompiledAutograd(torch.nn.Module):
    def forward(self, inputs, sizes, hooks):
        # No stacktrace found for following nodes
        getitem: "f32[]" = inputs[0]
        getitem_1: "f32[10]" = inputs[1]
        getitem_2: "f32[10]" = inputs[2]
        getitem_3: "f32[10]" = inputs[3]
        getitem_4: "f32[10]" = inputs[4];  inputs = None
        expand: "f32[10]" = torch.ops.aten.expand.default(getitem, [10]);  getitem = None
        mul: "f32[10]" = torch.ops.aten.mul.Tensor(expand, getitem_2);  getitem_2 = None
        mul_1: "f32[10]" = torch.ops.aten.mul.Tensor(expand, getitem_1);  expand = getitem_1 = None
        getitem_5 = hooks[0];  hooks = None
        call_backward = torch__dynamo_external_utils_call_backward(getitem_5, (getitem_3, getitem_4), mul_1, mul);  getitem_5 = mul_1 = mul = None
        getitem_6: "f32[10]" = call_backward[0]
        getitem_7: "f32[10]" = call_backward[1];  call_backward = None
        accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, getitem_7);  getitem_4 = getitem_7 = None
        accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_3, getitem_6);  getitem_3 = getitem_6 = None
        return []
```

then is later inlined by dynamo
```python
# Dynamo graph
 ===== __compiled_fn_0 =====
 <eval_with_key>.1 class GraphModule(torch.nn.Module):
    def forward(self, L_inputs_0_ : torch.Tensor, L_inputs_1_ : torch.Tensor, L_inputs_2_ : torch.Tensor, L_inputs_3_ : torch.Tensor, L_inputs_4_ : torch.Tensor):
        getitem = L_inputs_0_
        getitem_1 = L_inputs_1_
        getitem_2 = L_inputs_2_
        x = L_inputs_3_
        y = L_inputs_4_

        # File: <eval_with_key>.0:10, code: expand = torch.ops.aten.expand.default(getitem, [10]);  getitem = None
        expand = torch.ops.aten.expand.default(getitem, [10]);  getitem = None

        # File: <eval_with_key>.0:11, code: mul = torch.ops.aten.mul.Tensor(expand, getitem_2);  getitem_2 = None
        mul = torch.ops.aten.mul.Tensor(expand, getitem_2);  getitem_2 = None

        # File: <eval_with_key>.0:12, code: mul_1 = torch.ops.aten.mul.Tensor(expand, getitem_1);  expand = getitem_1 = None
        mul_1 = torch.ops.aten.mul.Tensor(expand, getitem_1);  expand = getitem_1 = None

        # File: /data/users/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py:412, code: return gO_x * torch.cos(x), gO_y * torch.cos(y)
        cos = torch.cos(x)
        getitem_6 = mul_1 * cos;  mul_1 = cos = None
        cos_1 = torch.cos(y)
        getitem_7 = mul * cos_1;  mul = cos_1 = None

        # File: <eval_with_key>.0:17, code: accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, getitem_7);  getitem_4 = getitem_7 = None
        accumulate_grad__default = torch.ops.inductor.accumulate_grad_.default(y, getitem_7);  y = getitem_7 = None

        # File: <eval_with_key>.0:18, code: accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_3, getitem_6);  getitem_3 = getitem_6 = None
        accumulate_grad__default_1 = torch.ops.inductor.accumulate_grad_.default(x, getitem_6);  x = getitem_6 = None
        return ()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115573
Approved by: https://github.com/jansel
2024-01-10 18:01:28 +00:00
b4a35632f9 Add function to materialize COW storages (#117053)
Summary: From Kurt Mohler, see https://github.com/pytorch/pytorch/pull/113396 (manually imported due to ghimport problems)

Test Plan: sandcastle, OSS CI

Differential Revision: D52610522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117053
Approved by: https://github.com/malfet, https://github.com/kurtamohler
2024-01-10 15:34:16 +00:00
ec98df70f3 [CPU] _vec_softmax_backward, _vec_log_softmax_backward, _vec_logsoftmax: fix CHUNK_SIZE to avoid unnecessarily large allocation (#117029)
Similar to https://github.com/pytorch/pytorch/pull/116990, fixes `CHUNK_SIZE` in `_vec_softmax_backward`, `_vec_log_softmax_backward`, `_vec_logsoftmax`, where `CHUNK_SIZE` is set as
```cpp
int64_t BLOCK_SIZE = 128 * 1024;
int64_t CHUNK_SIZE = std::max<int64_t>(BLOCK_SIZE / dim_size / sizeof(scalar_t), Vec::size());
CHUNK_SIZE = CHUNK_SIZE / Vec::size() * Vec::size();
```
where `BLOCK_SIZE / dim_size / sizeof(scalar_t)` computes the maximum number of inner-dim elements that can fit into the L2 cache, assuming an L2 cache of 128KB, and `CHUNK_SIZE / Vec::size() * Vec::size()` makes `CHUNK_SIZE` a multiple of `Vec::size()`.

Set `CHUNK_SIZE` to the minimum of `CHUNK_SIZE` and `inner_size` to avoid an unnecessarily large `CHUNK_SIZE` and an unnecessarily large allocation for the `max` and `tmp_sum` buffers.
```cpp
auto buffer = std::make_unique<scalar_t []>(CHUNK_SIZE * 2);
scalar_t* input_max_data = buffer.get();
scalar_t* tmp_sum_data = buffer.get() + CHUNK_SIZE;
```

### Performance

Perf data of `_vec_logsoftmax` collected for `dim_size` in range [2^0, 2^9] and `outer_size` in range [2^0, 2^3]. To measure the benefit from avoiding unnecessarily large allocation, values of `outer_size` were chosen such that `outer_size` is less than `BLOCK_SIZE / dim_size / sizeof(scalar_t)` for all values of `outer_size`.

Tested on 28 physical cores/socket, 1 socket on Skylake.

| **dim_size** 	| **BLOCK_SIZE / dim_size / sizeof(scalar_t)** 	| **input shape: (dim_size, inner_size)** 	| **Baseline (original implementation)** 	| **Optimized** 	| **Speedup Ratio (Baseline/Optimized)** 	|
|--------------	|----------------------------------------------	|-----------------------------------------	|----------------------------------------	|---------------	|----------------------------------------	|
| 1            	| 32768                                        	| (1, 1)                                  	| 0.012578964                            	| 0.003523827   	| **3.569689**                           	|
|              	|                                              	| (1, 2)                                  	| 0.012645721                            	| 0.003550053   	| **3.562122**                           	|
|              	|                                              	| (1, 4)                                  	| 0.01303196                             	| 0.003521442   	| **3.700745**                           	|
|              	|                                              	| (1, 8)                                  	| 0.01275301                             	| 0.003552437   	| **3.589933**                           	|
| 2            	| 16384                                        	| (2, 1)                                  	| 0.008230209                            	| 0.003688335   	| **2.231416**                           	|
|              	|                                              	| (2, 2)                                  	| 0.00821352                             	| 0.003502369   	| **2.345133**                           	|
|              	|                                              	| (2, 4)                                  	| 0.008280277                            	| 0.003442764   	| **2.405125**                           	|
|              	|                                              	| (2, 8)                                  	| 0.0086236                              	| 0.003490448   	| **2.470628**                           	|
| 4            	| 8192                                         	| (4, 1)                                  	| 0.005865097                            	| 0.003454685   	| **1.697723**                           	|
|              	|                                              	| (4, 2)                                  	| 0.005846024                            	| 0.003490448   	| **1.674863**                           	|
|              	|                                              	| (4, 4)                                  	| 0.006036758                            	| 0.0035429     	| **1.703903**                           	|
|              	|                                              	| (4, 8)                                  	| 0.005993843                            	| 0.003669262   	| **1.633528**                           	|
| 8            	| 4096                                         	| (8, 1)                                  	| 0.00469923                             	| 0.003535748   	| **1.329063**                           	|
|              	|                                              	| (8, 2)                                  	| 0.004696846                            	| 0.003600121   	| **1.304636**                           	|
|              	|                                              	| (8, 4)                                  	| 0.005483627                            	| 0.003721714   	| **1.473414**                           	|
|              	|                                              	| (8, 8)                                  	| 0.005180836                            	| 0.00389576    	| **1.329865**                           	|
| 16           	| 2048                                         	| (16, 1)                                 	| 0.00446558                             	| 0.003738403   	| **1.194515**                           	|
|              	|                                              	| (16, 2)                                 	| 0.004258156                            	| 0.00382185    	| **1.114161**                           	|
|              	|                                              	| (16, 4)                                 	| 0.004422665                            	| 0.004007816   	| **1.10351**                            	|
|              	|                                              	| (16, 8)                                 	| 0.004923344                            	| 0.004308224   	| **1.142778**                           	|
| 32           	| 1024                                         	| (32 , 1)                                	| 0.004467964                            	| 0.00402689    	| **1.109532**                           	|
|              	|                                              	| (32, 2)                                 	| 0.004336834                            	| 0.004196167   	| 1.033523                               	|
|              	|                                              	| (32, 4)                                 	| 0.004661083                            	| 0.004513264   	| 1.032752                               	|
|              	|                                              	| (32, 8)                                 	| 0.005385876                            	| 0.005121231   	| **1.051676**                           	|
| 64           	| 512                                          	| (64, 1)                                 	| 0.004725456                            	| 0.00462532    	| 1.021649                               	|
|              	|                                              	| (64, 2)                                 	| 0.005085468                            	| 0.004930496   	| 1.031431                               	|
|              	|                                              	| (64, 4)                                 	| 0.005791187                            	| 0.005600452   	| 1.034057                               	|
|              	|                                              	| (64, 8)                                 	| 0.007030964                            	| 0.006783009   	| 1.036555                               	|
| 128          	| 256                                          	| (128, 1)                                	| 0.005710125                            	| 0.005786419   	| _0.986815_                             	|
|              	|                                              	| (128, 2)                                	| 0.006377697                            	| 0.006473064   	| _0.985267_                             	|
|              	|                                              	| (128, 4)                                	| 0.00754118                             	| 0.007488728   	| 1.007004                               	|
|              	|                                              	| (128, 8)                                	| 0.009772778                            	| 0.009725094   	| 1.004903                               	|
| 256          	| 128                                          	| (256 , 1)                               	| 0.007708073                            	| 0.007715225   	| _0.999073_                             	|
|              	|                                              	| (256, 2)                                	| 0.008938313                            	| 0.009071827   	| _0.985283_                             	|
|              	|                                              	| (256, 4)                                	| 0.011227131                            	| 0.011045933   	| 1.016404                               	|
|              	|                                              	| (256, 8)                                	| 0.016131401                            	| 0.016396046   	| _0.983859_                             	|
| 512          	| 64                                           	| (512, 1)                                	| 0.011544228                            	| 0.011487007   	| 1.004981                               	|
|              	|                                              	| (512, 2)                                	| 0.014071465                            	| 0.014281273   	| _0.985309_                             	|
|              	|                                              	| (512, 4)                                	| 0.019016266                            	| 0.018930435   	| 1.004534                               	|
|              	|                                              	| (512, 8)                                	| 0.028913021                            	| 0.028159618   	| 1.026755                               	|

A bolded speedup ratio indicates a speedup greater than 5%, which we treat as significant. For smaller `dim_size` (1, 2, 4, 8, 16, 32) in particular, we observe significant speedups (greater than 5% better, **bolded**): the smaller the `dim_size`, the larger `BLOCK_SIZE / dim_size / sizeof(scalar_t)` and hence the larger the unnecessary allocation.

For larger `dim_size` values (64, 128, 256, 512), we also observe marginally better (less than 5% better, unbolded) performance.
For some shapes, such as (128, 1), we observe marginally worse (less than 5% worse, _italicized_) performance.
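
To make the relationship concrete, here is a tiny illustrative sketch (pure arithmetic): it assumes a hypothetical fixed scratch buffer of `BLOCK_SIZE` bytes and float32 elements, neither of which is taken from the kernel itself, and prints how the over-allocation factor `BLOCK_SIZE / dim_size / sizeof(scalar_t)` grows as `dim_size` shrinks.

```python
# Illustrative arithmetic only. BLOCK_SIZE and the float32 element size are
# assumptions for this sketch, not values read out of the ATen kernel.
BLOCK_SIZE = 4096      # hypothetical scratch-buffer size in bytes
SIZEOF_SCALAR = 4      # float32

for dim_size in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512):
    factor = BLOCK_SIZE / dim_size / SIZEOF_SCALAR
    print(f"dim_size={dim_size:>3}  over-allocation factor ~ {factor:g}x")
```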

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117029
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-10 15:04:34 +00:00
e0da05e1ba [codemod] markDynamoStrictTest dynamo/* (#117077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117077
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117076
2024-01-10 14:37:52 +00:00
04f788f925 Unflake test_auto_functionalize (#117076)
Also adds better cleanup of the custom op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117076
Approved by: https://github.com/bdhirsh
2024-01-10 14:37:52 +00:00
5046b4981d [ROCm] Add opt-in option for inductor's layout optimisation on ROCm (#116329)
Disabling layout optimisation in inductor for ROCm (https://github.com/pytorch/pytorch/pull/111474) was a bit shortsighted.

If there are workloads that heavily use NHWC, we will see a perf drop from the additional transpose ops. Instead of disabling this entirely on ROCm, it is now an opt-in feature.
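
As a rough illustration, opting in might look like the sketch below; the exact knob name (`torch._inductor.config.layout_optimization`) and setting it before `torch.compile` are assumptions based on this description, not confirmed by the commit itself.

```python
# Hedged sketch: opting in to inductor's layout optimisation on ROCm.
# The config attribute name is an assumption for illustration.
import torch
import torch._inductor.config as inductor_config

inductor_config.layout_optimization = True  # assumed opt-in knob

model = torch.nn.Conv2d(3, 8, kernel_size=3).to("cuda")  # ROCm devices appear as "cuda"
compiled = torch.compile(model)
out = compiled(torch.randn(1, 3, 32, 32, device="cuda"))
```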

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116329
Approved by: https://github.com/jansel, https://github.com/eellison
2024-01-10 13:56:27 +00:00
94db6578cc [Quant] Add dynamic quantization config for x86 inductor backend (#115337)
**Description**
Add dynamic quantization config for x86 inductor backend.
To support the QKV structure in self-attention, we removed an assertion in the port-metadata pass that required a single dequantize node after each quantize node.

**Test plan**
```
python test/test_quantization.py -k TestQuantizePT2EX86Inductor.test_dynamic_quant_linear
python test/test_quantization.py -k TestQuantizePT2EX86Inductor.test_qat_dynamic_quant_linear
```
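
For illustration only, a dynamic-quantization flow on this backend might look roughly like the sketch below. The quantizer and prepare/convert entry points are the existing PT2E APIs, but the `is_dynamic=True` argument and the capture step are assumptions inferred from this description, not a confirmed signature.

```python
# Hedged sketch of PT2E dynamic quantization with the x86 inductor quantizer.
# The is_dynamic=True flag is an assumed parameter name for illustration.
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq

model = torch.nn.Linear(16, 16).eval()
example_inputs = (torch.randn(1, 16),)

captured = capture_pre_autograd_graph(model, example_inputs)
quantizer = xiq.X86InductorQuantizer()
quantizer.set_global(
    xiq.get_default_x86_inductor_quantization_config(is_dynamic=True)  # assumed flag
)
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)          # single calibration pass
quantized = convert_pt2e(prepared)
```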

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115337
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-01-10 11:33:37 +00:00
558cc69641 Fix torch function kwarg dispatch (#117083)
Previously, kwargs were incorrectly dispatched by passing them as the true kwargs to the torch function call. To fix this, the kwargs of the original torch op need to be stored in a dictionary and passed as an argument to the torch function implementation.
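
As a rough sketch of the contract being restored (not the Dynamo code touched by this PR), a `__torch_function__` implementation receives the original op's keyword arguments bundled into a single `kwargs` dict:

```python
# Minimal illustration of torch function kwarg handling: the original op's
# kwargs arrive as one dict, rather than being splatted onto the handler call.
import torch

class LoggingTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"dispatching {getattr(func, '__name__', func)} with kwargs={kwargs}")
        with torch._C.DisableTorchFunctionSubclass():
            return func(*args, **kwargs)

x = torch.randn(3).as_subclass(LoggingTensor)
torch.sum(x, dim=0, keepdim=True)   # dim/keepdim show up in the kwargs dict
```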

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117083
Approved by: https://github.com/drisspg
2024-01-10 10:55:10 +00:00
e88d0648ed Revert "[export] Error grad mode op in export API (#116339)"
This reverts commit 943179852102ac0be27aeae5a2c0272e25ccf90e.

Reverted https://github.com/pytorch/pytorch/pull/116339 on behalf of https://github.com/tugsbayasgalan due to PR below this in the stack broke torchrec/sigmoid tests ([comment](https://github.com/pytorch/pytorch/pull/116339#issuecomment-1884599027))
2024-01-10 10:42:33 +00:00
77ecb3d725 Revert "[export] Exempt autograd ops for predispatch export (#116527)"
This reverts commit af2ded23eb398e14cf380b39d46bfa786d26b3ee.

Reverted https://github.com/pytorch/pytorch/pull/116527 on behalf of https://github.com/tugsbayasgalan due to Need to revert this to revert the bottom diff ([comment](https://github.com/pytorch/pytorch/pull/116527#issuecomment-1884592658))
2024-01-10 10:38:27 +00:00
20f394f10a [LLVM/TensorExpr] Update for an API change in LLVM 18. (#117086)
`registerPassBuilderCallbacks` now takes an extra bool argument to print extra information. It is currently set to false so as not to change functional behaviour.

Relevant LLVM commit:
ffb1f20e0d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117086
Approved by: https://github.com/bertmaher
2024-01-10 09:08:42 +00:00
20f769544c [12/N] Apply clang-tidy and fix warnings in headers of torch/csrc (#116486)
This PR follows #116751.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116486
Approved by: https://github.com/albanD
2024-01-10 08:48:14 +00:00
90df7c008a Migrate state_dict bc test to OptimizerInfo, increase coverage (#116500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116500
Approved by: https://github.com/albanD
2024-01-10 08:19:27 +00:00
8323 changed files with 141368 additions and 60226 deletions

View File

@ -204,7 +204,7 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=5.6
ROCM_VERSION=5.7
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
@ -215,7 +215,7 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=5.7
ROCM_VERSION=6.0
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
@ -277,6 +277,7 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
DOCS=yes
UNINSTALL_DILL=yes
;;
pytorch-linux-jammy-py3-clang12-executorch)
ANACONDA_PYTHON_VERSION=3.10
@ -296,6 +297,15 @@ case "$image" in
CUDA_VERSION=11.8
CONDA_CMAKE=yes
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
;;
*)
# Catch-all for builds that are not hardcoded.
PROTOBUF=yes
@ -349,7 +359,7 @@ if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
fi
# Build image
docker build \
DOCKER_BUILDKIT=1 docker build \
--no-cache \
--progress=plain \
--build-arg "BUILD_ENVIRONMENT=${image}" \
@ -387,6 +397,7 @@ docker build \
--build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \
--build-arg "EXECUTORCH=${EXECUTORCH}" \
--build-arg "BASEKIT_VERSION=${BASEKIT_VERSION}" \
--build-arg "ACL=${ACL:-}" \
-f $(dirname ${DOCKERFILE})/Dockerfile \
-t "$tmp_tag" \
"$@" \

View File

@ -1 +1 @@
663882fe7dc518c04adf3d2ee5ccb7d99f41ade4
e2a8f9548aecb62a68e264607174a7d207ed2929

View File

@ -1 +1 @@
6c26faa159b79a42d7fa46cb66e2d21523351987
243e186efbf7fb93328dd6b34927a4e8c8f24395

View File

@ -1 +1 @@
dafe1459823b9549417ed95e9720f1b594fab329
d08e16b738ab550c3af51305df624d5c823dc445

View File

@ -1 +1 @@
e28a256d71f3cf2bcc7b69d6bda73a9b855e385e
79c6c9b209a5692b9a895398f4f3a033f8f80415

View File

@ -0,0 +1,16 @@
set -euo pipefail
readonly version=v23.08
readonly src_host=https://review.mlplatform.org/ml
readonly src_repo=ComputeLibrary
# Clone ACL
[[ ! -d ${src_repo} ]] && git clone ${src_host}/${src_repo}.git
cd ${src_repo}
git checkout $version
# Build with scons
scons -j8 Werror=0 debug=0 neon=1 opencl=0 embed_kernels=0 \
os=linux arch=armv8a build=native multi_isa=1 \
fixed_format_kernels=1 openmp=1 cppthreads=0

View File

@ -153,7 +153,7 @@ wget https://ossci-linux.s3.amazonaws.com/valgrind-${VALGRIND_VERSION}.tar.bz2
tar -xjf valgrind-${VALGRIND_VERSION}.tar.bz2
cd valgrind-${VALGRIND_VERSION}
./configure --prefix=/usr/local
make -j6
make -j$[$(nproc) - 2]
sudo make install
cd ../../
rm -rf valgrind_build

View File

@ -9,10 +9,19 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
MAJOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 1)
MINOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 2)
if [[ $(uname -m) == "aarch64" ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"
case "$MAJOR_PYTHON_VERSION" in
2)
CONDA_FILE="Miniconda2-latest-Linux-x86_64.sh"
3)
CONDA_FILE="Miniforge3-Linux-aarch64.sh"
;;
*)
echo "Unsupported ANACONDA_PYTHON_VERSION: $ANACONDA_PYTHON_VERSION"
exit 1
;;
esac
else
case "$MAJOR_PYTHON_VERSION" in
3)
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
;;
@ -21,6 +30,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
exit 1
;;
esac
fi
mkdir -p /opt/conda
chown jenkins:jenkins /opt/conda
@ -47,15 +57,39 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Uncomment the below when resolved to track the latest conda update
# as_jenkins conda update -y -n base conda
if [[ $(uname -m) == "aarch64" ]]; then
export SYSROOT_DEP="sysroot_linux-aarch64=2.17"
else
export SYSROOT_DEP="sysroot_linux-64=2.17"
fi
# Install correct Python version
as_jenkins conda create -n py_$ANACONDA_PYTHON_VERSION -y python="$ANACONDA_PYTHON_VERSION"
# Also ensure sysroot is using a modern GLIBC to match system compilers
as_jenkins conda create -n py_$ANACONDA_PYTHON_VERSION -y\
python="$ANACONDA_PYTHON_VERSION" \
${SYSROOT_DEP}
# libstdcxx from conda default channels are too old, we need GLIBCXX_3.4.30
# which is provided in libstdcxx 12 and up.
conda_install libstdcxx-ng=12.3.0 -c conda-forge
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ]; then
conda_install numpy=1.23.5 ${CONDA_COMMON_DEPS}
if [[ $(uname -m) == "aarch64" ]]; then
CONDA_COMMON_DEPS="astunparse pyyaml setuptools openblas==0.3.25=*openmp* ninja==1.11.1 scons==4.5.2"
if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
conda_install numpy=1.24.4 ${CONDA_COMMON_DEPS}
else
conda_install numpy=1.26.2 ${CONDA_COMMON_DEPS}
fi
else
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ]; then
conda_install numpy=1.26.0 ${CONDA_COMMON_DEPS}
else
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
fi
fi
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
@ -89,14 +123,5 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
pip_install -r /opt/conda/requirements-docs.txt
fi
# HACK HACK HACK
# gcc-9 for ubuntu-18.04 from http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu
# Pulls llibstdc++6 13.1.0-8ubuntu1~18.04 which is too new for conda
# So remove libstdc++6.so.3.29 installed by https://anaconda.org/anaconda/libstdcxx-ng/files?version=11.2.0
# Same is true for gcc-12 from Ubuntu-22.04
if grep -e [12][82].04.[623] /etc/issue >/dev/null; then
rm /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/lib/libstdc++.so.6
fi
popd
fi

View File

@ -48,7 +48,6 @@ setup_executorch() {
install_flatc_from_source
pip_install .
build_executorch_runner "cmake"
# Make sure that all the newly generate files are owned by Jenkins
chown -R jenkins .

View File

@ -26,13 +26,14 @@ pip_install \
pytest-cov==4.0.0 \
pytest-subtests==0.10.0 \
tabulate==0.9.0 \
transformers==4.32.1
transformers==4.36.2
pip_install coloredlogs packaging
retry pip_install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ --no-cache-dir --no-input ort-nightly==1.17.0.dev20231005006
pip_install -i https://test.pypi.org/simple/ onnx==1.15.0rc2
pip_install onnxscript==0.1.0.dev20231128 --no-deps
pip_install onnxruntime==1.17.0
pip_install onnx==1.15.0
# pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@3e869ef8ccf19b5ebd21c10d3e9c267c9a9fa729" --no-deps
pip_install onnxscript==0.1.0.dev20240301 --no-deps
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

View File

@ -9,7 +9,8 @@ tar xf "${OPENSSL}.tar.gz"
cd "${OPENSSL}"
./config --prefix=/opt/openssl -d '-Wl,--enable-new-dtags,-rpath,$(LIBRPATH)'
# NOTE: openssl install errors out when built with the -j option
make -j6; make install_sw
NPROC=$[$(nproc) - 2]
make -j${NPROC}; make install_sw
# Link the ssl libraries to the /usr/lib folder.
sudo ln -s /opt/openssl/lib/lib* /usr/lib
cd ..

View File

@ -2,55 +2,17 @@
set -ex
# This function installs protobuf 3.17
install_protobuf_317() {
pb_dir="/usr/temp_pb_install_dir"
mkdir -p $pb_dir
pb_dir="/usr/temp_pb_install_dir"
mkdir -p $pb_dir
# On the nvidia/cuda:9-cudnn7-devel-centos7 image we need this symlink or
# else it will fail with
# g++: error: ./../lib64/crti.o: No such file or directory
ln -s /usr/lib64 "$pb_dir/lib64"
# On the nvidia/cuda:9-cudnn7-devel-centos7 image we need this symlink or
# else it will fail with
# g++: error: ./../lib64/crti.o: No such file or directory
ln -s /usr/lib64 "$pb_dir/lib64"
curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" --retry 3
tar -xvz -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
# -j6 to balance memory usage and speed.
# naked `-j` seems to use too much memory.
pushd "$pb_dir" && ./configure && make -j6 && make -j6 check && sudo make -j6 install && sudo ldconfig
popd
rm -rf $pb_dir
}
install_ubuntu() {
# Ubuntu 14.04 has cmake 2.8.12 as the default option, so we will
# install cmake3 here and use cmake3.
apt-get update
if [[ "$UBUNTU_VERSION" == 14.04 ]]; then
apt-get install -y --no-install-recommends cmake3
fi
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
install_protobuf_317
}
install_centos() {
install_protobuf_317
}
# Install base packages depending on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" --retry 3
tar -xvz -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
NPROC=$[$(nproc) - 2]
pushd "$pb_dir" && ./configure && make -j${NPROC} && make -j${NPROC} check && sudo make -j${NRPOC} install && sudo ldconfig
popd
rm -rf $pb_dir

View File

@ -80,6 +80,14 @@ install_ubuntu() {
fi
fi
# ROCm 6.0 had a regression where journal_mode was enabled on the kdb files resulting in permission errors at runtime
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.0) ]]; then
for kdb in /opt/rocm/share/miopen/db/*.kdb
do
sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"
done
fi
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
@ -151,6 +159,14 @@ install_centos() {
fi
fi
# ROCm 6.0 had a regression where journal_mode was enabled on the kdb files resulting in permission errors at runtime
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.0) ]]; then
for kdb in /opt/rocm/share/miopen/db/*.kdb
do
sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"
done
fi
# Cleanup
yum clean all
rm -rf /var/cache/yum

View File

@ -7,7 +7,7 @@ git clone https://bitbucket.org/icl/magma.git
pushd magma
# Version 2.7.2 + ROCm related updates
git checkout 823531632140d0edcb7e77c3edc0e837421471c5
git checkout a1625ff4d9bc362906bd01f805dbbe12612953f6
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc

View File

@ -64,5 +64,6 @@ if [ -n "${CONDA_CMAKE}" ]; then
# latest numpy version, which fails ASAN tests with the following import error: Numba
# needs NumPy 1.20 or less.
conda_reinstall cmake="${CMAKE_VERSION}"
conda_reinstall numpy="${NUMPY_VERSION}"
# Note that we install numpy with pip as conda might not have the version we want
pip_install --force-reinstall numpy=="${NUMPY_VERSION}"
fi

View File

@ -36,7 +36,12 @@ function install_ucc() {
git submodule update --init --recursive
./autogen.sh
./configure --prefix=$UCC_HOME --with-ucx=$UCX_HOME --with-cuda=$with_cuda
# We only run distributed tests on Tesla M60 and A10G
NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"
./configure --prefix=$UCC_HOME \
--with-ucx=$UCX_HOME \
--with-cuda=$with_cuda \
--with-nvcc-gencode="${NVCC_GENCODE}"
time make -j
sudo make install

View File

@ -15,7 +15,7 @@ click
#Pinned versions:
#test that import:
coremltools==5.0b5
coremltools==5.0b5 ; python_version < "3.12"
#Description: Apple framework for ML integration
#Pinned versions: 5.0b5
#test that import:
@ -25,6 +25,11 @@ coremltools==5.0b5
#Pinned versions:
#test that import:
dill==0.3.7
#Description: dill extends pickle with serializing and de-serializing for most built-ins
#Pinned versions: 0.3.7
#test that import: dynamo/test_replay_record.py test_dataloader.py test_datapipe.py test_serialization.py
expecttest==0.1.6
#Description: method for writing tests where test framework auto populates
# the expected output based on previous runs
@ -47,6 +52,11 @@ junitparser==2.1.1
#Pinned versions: 2.1.1
#test that import:
lark==0.12.0
#Description: parser
#Pinned versions: 0.12.0
#test that import:
librosa>=0.6.2 ; python_version < "3.11"
#Description: A python package for music and audio analysis
#Pinned versions: >=0.6.2
@ -66,7 +76,7 @@ librosa>=0.6.2 ; python_version < "3.11"
#Description: A testing library that allows you to replace parts of your
#system under test with mock objects
#Pinned versions:
#test that import: test_module_init.py, test_modules.py, test_nn.py,
#test that import: test_modules.py, test_nn.py,
#test_testing.py
#MonkeyType # breaks pytorch-xla-linux-bionic-py3.7-clang8
@ -75,10 +85,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.7.0
mypy==1.8.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.7.0
#Pinned versions: 1.8.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -137,9 +147,9 @@ optree==0.9.1
#test_pointwise_ops.py, test_dtensor_ops.py, test_torchinductor.py, test_fx.py,
#test_fake_tensor.py, test_mps.py
pillow==10.0.1
pillow==10.2.0
#Description: Python Imaging Library fork
#Pinned versions: 10.0.1
#Pinned versions: 10.2.0
#test that import:
protobuf==3.20.2
@ -162,11 +172,6 @@ pytest-xdist==3.3.1
#Pinned versions:
#test that import:
pytest-shard==0.1.2
#Description: plugin spliting up tests in pytest
#Pinned versions:
#test that import:
pytest-flakefinder==1.1.0
#Description: plugin for rerunning tests a fixed number of times in pytest
#Pinned versions: 1.1.0
@ -243,7 +248,8 @@ tb-nightly==2.13.0a20230426
#Pinned versions:
#test that import:
#typing-extensions
# needed by torchgen utils
typing-extensions
#Description: type hints for python
#Pinned versions:
#test that import:
@ -258,7 +264,8 @@ unittest-xml-reporting<=3.2.0,>=2.0.0
#Pinned versions:
#test that import:
lintrunner==0.10.7
#wheel not found on aarch64, and source build requires rust
lintrunner==0.10.7 ; platform_machine == "x86_64"
#Description: all about linters!
#Pinned versions: 0.10.7
#test that import:
@ -268,14 +275,14 @@ rockset==1.0.3
#Pinned versions: 1.0.3
#test that import:
ghstack==0.7.1
ghstack==0.8.0
#Description: ghstack tool
#Pinned versions: 0.7.1
#Pinned versions: 0.8.0
#test that import:
jinja2==3.1.2
jinja2==3.1.3
#Description: jinja2 template engine
#Pinned versions: 3.1.2
#Pinned versions: 3.1.3
#test that import:
pytest-cpp==2.3.0
@ -293,7 +300,8 @@ tensorboard==2.13.0
#Pinned versions:
#test that import: test_tensorboard
pywavelets==1.4.1
pywavelets==1.4.1 ; python_version < "3.12"
pywavelets==1.5.0 ; python_version >= "3.12"
#Description: This is a requirement of scikit-image, we need to pin
# it here because 1.5.0 conflicts with numpy 1.21.2 used in CI
#Pinned versions: 1.4.1

View File

@ -1 +1 @@
2.2.0
2.3.0

View File

@ -37,6 +37,7 @@ COPY requirements-ci.txt requirements-docs.txt /opt/conda/
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt
RUN if [ -n "${UNINSTALL_DILL}" ]; then pip uninstall -y dill; fi
# Install gcc
ARG GCC_VERSION
@ -160,6 +161,13 @@ COPY ./common/install_onnx.sh ./common/common_utils.sh ./
RUN if [ -n "${ONNX}" ]; then bash ./install_onnx.sh; fi
RUN rm install_onnx.sh common_utils.sh
# (optional) Build ACL
ARG ACL
COPY ./common/install_acl.sh install_acl.sh
RUN if [ -n "${ACL}" ]; then bash ./install_acl.sh; fi
RUN rm install_acl.sh
ENV INSTALLED_ACL ${ACL}
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH

View File

@ -82,6 +82,19 @@ if ! which conda; then
fi
else
export CMAKE_PREFIX_PATH=/opt/conda
# Workaround required for MKL library linkage
# https://github.com/pytorch/pytorch/issues/119557
if [ "$ANACONDA_PYTHON_VERSION" = "3.12" ]; then
export CMAKE_LIBRARY_PATH="/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/lib/"
export CMAKE_INCLUDE_PATH="/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/include/"
fi
fi
if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
export USE_MKLDNN=1
export USE_MKLDNN_ACL=1
export ACL_ROOT_DIR=/ComputeLibrary
fi
if [[ "$BUILD_ENVIRONMENT" == *libtorch* ]]; then
@ -242,6 +255,11 @@ else
# or building non-XLA tests.
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then
# Install numpy-2.0 release candidate for builds
# Which should be backward compatible with Numpy-1.X
python -mpip install --pre numpy==2.0.0rc1
fi
WERROR=1 python setup.py bdist_wheel
else
python setup.py bdist_wheel

View File

@ -158,6 +158,11 @@ function install_torchvision() {
fi
}
function install_tlparse() {
pip_install --user "tlparse==0.3.5"
PATH="$(python -m site --user-base)/bin:$PATH"
}
function install_torchrec_and_fbgemm() {
local torchrec_commit
torchrec_commit=$(get_pinned_commit torchrec)
@ -173,7 +178,7 @@ function install_torchrec_and_fbgemm() {
function clone_pytorch_xla() {
if [[ ! -d ./xla ]]; then
git clone --recursive --quiet https://github.com/pytorch/xla.git
git clone --recursive -b r2.3 https://github.com/pytorch/xla.git
pushd xla
# pin the xla hash so that we don't get broken by changes to xla
git checkout "$(cat ../.github/ci_commit_pins/xla.txt)"

View File

@ -9,7 +9,7 @@ sysctl -a | grep machdep.cpu
# These are required for both the build job and the test job.
# In the latter to test cpp extensions.
export MACOSX_DEPLOYMENT_TARGET=11.0
export MACOSX_DEPLOYMENT_TARGET=11.1
export CXX=clang++
export CC=clang

View File

@ -149,6 +149,8 @@ test_jit_hooks() {
assert_git_not_dirty
}
install_tlparse
if [[ $NUM_TEST_SHARDS -gt 1 ]]; then
test_python_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then

View File

@ -34,7 +34,6 @@ time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test
# functional collective tests
time python test/run_test.py --verbose -i distributed/test_functional_api
# DTensor tests
time python test/run_test.py --verbose -i distributed/_tensor/test_random_ops
time python test/run_test.py --verbose -i distributed/_tensor/test_dtensor_compile
@ -49,6 +48,7 @@ time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_ex
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx
time python test/run_test.py --verbose -i test_optim -- -k optimizers_with_varying_tensors
time python test/run_test.py --verbose -i test_optim -- -k test_forloop_goes_right_direction_multigpu
time python test/run_test.py --verbose -i test_optim -- -k test_mixed_device_dtype
time python test/run_test.py --verbose -i test_foreach -- -k test_tensors_grouping
assert_git_not_dirty

View File

@ -130,6 +130,8 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* || "$BUILD_ENVIRONMENT" == *rocm* ]]; then
export PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda"
elif [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
export PYTORCH_TESTING_DEVICE_ONLY_FOR="xpu"
# setting PYTHON_TEST_EXTRA_OPTION
export PYTHON_TEST_EXTRA_OPTION="--xpu"
fi
if [[ "$TEST_CONFIG" == *crossref* ]]; then
@ -137,6 +139,8 @@ if [[ "$TEST_CONFIG" == *crossref* ]]; then
fi
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
# regression in ROCm 6.0 on MI50 CI runners due to hipblaslt; remove in 6.1
export VALGRIND=OFF
# Print GPU info
rocminfo
rocminfo | grep -E 'Name:.*\sgfx|Marketing'
@ -159,6 +163,8 @@ if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
export PATH="$HOME/.local/bin:$PATH"
fi
install_tlparse
# DANGER WILL ROBINSON. The LD_PRELOAD here could cause you problems
# if you're not careful. Check this if you made some changes and the
# ASAN test is not working
@ -250,14 +256,14 @@ test_python_shard() {
# Bare --include flag is not supported and quoting for lint ends up with flag not being interpreted correctly
# shellcheck disable=SC2086
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION
assert_git_not_dirty
}
test_python() {
# shellcheck disable=SC2086
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --verbose
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --verbose $PYTHON_TEST_EXTRA_OPTION
assert_git_not_dirty
}
@ -268,34 +274,13 @@ test_dynamo_shard() {
exit 1
fi
python tools/dynamo/verify_dynamo.py
# Temporarily disable test_fx for dynamo pending the investigation on TTS
# regression in https://github.com/pytorch/torchdynamo/issues/784
# PLEASE DO NOT ADD ADDITIONAL EXCLUDES HERE.
# Instead, use @skipIfTorchDynamo on your tests.
time python test/run_test.py --dynamo \
--exclude-inductor-tests \
--exclude-jit-executor \
--exclude-distributed-tests \
--exclude \
test_ao_sparsity \
test_autograd \
test_jit \
test_proxy_tensor \
test_quantization \
test_public_bindings \
test_dataloader \
test_reductions \
test_namedtensor \
test_namedtuple_return_api \
profiler/test_profiler \
profiler/test_profiler_tree \
test_overrides \
test_python_dispatch \
test_fx \
test_package \
test_legacy_vmap \
test_custom_ops \
test_content_store \
export/test_db \
functorch/test_dims \
functorch/test_aotdispatch \
--exclude-torch-export-tests \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
assert_git_not_dirty
@ -307,8 +292,16 @@ test_inductor_distributed() {
pytest test/inductor/test_torchinductor.py -k test_multi_gpu
pytest test/inductor/test_aot_inductor.py -k test_non_default_cuda_device
pytest test/inductor/test_aot_inductor.py -k test_replicate_on_devices
pytest test/distributed/test_c10d_functional_native.py
pytest test/distributed/_tensor/test_dtensor_compile.py
pytest test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
pytest test/distributed/_composable/fsdp/test_fully_shard_comm.py
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_mlp
pytest test/distributed/_composable/fsdp/test_fully_shard_frozen.py
pytest test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype
pytest test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype
# this runs on both single-gpu and multi-gpu instance. It should be smart about skipping tests that aren't supported
# with if required # gpus aren't available
@ -330,6 +323,14 @@ test_inductor() {
fi
}
test_inductor_cpp_wrapper_abi_compatible() {
export TORCHINDUCTOR_ABI_COMPATIBLE=1
echo "Testing Inductor cpp wrapper mode with TORCHINDUCTOR_ABI_COMPATIBLE=1"
# cpu stack allocation causes segfault and needs more investigation
TORCHINDUCTOR_STACK_ALLOCATION=0 python test/run_test.py --include inductor/test_cpu_cpp_wrapper
python test/run_test.py --include inductor/test_cuda_cpp_wrapper
}
# "Global" flags for inductor benchmarking controlled by TEST_CONFIG
# For example 'dynamic_aot_eager_torchbench' TEST_CONFIG means we run
# the benchmark script with '--dynamic-shapes --backend aot_eager --device cuda'
@ -422,7 +423,7 @@ test_perf_for_dashboard() {
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *aotinductor-true* ]] && [[ "$mode" == "inference" ]]; then
python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_ABI_COMPATIBLE=1 python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
@ -466,6 +467,11 @@ test_single_dynamo_benchmark() {
test_perf_for_dashboard "$suite" \
"${DYNAMO_BENCHMARK_FLAGS[@]}" "$@" "${partition_flags[@]}"
else
if [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
# Test AOTInductor with the ABI-compatible mode on CI
# This can be removed once the ABI-compatible mode becomes default.
export TORCHINDUCTOR_ABI_COMPATIBLE=1
fi
python "benchmarks/dynamo/$suite.py" \
--ci --accuracy --timing --explain \
"${DYNAMO_BENCHMARK_FLAGS[@]}" \
@ -522,7 +528,7 @@ test_inductor_torchbench_smoketest_perf() {
# The threshold value needs to be actively maintained to make this check useful
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_training_smoketest.csv" -t 1.4
python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference \
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference \
--export-aot-inductor --only nanogpt --output "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
# The perf number of nanogpt seems not very stable, e.g.
@ -543,6 +549,50 @@ test_inductor_torchbench_smoketest_perf() {
done
}
test_inductor_torchbench_cpu_smoketest_perf(){
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
#set jemalloc
JEMALLOC_LIB="/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"
export LD_PRELOAD="$JEMALLOC_LIB":"$IOMP_LIB":"$LD_PRELOAD"
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
end_core=$(( CORES-1 ))
MODELS_SPEEDUP_TARGET=benchmarks/dynamo/expected_ci_speedup_inductor_torchbench_cpu.csv
grep -v '^ *#' < "$MODELS_SPEEDUP_TARGET" | while IFS=',' read -r -a model_cfg
do
local model_name=${model_cfg[0]}
local data_type=${model_cfg[1]}
local speedup_target=${model_cfg[4]}
if [[ ${model_cfg[3]} == "cpp" ]]; then
export TORCHINDUCTOR_CPP_WRAPPER=1
else
unset TORCHINDUCTOR_CPP_WRAPPER
fi
local output_name="$TEST_REPORTS_DIR/inductor_inference_${model_cfg[0]}_${model_cfg[1]}_${model_cfg[2]}_${model_cfg[3]}_cpu_smoketest.csv"
if [[ ${model_cfg[2]} == "dynamic" ]]; then
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" --dynamic-shapes \
--dynamic-batch-only --freezing --timeout 9000 --backend=inductor --output "$output_name"
else
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" \
--freezing --timeout 9000 --backend=inductor --output "$output_name"
fi
cat "$output_name"
# The threshold value needs to be actively maintained to make this check useful.
python benchmarks/dynamo/check_perf_csv.py -f "$output_name" -t "$speedup_target"
done
}
test_python_gloo_with_tls() {
source "$(dirname "${BASH_SOURCE[0]}")/run_glootls_test.sh"
assert_git_not_dirty
@ -693,9 +743,8 @@ test_xpu_bin(){
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
for xpu_case in "${BUILD_BIN_DIR}"/*{xpu,sycl}*
do
if [[ "$xpu_case" != *"*"* ]]; then
for xpu_case in "${BUILD_BIN_DIR}"/*{xpu,sycl}*; do
if [[ "$xpu_case" != *"*"* && "$xpu_case" != *.so && "$xpu_case" != *.a ]]; then
case_name=$(basename "$xpu_case")
echo "Testing ${case_name} ..."
"$xpu_case" --gtest_output=xml:"$TEST_REPORTS_DIR"/"$case_name".xml
@ -943,7 +992,8 @@ test_bazel() {
tools/bazel test --config=cpu-only --test_timeout=480 --test_output=all --test_tag_filters=-gpu-required --test_filter=-*CUDA :all_tests
else
tools/bazel test --test_output=errors \
# Increase the test timeout to 480 like CPU tests because modules_test frequently timeout
tools/bazel test --test_timeout=480 --test_output=errors \
//:any_test \
//:autograd_test \
//:dataloader_test \
@ -1038,14 +1088,17 @@ test_docs_test() {
}
test_executorch() {
echo "Install torchvision and torchaudio"
install_torchvision
install_torchaudio
pushd /executorch
echo "Install torchvision and torchaudio"
# TODO(huydhn): Switch this to the pinned commits on ExecuTorch once they are
# there. These libraries need to be built here, and not part of the Docker
# image because they require the target version of torch to be installed first
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/audio.git"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/vision.git"
# NB: We need to build ExecuTorch runner here and not inside the Docker image
# because it depends on PyTorch
# shellcheck disable=SC1091
source .ci/scripts/utils.sh
build_executorch_runner "cmake"
echo "Run ExecuTorch regression tests for some models"
# NB: This is a sample model, more can be added here
@ -1114,6 +1167,11 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then
checkout_install_torchbench hf_Bert hf_Albert nanogpt timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_gcn \
llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \
shufflenet_v2_x1_0 hf_GPT2
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf
else
checkout_install_torchbench
# Do this after checkout_install_torchbench to ensure we clobber any
@ -1123,6 +1181,9 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
fi
PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"
fi
elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper_abi_compatible* ]]; then
install_torchvision
test_inductor_cpp_wrapper_abi_compatible
elif [[ "${TEST_CONFIG}" == *inductor* && "${SHARD_NUMBER}" == 1 ]]; then
install_torchvision
test_inductor

View File

@ -16,11 +16,6 @@ set PATH=C:\Program Files\CMake\bin;C:\Program Files\7-Zip;C:\ProgramData\chocol
set INSTALLER_DIR=%SCRIPT_HELPERS_DIR%\installation-helpers
call %INSTALLER_DIR%\install_mkl.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
call %INSTALLER_DIR%\install_magma.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
@ -35,6 +30,10 @@ call %INSTALLER_DIR%\activate_miniconda3.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
call pip install mkl-include==2021.4.0 mkl-devel==2021.4.0
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
:: Override VS env here
pushd .
if "%VC_VERSION%" == "" (
@ -89,8 +88,8 @@ set SCCACHE_IGNORE_SERVER_IO_ERROR=1
sccache --stop-server
sccache --start-server
sccache --zero-stats
set CC=sccache-cl
set CXX=sccache-cl
set CMAKE_C_COMPILER_LAUNCHER=sccache
set CMAKE_CXX_COMPILER_LAUNCHER=sccache
set CMAKE_GENERATOR=Ninja

View File

@ -1,14 +0,0 @@
if "%REBUILD%"=="" (
if "%BUILD_ENVIRONMENT%"=="" (
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/mkl_2020.2.254.7z --output %TMP_DIR_WIN%\mkl.7z
) else (
aws s3 cp s3://ossci-windows/mkl_2020.2.254.7z %TMP_DIR_WIN%\mkl.7z --quiet
)
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
7z x -aoa %TMP_DIR_WIN%\mkl.7z -o%TMP_DIR_WIN%\mkl
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
)
set CMAKE_INCLUDE_PATH=%TMP_DIR_WIN%\mkl\include
set LIB=%TMP_DIR_WIN%\mkl\lib;%LIB%

View File

@ -1,18 +1,13 @@
mkdir %TMP_DIR_WIN%\bin
if "%REBUILD%"=="" (
:check_sccache
%TMP_DIR_WIN%\bin\sccache.exe --show-stats || (
IF EXIST %TMP_DIR_WIN%\bin\sccache.exe (
taskkill /im sccache.exe /f /t || ver > nul
del %TMP_DIR_WIN%\bin\sccache.exe || ver > nul
del %TMP_DIR_WIN%\bin\sccache-cl.exe || ver > nul
if "%BUILD_ENVIRONMENT%"=="" (
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/sccache.exe --output %TMP_DIR_WIN%\bin\sccache.exe
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/sccache-cl.exe --output %TMP_DIR_WIN%\bin\sccache-cl.exe
) else (
aws s3 cp s3://ossci-windows/sccache.exe %TMP_DIR_WIN%\bin\sccache.exe
aws s3 cp s3://ossci-windows/sccache-cl.exe %TMP_DIR_WIN%\bin\sccache-cl.exe
)
goto :check_sccache
)
)
if "%BUILD_ENVIRONMENT%"=="" (
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/sccache-v0.7.4.exe --output %TMP_DIR_WIN%\bin\sccache.exe
) else (
aws s3 cp s3://ossci-windows/sccache-v0.7.4.exe %TMP_DIR_WIN%\bin\sccache.exe
)
)

View File

@ -1,468 +1,4 @@
Warning
=======
Contents may be out of date. Our CircleCI workflows are gradually being migrated to Github actions.
Structure of CI
===============
setup job:
1. Does a git checkout
2. Persists CircleCI scripts (everything in `.circleci`) into a workspace. Why?
We don't always do a Git checkout on all subjobs, but we usually
still want to be able to call scripts one way or another in a subjob.
Persisting files this way lets us have access to them without doing a
checkout. This workspace is conventionally mounted on `~/workspace`
(this is distinguished from `~/project`, which is the conventional
working directory that CircleCI will default to starting your jobs
in.)
3. Write out the commit message to `.circleci/COMMIT_MSG`. This is so
we can determine in subjobs if we should actually run the jobs or
not, even if there isn't a Git checkout.
CircleCI configuration generator
================================
One may no longer make changes to the `.circleci/config.yml` file directly.
Instead, one must edit these Python scripts or files in the `verbatim-sources/` directory.
Usage
----------
1. Make changes to these scripts.
2. Run the `regenerate.sh` script in this directory and commit the script changes and the resulting change to `config.yml`.
You'll see a build failure on GitHub if the scripts don't agree with the checked-in version.
Motivation
----------
These scripts establish a single, authoritative source of documentation for the CircleCI configuration matrix.
The documentation, in the form of diagrams, is automatically generated and cannot drift out of sync with the YAML content.
Furthermore, consistency is enforced within the YAML config itself, by using a single source of data to generate
multiple parts of the file.
* Facilitates one-off culling/enabling of CI configs for testing PRs on special targets
Also see https://github.com/pytorch/pytorch/issues/17038
Future direction
----------------
### Declaring sparse config subsets
See comment [here](https://github.com/pytorch/pytorch/pull/17323#pullrequestreview-206945747):
In contrast with a full recursive tree traversal of configuration dimensions,
> in the future I think we actually want to decrease our matrix somewhat and have only a few mostly-orthogonal builds that taste as many different features as possible on PRs, plus a more complete suite on every PR and maybe an almost full suite nightly/weekly (we don't have this yet). Specifying PR jobs in the future might be easier to read with an explicit list when we come to this.
----------------
----------------
# How do the binaries / nightlies / releases work?
### What is a binary?
A binary or package (used interchangeably) is a pre-built collection of c++ libraries, header files, python bits, and other files. We build these and distribute them so that users do not need to install from source.
A **binary configuration** is a collection of
* release or nightly
* releases are stable, nightlies are beta and built every night
* python version
* linux: 3.7m (the `m` and `mu` suffixes are CPython ABI tags; `u` marks a wide-unicode build. It usually doesn't matter, but you should know that they exist)
* macos: 3.7, 3.8
* windows: 3.7, 3.8
* cpu version
* cpu, cuda 9.0, cuda 10.0
* The supported cuda versions occasionally change
* operating system
* Linux - these are all built on CentOS. There haven't been any problems in the past building on CentOS and using on Ubuntu
* MacOS
* Windows - these are built on Azure pipelines
* devtoolset version (gcc compiler version)
* This only matters on Linux because only Linux uses gcc. The tl;dr is that gcc made a backwards-incompatible ABI change from gcc 4.8 to gcc 5, because it had to change how it implemented std::vector and std::string
### Where are the binaries?
The binaries are built in CircleCI. There are nightly binaries built every night at 9pm PST (midnight EST) and release binaries corresponding to Pytorch releases, usually every few months.
We have 3 types of binary packages
* pip packages - nightlies are stored on s3 (pip install -f \<a s3 url\>). releases are stored in a pip repo (pip install torch) (ask Soumith about this)
* conda packages - nightlies and releases are both stored in a conda repo. Nightly packages have a '_nightly' suffix
* libtorch packages - these are zips of all the c++ libraries, header files, and sometimes dependencies. These are c++ only
* shared with dependencies (the only supported option for Windows)
* static with dependencies
* shared without dependencies
* static without dependencies
All binaries are built in CircleCI workflows except Windows. There are checked-in workflows (committed into the .circleci/config.yml) to build the nightlies every night. Releases are built by manually pushing a PR that builds the suite of release binaries (overwrite the config.yml to build the release)
# CircleCI structure of the binaries
Some quick vocab:
* A \**workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/main/.circleci/config.yml to see the workflows.
* **jobs** are a sequence of '**steps**'
* **steps** are usually just a bash script or a builtin CircleCI command. *All steps run in new environments, environment variables declared in one script DO NOT persist to following steps*
* CircleCI has a **workspace**, which is essentially a cache between steps of the *same job* in which you can store artifacts between steps.
## How are the workflows structured?
The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build, test, and upload) per binary configuration
1. binary_builds
1. every day midnight EST
2. linux: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/linux-binary-build-defaults.yml
3. macos: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/macos-binary-build-defaults.yml
4. For each binary configuration, e.g. linux_conda_3.7_cpu there is a
1. binary_linux_conda_3.7_cpu_build
1. Builds the package. On linux jobs this uses the 'docker executor'.
2. Persists the package to the workspace
2. binary_linux_conda_3.7_cpu_test
1. Loads the package to the workspace
2. Spins up a docker image (on Linux), mapping the package and code repos into the docker
3. Runs some smoke tests in the docker
4. (Actually, for macos this is a step rather than a separate job)
3. binary_linux_conda_3.7_cpu_upload
1. Logs in to aws/conda
2. Uploads the package
2. update_s3_htmls
1. every day 5am EST
2. https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/binary_update_htmls.yml
3. See below for what these are for and why they're needed
4. Three jobs that each examine the current contents of aws and the conda repo and update some html files in s3
3. binarysmoketests
1. every day
2. https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-build-smoke-tests-defaults.yml
3. For each binary configuration, e.g. linux_conda_3.7_cpu there is a
1. smoke_linux_conda_3.7_cpu
1. Downloads the package from the cloud, e.g. using the official pip or conda instructions
2. Runs the smoke tests
## How are the jobs structured?
The jobs are in https://github.com/pytorch/pytorch/tree/main/.circleci/verbatim-sources. Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/main/.circleci/scripts .
* Linux jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/linux-binary-build-defaults.yml
* binary_linux_build.sh
* binary_linux_test.sh
* binary_linux_upload.sh
* MacOS jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/macos-binary-build-defaults.yml
* binary_macos_build.sh
* binary_macos_test.sh
* binary_macos_upload.sh
* Update html jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/binary_update_htmls.yml
* These delegate from the pytorch/builder repo
* https://github.com/pytorch/builder/blob/main/cron/update_s3_htmls.sh
* https://github.com/pytorch/builder/blob/main/cron/upload_binary_sizes.sh
* Smoke jobs (both linux and macos): https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-build-smoke-tests-defaults.yml
* These delegate from the pytorch/builder repo
* https://github.com/pytorch/builder/blob/main/run_tests.sh
* https://github.com/pytorch/builder/blob/main/smoke_test.sh
* https://github.com/pytorch/builder/blob/main/check_binary.sh
* Common shared code (shared across linux and macos): https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-binary-build-defaults.yml
* binary_checkout.sh - checks out pytorch/builder repo. Right now this also checks out pytorch/pytorch, but it shouldn't. pytorch/pytorch should just be shared through the workspace. This can handle being run before binary_populate_env.sh
* binary_populate_env.sh - parses BUILD_ENVIRONMENT into the separate env variables that make up a binary configuration. Also sets lots of default values, the date, the version strings, the location of folders in s3, all sorts of things. This generally has to be run before other steps.
* binary_install_miniconda.sh - Installs miniconda, cross platform. Also hacks this for the update_binary_sizes job that doesn't have the right env variables
* binary_run_in_docker.sh - Takes a bash script file (the actual test code) from a hardcoded location, spins up a docker image, and runs the script inside the docker image
### **Why do the steps all refer to scripts?**
CircleCI creates a final yaml file by inlining every <<* segment, so if we were to keep all the code in the config.yml itself then the config size would go over 4 MB and cause infra problems.
### **What is binary_run_in_docker for?**
So, CircleCI has several executor types: macos, machine, and docker are the ones we use. The 'machine' executor gives you two cores on some linux vm. The 'docker' executor gives you considerably more cores (nproc was 32 instead of 2 back when I tried in February). Since the dockers are faster, we try to run everything that we can in dockers. Thus
* linux build jobs use the docker executor. Running them on the docker executor was at least 2x faster than running them on the machine executor
* linux test jobs use the machine executor in order for them to properly interface with GPUs since docker executors cannot execute with attached GPUs
* linux upload jobs use the machine executor. The upload jobs are so short that it doesn't really matter what they use
* linux smoke test jobs use the machine executor for the same reason as the linux test jobs
binary_run_in_docker.sh is a way to share the docker start-up code between the binary test jobs and the binary smoke test jobs
### **Why does binary_checkout also checkout pytorch? Why shouldn't it?**
We want all the nightly binary jobs to run on the exact same git commit, so we wrote our own checkout logic to ensure that the same commit was always picked. Later circleci changed that to use a single pytorch checkout and persist it through the workspace (they did this because our config file was too big, so they wanted to take a lot of the setup code into scripts, but the scripts needed the code repo to exist to be called, so they added a prereq step called 'setup' to checkout the code and persist the needed scripts to the workspace). The changes to the binary jobs were not properly tested, so they all broke from missing pytorch code no longer existing. We hotfixed the problem by adding the pytorch checkout back to binary_checkout, so now there's two checkouts of pytorch on the binary jobs. This problem still needs to be fixed, but it takes careful tracing of which code is being called where.
# Code structure of the binaries (circleci agnostic)
## Overview
The code that runs the binaries lives in two places, in the normal [github.com/pytorch/pytorch](http://github.com/pytorch/pytorch), but also in [github.com/pytorch/builder](http://github.com/pytorch/builder), which is a repo that defines how all the binaries are built. The relevant code is
```
# All code needed to set-up environments for build code to run in,
# but only code that is specific to the current CI system
pytorch/pytorch
- .circleci/ # Folder that holds all circleci related stuff
- config.yml # GENERATED file that actually controls all circleci behavior
- verbatim-sources # Used to generate job/workflow sections in ^
- scripts/ # Code needed to prepare circleci environments for binary build scripts
- setup.py # Builds pytorch. This is wrapped in pytorch/builder
- cmake files # used in normal building of pytorch
# All code needed to prepare a binary build, given an environment
# with all the right variables/packages/paths.
pytorch/builder
# Given an installed binary and a proper python env, runs some checks
# to make sure the binary was built the proper way. Checks things like
# the library dependencies, symbols present, etc.
- check_binary.sh
# Given an installed binary, runs python tests to make sure everything
# is in order. These should be de-duped. Right now they both run smoke
# tests, but are called from different places. Usually just call some
# import statements, but also has overlap with check_binary.sh above
- run_tests.sh
- smoke_test.sh
# Folders that govern how packages are built. See paragraphs below
- conda/
- build_pytorch.sh # Entrypoint. Delegates to proper conda build folder
- switch_cuda_version.sh # Switches activate CUDA installation in Docker
- pytorch-nightly/ # Build-folder
- manywheel/
- build_cpu.sh # Entrypoint for cpu builds
- build.sh # Entrypoint for CUDA builds
- build_common.sh # Actual build script that ^^ call into
- wheel/
- build_wheel.sh # Entrypoint for wheel builds
- windows/
- build_pytorch.bat # Entrypoint for wheel builds on Windows
```
Every type of package has an entrypoint build script that handles all the important logic.
## Conda
Linux, MacOS and Windows use the same code flow for the conda builds.
Conda packages are built with conda-build, see https://conda.io/projects/conda-build/en/latest/resources/commands/conda-build.html
Basically, you pass `conda build` a build folder (pytorch-nightly/ above) that contains a build script and a meta.yaml. The meta.yaml specifies in what python environment to build the package in, and what dependencies the resulting package should have, and the build script gets called in the env to build the thing.
tl;dr on conda-build is
1. Creates a brand new conda environment, based off of deps in the meta.yaml
1. Note that environment variables do not get passed into this build env unless they are specified in the meta.yaml
2. If the build fails this environment will stick around. You can activate it for much easier debugging. The “General Python” section below explains what exactly a python “environment” is.
2. Calls build.sh in the environment
3. Copies the finished package to a new conda env, also specified by the meta.yaml
4. Runs some simple import tests (if specified in the meta.yaml)
5. Saves the finished package as a tarball
The build.sh we use is essentially a wrapper around `python setup.py build`, but it also manually copies in some of our dependent libraries into the resulting tarball and messes with some rpaths.
The entrypoint file `builder/conda/build_conda.sh` is complicated because
* It works for Linux, MacOS and Windows
* The mac builds used to create their own environments, since they all used to be on the same machine. There's now a lot of extra logic to handle conda envs. This extra machinery could be removed
* It used to handle testing too, which adds more logic messing with python environments too. This extra machinery could be removed.
## Manywheels (linux pip and libtorch packages)
Manywheels are pip packages for linux distros. Note that these manywheels are not actually manylinux compliant.
`builder/manywheel/build_cpu.sh` and `builder/manywheel/build.sh` (for CUDA builds) just set different env vars and then call into `builder/manywheel/build_common.sh`
The entrypoint file `builder/manywheel/build_common.sh` is really really complicated because
* This used to handle building for several different python versions at the same time. The loops have been removed, but there are still unnecessary folders and movements here and there.
* The script is never used this way anymore. This extra machinery could be removed.
* This used to handle testing the pip packages too. This is why there's testing code at the end that messes with python installations and stuff
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file.
* There is a lot of messing with rpaths. This is necessary, but could be made much much simpler if the above issues were fixed.
## Wheels (MacOS pip and libtorch packages)
The entrypoint file `builder/wheel/build_wheel.sh` is complicated because
* The mac builds used to all run on one machine (we didn't have autoscaling mac machines until circleci). So this script handled siloing itself by setting up and tearing down its own build env and working in its own build directory.
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* Ditto the comment above. This should definitely be separated out.
Note that the MacOS Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda.
## Windows Wheels (Windows pip and libtorch packages)
The entrypoint file `builder/windows/build_pytorch.bat` is complicated because
* This used to handle building for several different python versions at the same time. This is why there are loops everywhere
* The script is never used this way anymore. This extra machinery could be removed.
* This used to handle testing the pip packages too. This is why there's testing code at the end that messes with python installations and stuff
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file.
Note that the Windows Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda.
## General notes
### Note on run_tests.sh, smoke_test.sh, and check_binary.sh
* These should all be consolidated
* These must run on all OS types: MacOS, Linux, and Windows
* These all run smoke tests at the moment. They inspect the packages some, maybe run a few import statements. They DO NOT run the python tests or the cpp tests. The idea is that python tests on main and PR merges will catch all breakages. All these tests have to do is make sure the special binary machinery didn't mess anything up.
* There are separate run_tests.sh and smoke_test.sh because one used to be called by the smoke jobs and one used to be called by the binary test jobs (see circleci structure section above). This is still true actually, but these could be united into a single script that runs these checks, given an installed pytorch package.
### Note on libtorch
Libtorch packages are built in the wheel build scripts: manywheel/build_*.sh for linux and build_wheel.sh for mac. There are several things wrong with this
* It's confusing. Most of those scripts deal with python specifics.
* The extra conditionals everywhere severely complicate the wheel build scripts
* The process for building libtorch is different from the official instructions (a plain call to cmake, or a call to a script)
### Note on docker images / Dockerfiles
All linux builds occur in docker images. The docker images are
* pytorch/conda-cuda
* Has ALL CUDA versions installed. The script pytorch/builder/conda/switch_cuda_version.sh sets /usr/local/cuda to a symlink to e.g. /usr/local/cuda-10.0 to enable different CUDA builds
* Also used for cpu builds
* pytorch/manylinux-cuda90
* pytorch/manylinux-cuda100
* Also used for cpu builds
The Dockerfiles are available in pytorch/builder, but there is no circleci job or script to build these docker images, and they cannot be run locally (unless you have the correct local packages/paths). Only Soumith can build them right now.
### General Python
* This is still a good explanation of python installations https://caffe2.ai/docs/faq.html#why-do-i-get-import-errors-in-python-when-i-try-to-use-caffe2
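A quick, hedged way to diagnose the kind of import confusion that FAQ covers is to check which interpreter and which torch actually get picked up:
```sh
# Quick checks for "which Python / which torch am I actually using?"
which -a python
python -c "import sys; print(sys.executable)"
python -c "import torch; print(torch.__file__, torch.__version__)"
```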
# How to manually rebuild the binaries
tl;dr make a PR that looks like https://github.com/pytorch/pytorch/pull/21159
Sometimes we want to push a change to main and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force circleci to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/main/.circleci/scripts/binary_checkout.sh#L42-L45 to checkout what you want.
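If you do need the builder repo at a particular commit, the edit to binary_checkout.sh would look roughly like the following (hypothetical sketch; the actual lines in that script may differ):
```sh
# Hypothetical edit to .circleci/scripts/binary_checkout.sh to pin the builder repo
retry git clone -q https://github.com/pytorch/builder.git "$BUILDER_ROOT"
pushd "$BUILDER_ROOT"
git checkout <builder-commit-sha-you-want-to-test>   # placeholder, fill in the commit you need
popd
```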
## How to test changes to the binaries via .circleci
Writing PRs that test the binaries is annoying, since the default circleci jobs that run on PRs are not the jobs that you want to run. Likely, changes to the binaries will touch something under .circleci/ and require that .circleci/config.yml be regenerated (.circleci/config.yml controls all .circleci behavior, and is generated using `.circleci/regenerate.sh` in python 3.7). But you also need to manually hardcode the binary jobs that you want to test into the .circleci/config.yml workflow, so you should actually make at least two commits, one for your changes and one to temporarily hardcode jobs. See https://github.com/pytorch/pytorch/pull/22928 as an example of how to do this.
```sh
# Make your changes
touch .circleci/verbatim-sources/nightly-binary-build-defaults.yml
# Regenerate the yaml, has to be in python 3.7
.circleci/regenerate.sh
# Make a commit
git add .circleci *
git commit -m "My real changes"
git push origin my_branch
# Now hardcode the jobs that you want in the .circleci/config.yml workflows section
# Also eliminate ensure-consistency and should_run_job checks
# e.g. https://github.com/pytorch/pytorch/commit/2b3344bfed8772fe86e5210cc4ee915dee42b32d
# Make a commit you won't keep
git add .circleci
git commit -m "[DO NOT LAND] testing binaries for above changes"
git push origin my_branch
# Now you need to make some changes to the first commit.
git rebase -i HEAD~2 # mark the first commit as 'edit'
# Make the changes
touch .circleci/verbatim-sources/nightly-binary-build-defaults.yml
.circleci/regenerate.sh
# Amend the commit and continue the rebase
git add .circleci
git commit --amend
git rebase --continue
# Update the PR, need to force since the commits are different now
git push origin my_branch --force
```
The advantage of this flow is that you can make new changes to the base commit and regenerate the .circleci without having to re-write which binary jobs you want to test on. The downside is that all updates will be force pushes.
## How to build a binary locally
### Linux
You can easily build Linux binaries locally using docker.
```sh
# Run the docker
# Use the correct docker image, pytorch/conda-cuda used here as an example
#
# -v path/to/foo:path/to/bar makes path/to/foo on your local machine (the
# machine that you're running the command on) accessible to the docker
# container at path/to/bar. So if you then run `touch path/to/bar/baz`
# in the docker container then you will see path/to/foo/baz on your local
# machine. You could also clone the pytorch and builder repos in the docker.
#
# If you know how, add ccache as a volume too and speed up everything
docker run \
-v your/pytorch/repo:/pytorch \
-v your/builder/repo:/builder \
-v where/you/want/packages/to/appear:/final_pkgs \
-it pytorch/conda-cuda /bin/bash
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.7
export DESIRED_CUDA=cpu
# Call the entrypoint
# `|& tee foo.log` just copies all stdout and stderr output to foo.log
# The builds generate lots of output so you probably need this when
# building locally.
/builder/conda/build_pytorch.sh |& tee build_output.log
```
**Building CUDA binaries on docker**
You can build CUDA binaries on CPU-only machines, but you can only run CUDA binaries on machines with a CUDA GPU. This means that you can build a CUDA binary in a docker container on your laptop if you so choose (though it's going to take a long time); see the sketch below.
For Facebook employees, ask about beefy machines that have docker support and use those instead of your laptop; it will be 5x as fast.
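As a hedged example, the flow mirrors the CPU example above, just with a CUDA image and a CUDA value for DESIRED_CUDA; the image name, entrypoint, and version string below are assumptions, so check the manywheel scripts and the available images before relying on them:
```sh
# Sketch of a CUDA manywheel build in docker (illustrative values, not a verified recipe)
docker run \
-v your/pytorch/repo:/pytorch \
-v your/builder/repo:/builder \
-v where/you/want/packages/to/appear:/final_pkgs \
-it pytorch/manylinux-cuda100 /bin/bash
# Then, inside the container:
export PACKAGE_TYPE=manywheel
export DESIRED_PYTHON=3.7
export DESIRED_CUDA=cu100   # must correspond to a toolkit present in the image
/builder/manywheel/build.sh |& tee build_output.log   # entrypoint name is an assumption
```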
### MacOS
There's no easy way to generate reproducible, hermetic MacOS environments. If you have a Mac laptop then you can try emulating the .circleci environments as much as possible, but you probably have packages in /usr/local/, possibly installed by brew, that will interfere with the build. If you're trying to repro an error from a Mac build in .circleci and you can't seem to repro locally, then my best advice is actually to iterate on .circleci :/
But if you want to try, then I'd recommend:
```sh
# Create a new terminal
# Clear your LD_LIBRARY_PATH and trim as much out of your PATH as you
# know how to do
# Install a new miniconda
# First remove any other python or conda installation from your PATH
# Always install miniconda 3, even if building for Python <3
new_conda="$HOME/my_new_conda"
conda_sh="$HOME/install_miniconda.sh"
curl -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
chmod +x "$conda_sh"
"$conda_sh" -b -p "$new_conda"
rm -f "$conda_sh"
export PATH="$new_conda/bin:$PATH"
# Create a clean python env
# All MacOS builds use conda to manage the python env and dependencies
# that are built with, even the pip packages
conda create -yn binary python=3.7
source "$new_conda/etc/profile.d/conda.sh"
conda activate binary
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.7
export DESIRED_CUDA=cpu
# Call the entrypoint you want
path/to/builder/wheel/build_wheel.sh
```
N.B. installing a brand-new miniconda is important. This has to do with how conda installations work. See the “General Python” section above, but the tl;dr is that
1. You make the conda command accessible by prepending `path/to/conda_root/bin` to your PATH.
2. You make a new env and activate it, which then also gets prepended to your PATH. Now you have `path/to/conda_root/envs/new_env/bin:path/to/conda_root/bin:$PATH`
3. Now say you (or some code that you ran) call python executable `foo`
1. if you installed `foo` in `new_env`, then `path/to/conda_root/envs/new_env/bin/foo` will get called, as expected.
    2. But if you forgot to install `foo` in `new_env` but happened to previously install it in your root conda env (called base), then Unix/Linux will still find `path/to/conda_root/bin/foo`. This is dangerous, since `foo` can be a different version than you want; `foo` can even be for an incompatible Python version!
Newer conda versions and proper Python hygiene can prevent this, but just install a new miniconda to be safe; the ordering problem is illustrated below.
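To make the PATH-ordering point concrete, a small illustration (paths are placeholders):
```sh
# Illustration of the PATH ordering described above (placeholder paths)
echo "$PATH"
# -> path/to/conda_root/envs/new_env/bin:path/to/conda_root/bin:/usr/bin:...
which python   # resolves inside new_env, as expected
which foo      # if foo was never installed into new_env but lives in the base env,
               # this still resolves to path/to/conda_root/bin/foo -- possibly the wrong version
```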
### Windows
TODO: fill in
The PyTorch migration from CircleCI to GitHub Actions has been completed. All continuous integration & deployment workflows are defined in the `.github/workflows` folder

View File

@ -1,69 +0,0 @@
#!/bin/bash
set -eux -o pipefail
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# This step runs on multiple executors with different envfile locations
if [[ "$(uname)" == Darwin ]]; then
# macos executor (builds and tests)
workdir="/Users/distiller/project"
elif [[ "$OSTYPE" == "msys" ]]; then
# windows executor (builds and tests)
rm -rf /c/w
ln -s "/c/Users/circleci/project" /c/w
workdir="/c/w"
elif [[ -d "/home/circleci/project" ]]; then
# machine executor (binary tests)
workdir="/home/circleci/project"
else
# docker executor (binary builds)
workdir="/"
fi
# It is very important that this stays in sync with binary_populate_env.sh
if [[ "$OSTYPE" == "msys" ]]; then
# We need to make the paths as short as possible on Windows
export PYTORCH_ROOT="$workdir/p"
export BUILDER_ROOT="$workdir/b"
else
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
fi
# Try to extract PR number from branch if not already set
if [[ -z "${CIRCLE_PR_NUMBER:-}" ]]; then
CIRCLE_PR_NUMBER="$(echo ${CIRCLE_BRANCH} | sed -E -n 's/pull\/([0-9]*).*/\1/p')"
fi
# Clone the Pytorch branch
retry git clone https://github.com/pytorch/pytorch.git "$PYTORCH_ROOT"
pushd "$PYTORCH_ROOT"
if [[ -n "${CIRCLE_PR_NUMBER:-}" ]]; then
# "smoke" binary build on PRs
git fetch --force origin "pull/${CIRCLE_PR_NUMBER}/head:remotes/origin/pull/${CIRCLE_PR_NUMBER}"
git reset --hard "$CIRCLE_SHA1"
git checkout -q -B "$CIRCLE_BRANCH"
git reset --hard "$CIRCLE_SHA1"
elif [[ -n "${CIRCLE_SHA1:-}" ]]; then
# Scheduled workflows & "smoke" binary build on trunk on PR merges
DEFAULT_BRANCH="$(git remote show $CIRCLE_REPOSITORY_URL | awk '/HEAD branch/ {print $NF}')"
git reset --hard "$CIRCLE_SHA1"
git checkout -q -B $DEFAULT_BRANCH
else
echo "Can't tell what to checkout"
exit 1
fi
retry git submodule update --init --recursive
echo "Using Pytorch from "
git --no-pager log --max-count 1
popd
# Clone the Builder main repo
retry git clone -q https://github.com/pytorch/builder.git "$BUILDER_ROOT"
pushd "$BUILDER_ROOT"
echo "Using builder from "
git --no-pager log --max-count 1
popd

View File

@ -1,44 +0,0 @@
#!/bin/bash
set -eux -o pipefail
# This step runs on multiple executors with different envfile locations
if [[ "$(uname)" == Darwin ]]; then
envfile="/Users/distiller/project/env"
elif [[ -d "/home/circleci/project" ]]; then
# machine executor (binary tests)
envfile="/home/circleci/project/env"
else
# docker executor (binary builds)
envfile="/env"
fi
# TODO this is super hacky and ugly. Basically, the binary_update_html job does
# not have an env file, since it does not call binary_populate_env.sh, since it
# does not have a BUILD_ENVIRONMENT. So for this one case, which we detect by a
# lack of an env file, we manually export the environment variables that we
# need to install miniconda
if [[ ! -f "$envfile" ]]; then
MINICONDA_ROOT="/home/circleci/project/miniconda"
workdir="/home/circleci/project"
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
export -f retry
else
source "$envfile"
fi
conda_sh="$workdir/install_miniconda.sh"
if [[ "$(uname)" == Darwin ]]; then
curl --retry 3 --retry-all-errors -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh
else
curl --retry 3 --retry-all-errors -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
fi
chmod +x "$conda_sh"
"$conda_sh" -b -p "$MINICONDA_ROOT"
rm -f "$conda_sh"
# We can't actually add miniconda to the PATH in the envfile, because that
# breaks 'unbuffer' in Mac jobs. This is probably because conda comes with
# a tclsh, which then gets inserted before the tclsh needed in /usr/bin

View File

@ -4,10 +4,6 @@ set -eux -o pipefail
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR"
if [[ -z "${GITHUB_ACTIONS:-}" ]]; then
export PATH="${workdir:-${HOME}}/miniconda/bin:${PATH}"
fi
# Build
export USE_PYTORCH_METAL_EXPORT=1
export USE_COREML_DELEGATE=1

View File

@ -3,17 +3,9 @@ set -eux -o pipefail
export TZ=UTC
tagged_version() {
# Grabs version from either the env variable CIRCLE_TAG
# or the pytorch git described version
if [[ "$OSTYPE" == "msys" && -z "${GITHUB_ACTIONS:-}" ]]; then
GIT_DIR="${workdir}/p/.git"
else
GIT_DIR="${workdir}/pytorch/.git"
fi
GIT_DIR="${workdir}/pytorch/.git"
GIT_DESCRIBE="git --git-dir ${GIT_DIR} describe --tags --match v[0-9]*.[0-9]*.[0-9]*"
if [[ -n "${CIRCLE_TAG:-}" ]]; then
echo "${CIRCLE_TAG}"
elif [[ ! -d "${GIT_DIR}" ]]; then
if [[ ! -d "${GIT_DIR}" ]]; then
echo "Abort, abort! Git dir ${GIT_DIR} does not exists!"
kill $$
elif ${GIT_DESCRIBE} --exact >/dev/null; then
@ -59,6 +51,7 @@ PIP_UPLOAD_FOLDER='nightly/'
# We put this here so that OVERRIDE_PACKAGE_VERSION below can read from it
export DATE="$(date -u +%Y%m%d)"
BASE_BUILD_VERSION="$(cat ${PYTORCH_ROOT}/version.txt|cut -da -f1).dev${DATE}"
# Change BASE_BUILD_VERSION to git tag when on a git tag
# Use 'git -C' to make doubly sure we're in the correct directory for checking
# the git tag
@ -78,6 +71,35 @@ fi
export PYTORCH_BUILD_NUMBER=1
# Set triton version as part of PYTORCH_EXTRA_INSTALL_REQUIREMENTS
TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for the all the wheel builds hence append TRITON_CONSTRAINT
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
# Only linux Python < 3.12 are supported wheels for triton
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.12'"
TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)
TRITON_REQUIREMENT="pytorch-triton==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"
fi
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS} | ${TRITON_REQUIREMENT}"
fi
# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton rocm package
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" && "$DESIRED_PYTHON" != "3.12" ]]; then
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-rocm.txt)
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"
else
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS} | ${TRITON_REQUIREMENT}"
fi
fi
JAVA_HOME=
BUILD_JNI=OFF
if [[ "$PACKAGE_TYPE" == libtorch ]]; then
@ -123,12 +145,13 @@ if [[ "${OSTYPE}" == "msys" ]]; then
else
export DESIRED_DEVTOOLSET="${DESIRED_DEVTOOLSET:-}"
fi
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}"
export DATE="$DATE"
export NIGHTLIES_DATE_PREAMBLE=1.14.0.dev
export PYTORCH_BUILD_VERSION="$PYTORCH_BUILD_VERSION"
export PYTORCH_BUILD_NUMBER="$PYTORCH_BUILD_NUMBER"
export OVERRIDE_PACKAGE_VERSION="$PYTORCH_BUILD_VERSION"
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}"
# TODO: We don't need this anymore IIUC
export TORCH_PACKAGE_NAME='torch'
@ -161,28 +184,6 @@ if [[ "$(uname)" != Darwin ]]; then
EOL
fi
if [[ -z "${GITHUB_ACTIONS:-}" ]]; then
cat >>"$envfile" <<EOL
export workdir="$workdir"
export MAC_PACKAGE_WORK_DIR="$workdir"
if [[ "$OSTYPE" == "msys" ]]; then
export PYTORCH_ROOT="$workdir/p"
export BUILDER_ROOT="$workdir/b"
else
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
fi
export MINICONDA_ROOT="$workdir/miniconda"
export PYTORCH_FINAL_PACKAGE_DIR="$workdir/final_pkgs"
export CIRCLE_TAG="${CIRCLE_TAG:-}"
export CIRCLE_SHA1="$CIRCLE_SHA1"
export CIRCLE_PR_NUMBER="${CIRCLE_PR_NUMBER:-}"
export CIRCLE_BRANCH="$CIRCLE_BRANCH"
export CIRCLE_WORKFLOW_ID="$CIRCLE_WORKFLOW_ID"
EOL
fi
echo 'retry () {' >> "$envfile"
echo ' $* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)' >> "$envfile"
echo '}' >> "$envfile"

View File

@ -1,29 +0,0 @@
#!/bin/bash
# This section is used in the binary_test and smoke_test jobs. It expects
# 'binary_populate_env' to have populated /home/circleci/project/env and it
# expects another section to populate /home/circleci/project/ci_test_script.sh
# with the code to run in the docker
# Expect all needed environment variables to be written to this file
source /home/circleci/project/env
echo "Running the following code in Docker"
cat /home/circleci/project/ci_test_script.sh
echo
echo
set -eux -o pipefail
# Expect actual code to be written to this file
chmod +x /home/circleci/project/ci_test_script.sh
VOLUME_MOUNTS="-v /home/circleci/project/:/circleci_stuff -v /home/circleci/project/final_pkgs:/final_pkgs -v ${PYTORCH_ROOT}:/pytorch -v ${BUILDER_ROOT}:/builder"
# Run the docker
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all ${VOLUME_MOUNTS} -t -d "${DOCKER_IMAGE}")
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined ${VOLUME_MOUNTS} -t -d "${DOCKER_IMAGE}")
fi
# Execute the test script that was populated by an earlier section
export COMMAND='((echo "source /circleci_stuff/env && /circleci_stuff/ci_test_script.sh") | docker exec -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts

View File

@ -1,111 +0,0 @@
#!/usr/bin/env bash
set -ex -o pipefail
# Remove unnecessary sources
sudo rm -f /etc/apt/sources.list.d/google-chrome.list
sudo rm -f /etc/apt/heroku.list
sudo rm -f /etc/apt/openjdk-r-ubuntu-ppa-xenial.list
sudo rm -f /etc/apt/partner.list
# To increase the network reliability, let apt decide which mirror is best to use
sudo sed -i -e 's/http:\/\/.*archive/mirror:\/\/mirrors/' -e 's/\/ubuntu\//\/mirrors.txt/' /etc/apt/sources.list
retry () {
$* || $* || $* || $* || $*
}
# Method adapted from here: https://askubuntu.com/questions/875213/apt-get-to-retry-downloading
# (with use of tee to avoid permissions problems)
# This is better than retrying the whole apt-get command
echo "APT::Acquire::Retries \"3\";" | sudo tee /etc/apt/apt.conf.d/80-retries
retry sudo apt-get update -qq
retry sudo apt-get -y install \
moreutils \
expect-dev
echo "== DOCKER VERSION =="
docker version
if ! command -v aws >/dev/null; then
retry sudo pip3 -q install awscli==1.19.64
fi
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
DRIVER_FN="NVIDIA-Linux-x86_64-515.76.run"
wget "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"
sudo /bin/bash "$DRIVER_FN" -s --no-drm || (sudo cat /var/log/nvidia-installer.log && false)
nvidia-smi
# Taken directly from https://github.com/NVIDIA/nvidia-docker
# Add the package repositories
distribution=$(. /etc/os-release;echo "$ID$VERSION_ID")
curl -s -L --retry 3 --retry-all-errors https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L --retry 3 --retry-all-errors "https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.list" | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
retry sudo apt-get update -qq
# Necessary to get the `--gpus` flag to function within docker
retry sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
else
# Explicitly remove nvidia docker apt repositories if not building for cuda
sudo rm -rf /etc/apt/sources.list.d/nvidia-docker.list
fi
add_to_env_file() {
local name=$1
local value=$2
case "$value" in
*\ *)
# BASH_ENV should be set by CircleCI
echo "${name}='${value}'" >> "${BASH_ENV:-/tmp/env}"
;;
*)
echo "${name}=${value}" >> "${BASH_ENV:-/tmp/env}"
;;
esac
}
add_to_env_file CI_MASTER "${CI_MASTER:-}"
add_to_env_file COMMIT_SOURCE "${CIRCLE_BRANCH:-}"
add_to_env_file BUILD_ENVIRONMENT "${BUILD_ENVIRONMENT}"
add_to_env_file CIRCLE_PULL_REQUEST "${CIRCLE_PULL_REQUEST}"
if [[ "${BUILD_ENVIRONMENT}" == *-build ]]; then
add_to_env_file SCCACHE_BUCKET ossci-compiler-cache-circleci-v2
SCCACHE_MAX_JOBS=$(( $(nproc) - 1 ))
MEMORY_LIMIT_MAX_JOBS=8 # the "large" resource class on CircleCI has 32 CPU cores, if we use all of them we'll OOM
MAX_JOBS=$(( ${SCCACHE_MAX_JOBS} > ${MEMORY_LIMIT_MAX_JOBS} ? ${MEMORY_LIMIT_MAX_JOBS} : ${SCCACHE_MAX_JOBS} ))
add_to_env_file MAX_JOBS "${MAX_JOBS}"
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
add_to_env_file TORCH_CUDA_ARCH_LIST 5.2
fi
if [[ "${BUILD_ENVIRONMENT}" == *xla* ]]; then
# This IAM user allows write access to S3 bucket for sccache & bazels3cache
set +x
add_to_env_file XLA_CLANG_CACHE_S3_BUCKET_NAME "${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}"
add_to_env_file AWS_ACCESS_KEY_ID "${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}"
add_to_env_file AWS_SECRET_ACCESS_KEY "${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}"
set -x
else
# This IAM user allows write access to S3 bucket for sccache
set +x
add_to_env_file XLA_CLANG_CACHE_S3_BUCKET_NAME "${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}"
add_to_env_file AWS_ACCESS_KEY_ID "${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}"
add_to_env_file AWS_SECRET_ACCESS_KEY "${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}"
set -x
fi
fi
# This IAM user only allows read-write access to ECR
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_ECR_READ_WRITE_V4:-}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_ECR_READ_WRITE_V4:-}
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\")
export AWS_REGION=us-east-1
aws ecr get-login-password --region $AWS_REGION|docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
set -x

View File

@ -1,50 +0,0 @@
#!/usr/bin/env bash
set -eux -o pipefail
# Set up CircleCI GPG keys for apt, if needed
curl --retry 3 --retry-all-errors -s -L https://packagecloud.io/circleci/trusty/gpgkey | sudo apt-key add -
# Stop background apt updates. Hypothetically, the kill should not
# be necessary, because stop is supposed to send a kill signal to
# the process, but we've added it for good luck. Also
# hypothetically, it's supposed to be unnecessary to wait for
# the process to block. We also have that line for good luck.
# If you like, try deleting them and seeing if it works.
sudo systemctl stop apt-daily.service || true
sudo systemctl kill --kill-who=all apt-daily.service || true
sudo systemctl stop unattended-upgrades.service || true
sudo systemctl kill --kill-who=all unattended-upgrades.service || true
# wait until `apt-get update` has been killed
while systemctl is-active --quiet apt-daily.service
do
sleep 1;
done
while systemctl is-active --quiet unattended-upgrades.service
do
sleep 1;
done
# See if we actually were successful
systemctl list-units --all | cat
# For good luck, try even harder to kill apt-get
sudo pkill apt-get || true
# For even better luck, purge unattended-upgrades
sudo apt-get purge -y unattended-upgrades || true
cat /etc/apt/sources.list
# For the bestest luck, kill again now
sudo pkill apt || true
sudo pkill dpkg || true
# Try to detect if apt/dpkg is stuck
if ps auxfww | grep '[a]pt'; then
echo "WARNING: There are leftover apt processes; subsequent apt update will likely fail"
fi
if ps auxfww | grep '[d]pkg'; then
echo "WARNING: There are leftover dpkg processes; subsequent apt update will likely fail"
fi

View File

@ -42,7 +42,6 @@ misc-*,
-misc-non-private-member-variables-in-classes,
-misc-confusable-identifiers,
modernize-*,
-modernize-concat-nested-namespaces,
-modernize-macro-to-enum,
-modernize-return-braced-init-list,
-modernize-use-auto,

View File

@ -30,5 +30,5 @@ RUN if [ -n "$CLANG_VERSION" ]; then \
# Install cuda if version is specified
ARG CUDA_VERSION
RUN if [ -n "$CUDA_VERSION" ]; then \
conda install cuda -c "nvidia/label/cuda-${CUDA_VERSION}"; \
conda install -y cuda -c "nvidia/label/cuda-${CUDA_VERSION}"; \
fi

View File

@ -46,7 +46,7 @@ If you are using [Visual Studio Code Remote - SSH](https://code.visualstudio.com
## Step 6: Open in DevContainer
1. In VSCode, use the Command Palette (`Ctrl+Shift+P` or `Cmd+Shift+P` on macOS) to run the "Remote-Containers: Open Folder in Container..." command.
1. In VSCode, use the Command Palette (`Ctrl+Shift+P` or `Cmd+Shift+P` on macOS) to run the "Dev Containers: Open Folder in Container..." command.
2. You will be prompted with two options: CPU dev container or CUDA dev container. Choose the one you want to run.
## Step 7: Wait for Building the Environment

22
.flake8
View File

@ -2,7 +2,7 @@
# NOTE: **Mirror any changes** to this file the [tool.ruff] config in pyproject.toml
# before we can fully move to use ruff
enable-extensions = G
select = B,C,E,F,G,P,SIM1,T4,W,B9,TOR0,TOR1,TOR2
select = B,C,E,F,G,P,SIM1,T4,W,B9,TOR0,TOR1,TOR2,TOR9
max-line-length = 120
# C408 ignored because we like the dict keyword argument syntax
# E501 is not flexible enough, we're using B950 instead
@ -27,6 +27,9 @@ ignore =
# TODO(kit1980): fix all TOR102 issues
# `torch.load` without `weights_only` parameter is unsafe
TOR102,
# TODO(kit1980): resolve all TOR003 issues
# pass `use_reentrant` explicitly to `checkpoint`.
TOR003
per-file-ignores =
__init__.py: F401
test/**: F821
@ -34,6 +37,23 @@ per-file-ignores =
torch/utils/cpp_extension.py: B950
torchgen/api/types/__init__.py: F401,F403
torchgen/executorch/api/types/__init__.py: F401,F403
test/dynamo/test_higher_order_ops.py: B950
torch/testing/_internal/dynamo_test_failures.py: B950
# TOR901 is only for test, we want to ignore it for everything else.
# It's not easy to configure this without affecting other per-file-ignores,
# so we explicitly list every file where it's violated outside of test.
torch/__init__.py: F401,TOR901
torch/_custom_op/impl.py: TOR901
torch/_export/serde/upgrade.py: TOR901
torch/_functorch/vmap.py: TOR901
torch/_inductor/test_operators.py: TOR901
torch/_library/abstract_impl.py: TOR901
torch/_meta_registrations.py: TOR901
torch/_prims/__init__.py: F401,TOR901
torch/_prims/rng_prims.py: TOR901
torch/ao/quantization/fx/_decomposed.py: TOR901
torch/distributed/_functional_collectives.py: TOR901
torch/distributed/_spmd/data_parallel.py: TOR901
optional-ascii-coding = True
exclude =
./.git,

View File

@ -19,7 +19,7 @@ self-hosted-runner:
- windows.g5.4xlarge.nvidia.gpu
- bm-runner
- linux.rocm.gpu
- macos-m1-12
- macos-m1-stable
- macos-m1-13
- macos-12-xl
- macos-12

View File

@ -0,0 +1,29 @@
name: Download TD Artifacts
description: Download artifacts from target_determination.yml
inputs:
use-gha:
description: If set to any value, use GHA to download the artifact. Otherwise use s3.
required: false
runs:
using: composite
steps:
- name: Download TD Artifacts from S3
if: ${{ !inputs.use-gha }}
uses: seemethere/download-artifact-s3@v4
with:
name: td_results
- name: Download TD Artifacts from GHA
if: inputs.use-gha
uses: actions/download-artifact@v3
with:
name: td_results.json
- name: Move artifacts to .additional_ci_files folder
shell: bash
run: |
mkdir -p .additional_ci_files
mv td_results.json .additional_ci_files/td_results.json

View File

@ -26,11 +26,20 @@ outputs:
description: True if the filtered test configs matrix is empty. False otherwise.
value: ${{ steps.filter.outputs.is-test-matrix-empty }}
keep-going:
description: True if keep-going label was on PR.
description: True if keep-going label was on PR or [keep-going] in PR body.
value: ${{ steps.filter.outputs.keep-going }}
reenabled-issues:
description: Comma separated list of issue numbers that should correspond to disable test issues that the PR fixes
value: ${{ steps.filter.outputs.reenabled-issues }}
ci-verbose-test-logs:
description: True if ci-verbose-test-logs label was on PR or [ci-verbose-test-logs] in PR body.
value: ${{ steps.filter.outputs.ci-verbose-test-logs }}
ci-no-test-timeout:
description: True if ci-no-test-timeout label was on PR or [ci-no-test-timeout] in PR body.
value: ${{ steps.filter.outputs.ci-no-test-timeout }}
ci-no-td:
description: True if ci-no-td label was on PR or [ci-no-td] in PR body.
value: ${{ steps.filter.outputs.ci-no-td }}
runs:
using: composite

View File

@ -9,6 +9,16 @@ runs:
shell: bash
run: echo "DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock" >> "${GITHUB_ENV}"
- name: Remove leftover Docker config file
shell: bash
continue-on-error: true
run: |
set -ex
cat ~/.docker/config.json || true
# https://stackoverflow.com/questions/64455468/error-when-logging-into-ecr-with-docker-login-error-saving-credentials-not
rm -f ~/.docker/config.json
- name: Stop all running docker containers
if: always()
shell: bash

View File

@ -1,59 +0,0 @@
name: Update commit hash
inputs:
repo-owner:
required: false
type: string
description: Name of repository's owner.
default: pytorch
repo-name:
required: true
type: string
description: Name of the repository we're updating commit hash for.
branch:
required: true
type: string
description: Branch to fetch commit of
pin-folder:
type: string
description: Path to folder with commit pin
required: false
default: .github/ci_commit_pins
updatebot-token:
required: true
type: string
description: update bot token
pytorchbot-token:
required: true
type: string
description: update bot token
description: update commit hash
runs:
using: composite
steps:
- name: Checkout repo
uses: actions/checkout@v3
with:
fetch-depth: 1
submodules: false
token: ${{ inputs.updatebot-token }}
- name: Checkout
shell: bash
run: |
git clone https://github.com/${{ inputs.repo-owner }}/${{ inputs.repo-name }}.git --quiet
- name: Check if there already exists a PR
shell: bash
env:
REPO_NAME: ${{ inputs.repo-name }}
BRANCH: ${{ inputs.branch }}
PIN_FOLDER: ${{ inputs.pin-folder }}
UPDATEBOT_TOKEN: ${{ inputs.updatebot-token }}
PYTORCHBOT_TOKEN: ${{ inputs.pytorchbot-token }}
NEW_BRANCH_NAME: update-${{ inputs.repo-name }}-commit-hash/${{ github.run_id }}-${{ github.run_number }}-${{ github.run_attempt }}
run: |
# put this here instead of the script to prevent accidentally changing the config when running the script locally
git config --global user.name "PyTorch UpdateBot"
git config --global user.email "pytorchupdatebot@users.noreply.github.com"
python .github/scripts/update_commit_hashes.py --repo-name "${REPO_NAME}" --branch "${BRANCH}" --pin-folder "${PIN_FOLDER}"

View File

@ -6,7 +6,6 @@ reviewers:
- albanD
- miladm
- bdhirsh
- voznesenskym
per_author:
symbolic-shapes:

View File

@ -1 +1 @@
e3efbc2d9094685dd2d4ae143853941f82f167af
87aeb554d3e2f7855b7abe5120c282f59648ed7a

View File

@ -1 +1 @@
99944a2fb8624947f9c0e2edc898ff42a16124da
d6015d42d9a1834bc7595c4bd6852562fb80b30b

View File

@ -1 +1 @@
d23430765b5df76cd1267f438f129f51b7d6e3e1
2c127da8b5e2e8f44b50994c6cb931bcca267cfe

View File

@ -1 +1 @@
e1c94dfa5a74331a376537c23bf74a2c367f24bd
r2.3

6
.github/labeler.yml vendored
View File

@ -26,6 +26,11 @@
- .github/ci_commit_pins/**
- c10/core/Sym*
- torch/fx/experimental/symbolic_shapes.py
- torch/fx/experimental/recording.py
- torch/fx/experimental/sym_node.py
- torch/fx/experimental/validator.py
- torch/fx/experimental/_sym_dispatch_mode.py
- torch/fx/experimental/proxy_tensor.py
- test/distributed/_tensor/test_dtensor_compile.py
- test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
- torch/distributed/_tensor/**
@ -39,6 +44,7 @@
- aten/src/ATen/native/mkldnn/**
- torch/cpu/**
- torch/utils/mkldnn.py
- torch/utils/_sympy/**
- test/test_mkldnn.py
"module: mkldnn":

View File

@ -275,17 +275,20 @@
- wanchaol
- fduwjj
- H-Huang
- aazzolini
- kwen2501
- XilunWu
- wz337
- awgu
- fegin
- kumpera
- yhcharles
- kurman
- LucasLLC
- sanketpurandare
- shuqiangzhang
- tianyu-l
- kiukchung
- d4l3k
- shuqiangzhang
- weifengpy
mandatory_checks_name:
- EasyCLA
- Lint

View File

@ -1,6 +1,5 @@
# This file is to cache other dependencies not specified elsewhere in:
# requirement.txt
# requirements-flake8.txt
# docs/requirements.txt
# docs/cpp/requirements.txt
# functorch/docs/requirements.txt

View File

@ -4,6 +4,6 @@ mkl-include=2022.1.0
ninja=1.10.2
numpy=1.23.3
pyyaml=6.0
requests=2.28.1
setuptools=65.5.0
requests=2.31.0
setuptools=68.2.2
typing-extensions=4.3.0

View File

@ -3,6 +3,6 @@ cmake=3.22.1
ninja=1.10.2
numpy=1.23.3
pyyaml=6.0
requests=2.28.1
setuptools=63.4.1
requests=2.31.0
setuptools=68.2.2
typing-extensions=4.3.0

View File

@ -16,7 +16,6 @@ pytest==7.3.2
pytest-xdist==3.3.1
pytest-rerunfailures==10.3
pytest-flakefinder==1.1.0
pytest-shard==0.1.2
scipy==1.10.1
sympy==1.11.1
unittest-xml-reporting<=3.2.0,>=2.0.0
@ -28,3 +27,6 @@ rockset==1.0.3
z3-solver==4.12.2.0
tensorboard==2.13.0
optree==0.9.1
# NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
# which the stringify metadata is wrong when escaping double quote
protobuf==3.20.2

View File

@ -10,9 +10,6 @@ from typing import Optional
SCRIPT_DIR = Path(__file__).parent
REPO_DIR = SCRIPT_DIR.parent.parent
# TODO: Remove me once Triton version is again in sync for vanilla and ROCm
ROCM_TRITION_VERSION = "2.1.0"
def read_triton_pin(rocm_hash: bool = False) -> str:
triton_file = "triton.txt" if not rocm_hash else "triton-rocm.txt"
@ -99,7 +96,14 @@ def build_triton(
triton_repo = "https://github.com/openai/triton"
triton_pkg_name = "pytorch-triton"
check_call(["git", "clone", triton_repo], cwd=tmpdir)
check_call(["git", "checkout", commit_hash], cwd=triton_basedir)
if release:
ver, rev, patch = version.split(".")
check_call(
["git", "checkout", f"release/{ver}.{rev}.x"], cwd=triton_basedir
)
else:
check_call(["git", "checkout", commit_hash], cwd=triton_basedir)
if build_conda:
with open(triton_basedir / "meta.yaml", "w") as meta:
print(
@ -155,7 +159,7 @@ def build_triton(
patch_init_py(
triton_pythondir / "triton" / "__init__.py",
version=f"{version}",
expected_version=ROCM_TRITION_VERSION if build_rocm else None,
expected_version=None,
)
if build_rocm:
@ -164,7 +168,7 @@ def build_triton(
triton_pythondir / "setup.py",
name=triton_pkg_name,
version=f"{version}",
expected_version=ROCM_TRITION_VERSION,
expected_version=None,
)
check_call("scripts/amd/setup_rocm_libs.sh", cwd=triton_basedir, shell=True)
print("ROCm libraries setup for triton installation...")

223
.github/scripts/cherry_pick.py vendored Executable file
View File

@ -0,0 +1,223 @@
#!/usr/bin/env python3
import json
import os
import re
from typing import Any, Optional
from urllib.error import HTTPError
from github_utils import gh_fetch_url, gh_post_pr_comment
from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo
from trymerge import get_pr_commit_sha, GitHubPR
# This is only a suggestion for now, not a strict requirement
REQUIRES_ISSUE = {
"regression",
"critical",
"fixnewfeature",
}
def parse_args() -> Any:
from argparse import ArgumentParser
parser = ArgumentParser("cherry pick a landed PR onto a release branch")
parser.add_argument(
"--onto-branch", type=str, required=True, help="the target release branch"
)
parser.add_argument(
"--github-actor", type=str, required=True, help="all the worlds a stage"
)
parser.add_argument(
"--classification",
choices=["regression", "critical", "fixnewfeature", "docs", "release"],
required=True,
help="the cherry pick category",
)
parser.add_argument("pr_num", type=int)
parser.add_argument(
"--fixes",
type=str,
default="",
help="the GitHub issue that the cherry pick fixes",
)
parser.add_argument("--dry-run", action="store_true")
return parser.parse_args()
def get_merge_commit_sha(repo: GitRepo, pr: GitHubPR) -> Optional[str]:
"""
Return the merge commit SHA iff the PR has been merged. For simplicity, we
will only cherry pick PRs that have been merged into main
"""
commit_sha = get_pr_commit_sha(repo, pr)
return commit_sha if pr.is_closed() else None
def cherry_pick(
github_actor: str,
repo: GitRepo,
pr: GitHubPR,
commit_sha: str,
onto_branch: str,
classification: str,
fixes: str,
dry_run: bool = False,
) -> None:
"""
Create a local branch to cherry pick the commit and submit it as a pull request
"""
current_branch = repo.current_branch()
cherry_pick_branch = create_cherry_pick_branch(
github_actor, repo, pr, commit_sha, onto_branch
)
try:
if not dry_run:
org, project = repo.gh_owner_and_name()
cherry_pick_pr = submit_pr(repo, pr, cherry_pick_branch, onto_branch)
msg = f"The cherry pick PR is at {cherry_pick_pr}"
if fixes:
msg += f" and it is linked with issue {fixes}"
elif classification in REQUIRES_ISSUE:
msg += f" and it is recommended to link a {classification} cherry pick PR with an issue"
post_comment(org, project, pr.pr_num, msg)
finally:
if current_branch:
repo.checkout(branch=current_branch)
def create_cherry_pick_branch(
github_actor: str, repo: GitRepo, pr: GitHubPR, commit_sha: str, onto_branch: str
) -> str:
"""
Create a local branch and cherry pick the commit. Return the name of the local
cherry picking branch.
"""
repo.checkout(branch=onto_branch)
repo._run_git("submodule", "update", "--init", "--recursive")
# Remove all special characters if we want to include the actor in the branch name
github_actor = re.sub("[^0-9a-zA-Z]+", "_", github_actor)
cherry_pick_branch = f"cherry-pick-{pr.pr_num}-by-{github_actor}"
repo.create_branch_and_checkout(branch=cherry_pick_branch)
# We might want to support ghstack later
repo._run_git("cherry-pick", "-x", "-X", "theirs", commit_sha)
repo.push(branch=cherry_pick_branch, dry_run=False)
return cherry_pick_branch
def submit_pr(
repo: GitRepo,
pr: GitHubPR,
cherry_pick_branch: str,
onto_branch: str,
) -> str:
"""
Submit the cherry pick PR and return the link to the PR
"""
org, project = repo.gh_owner_and_name()
default_msg = f"Cherry pick #{pr.pr_num} onto {onto_branch} branch"
title = pr.info.get("title", default_msg)
body = pr.info.get("body", default_msg)
try:
response = gh_fetch_url(
f"https://api.github.com/repos/{org}/{project}/pulls",
method="POST",
data={
"title": title,
"body": body,
"head": cherry_pick_branch,
"base": onto_branch,
},
headers={"Accept": "application/vnd.github.v3+json"},
reader=json.load,
)
cherry_pick_pr = response.get("html_url", "")
if not cherry_pick_pr:
raise RuntimeError(
f"Fail to find the cherry pick PR: {json.dumps(response)}"
)
return str(cherry_pick_pr)
except HTTPError as error:
msg = f"Fail to submit the cherry pick PR: {error}"
raise RuntimeError(msg) from error
def post_comment(org: str, project: str, pr_num: int, msg: str) -> None:
"""
Post a comment on the PR itself to point to the cherry picking PR when success
or print the error when failure
"""
internal_debugging = ""
run_url = os.getenv("GH_RUN_URL")
# Post a comment to tell folks that the PR is being cherry picked
if run_url is not None:
internal_debugging = "\n".join(
line
for line in (
"<details><summary>Details for Dev Infra team</summary>",
f'Raised by <a href="{run_url}">workflow job</a>\n',
"</details>",
)
if line
)
comment = "\n".join(
(f"### Cherry picking #{pr_num}", f"{msg}", "", f"{internal_debugging}")
)
gh_post_pr_comment(org, project, pr_num, comment)
def main() -> None:
args = parse_args()
pr_num = args.pr_num
repo = GitRepo(get_git_repo_dir(), get_git_remote_name())
org, project = repo.gh_owner_and_name()
pr = GitHubPR(org, project, pr_num)
try:
commit_sha = get_merge_commit_sha(repo, pr)
if not commit_sha:
raise RuntimeError(
f"Refuse to cherry pick #{pr_num} because it hasn't been merged yet"
)
cherry_pick(
args.github_actor,
repo,
pr,
commit_sha,
args.onto_branch,
args.classification,
args.fixes,
args.dry_run,
)
except RuntimeError as error:
if not args.dry_run:
post_comment(org, project, pr_num, str(error))
else:
raise error
if __name__ == "__main__":
main()

274
.github/scripts/delete_old_branches.py vendored Normal file
View File

@ -0,0 +1,274 @@
# Delete old branches
import os
import re
from datetime import datetime
from pathlib import Path
from typing import Any, Callable, Dict, List, Set
from github_utils import gh_fetch_json_dict, gh_graphql
from gitutils import GitRepo
SEC_IN_DAY = 24 * 60 * 60
CLOSED_PR_RETENTION = 30 * SEC_IN_DAY
NO_PR_RETENTION = 1.5 * 365 * SEC_IN_DAY
PR_WINDOW = 90 * SEC_IN_DAY # Set to None to look at all PRs (may take a lot of tokens)
REPO_OWNER = "pytorch"
REPO_NAME = "pytorch"
ESTIMATED_TOKENS = [0]
TOKEN = os.environ["GITHUB_TOKEN"]
if not TOKEN:
raise Exception("GITHUB_TOKEN is not set")
REPO_ROOT = Path(__file__).parent.parent.parent
# Query for all PRs instead of just closed/merged because it's faster
GRAPHQL_ALL_PRS_BY_UPDATED_AT = """
query ($owner: String!, $repo: String!, $cursor: String) {
repository(owner: $owner, name: $repo) {
pullRequests(
first: 100
after: $cursor
orderBy: {field: UPDATED_AT, direction: DESC}
) {
totalCount
pageInfo {
hasNextPage
endCursor
}
nodes {
headRefName
number
updatedAt
state
}
}
}
}
"""
GRAPHQL_OPEN_PRS = """
query ($owner: String!, $repo: String!, $cursor: String) {
repository(owner: $owner, name: $repo) {
pullRequests(
first: 100
after: $cursor
states: [OPEN]
) {
totalCount
pageInfo {
hasNextPage
endCursor
}
nodes {
headRefName
number
updatedAt
state
}
}
}
}
"""
GRAPHQL_NO_DELETE_BRANCH_LABEL = """
query ($owner: String!, $repo: String!, $cursor: String) {
repository(owner: $owner, name: $repo) {
label(name: "no-delete-branch") {
pullRequests(first: 100, after: $cursor) {
totalCount
pageInfo {
hasNextPage
endCursor
}
nodes {
headRefName
number
updatedAt
state
}
}
}
}
}
"""
def is_protected(branch: str) -> bool:
try:
ESTIMATED_TOKENS[0] += 1
res = gh_fetch_json_dict(
f"https://api.github.com/repos/{REPO_OWNER}/{REPO_NAME}/branches/{branch}"
)
return bool(res["protected"])
except Exception as e:
print(f"[{branch}] Failed to fetch branch protections: {e}")
return True
def convert_gh_timestamp(date: str) -> float:
return datetime.strptime(date, "%Y-%m-%dT%H:%M:%SZ").timestamp()
def get_branches(repo: GitRepo) -> Dict[str, Any]:
# Query locally for branches, group by branch base name (e.g. gh/blah/base -> gh/blah), and get the most recent branch
git_response = repo._run_git(
"for-each-ref",
"--sort=creatordate",
"--format=%(refname) %(committerdate:iso-strict)",
"refs/remotes/origin",
)
branches_by_base_name: Dict[str, Any] = {}
for line in git_response.splitlines():
branch, date = line.split(" ")
re_branch = re.match(r"refs/remotes/origin/(.*)", branch)
assert re_branch
branch = branch_base_name = re_branch.group(1)
if x := re.match(r"(gh\/.+)\/(head|base|orig)", branch):
branch_base_name = x.group(1)
date = datetime.fromisoformat(date).timestamp()
if branch_base_name not in branches_by_base_name:
branches_by_base_name[branch_base_name] = [date, [branch]]
else:
branches_by_base_name[branch_base_name][1].append(branch)
if date > branches_by_base_name[branch_base_name][0]:
branches_by_base_name[branch_base_name][0] = date
return branches_by_base_name
def paginate_graphql(
query: str,
kwargs: Dict[str, Any],
termination_func: Callable[[List[Dict[str, Any]]], bool],
get_data: Callable[[Dict[str, Any]], List[Dict[str, Any]]],
get_page_info: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> List[Any]:
hasNextPage = True
endCursor = None
data: List[Dict[str, Any]] = []
while hasNextPage:
ESTIMATED_TOKENS[0] += 1
res = gh_graphql(query, cursor=endCursor, **kwargs)
data.extend(get_data(res))
hasNextPage = get_page_info(res)["hasNextPage"]
endCursor = get_page_info(res)["endCursor"]
if termination_func(data):
break
return data
def get_recent_prs() -> Dict[str, Any]:
now = datetime.now().timestamp()
# Grab all PRs updated in last CLOSED_PR_RETENTION days
pr_infos: List[Dict[str, Any]] = paginate_graphql(
GRAPHQL_ALL_PRS_BY_UPDATED_AT,
{"owner": "pytorch", "repo": "pytorch"},
lambda data: (
PR_WINDOW is not None
and (now - convert_gh_timestamp(data[-1]["updatedAt"]) > PR_WINDOW)
),
lambda res: res["data"]["repository"]["pullRequests"]["nodes"],
lambda res: res["data"]["repository"]["pullRequests"]["pageInfo"],
)
# Get the most recent PR for each branch base (group gh together)
prs_by_branch_base = {}
for pr in pr_infos:
pr["updatedAt"] = convert_gh_timestamp(pr["updatedAt"])
branch_base_name = pr["headRefName"]
if x := re.match(r"(gh\/.+)\/(head|base|orig)", branch_base_name):
branch_base_name = x.group(1)
if branch_base_name not in prs_by_branch_base:
prs_by_branch_base[branch_base_name] = pr
else:
if pr["updatedAt"] > prs_by_branch_base[branch_base_name]["updatedAt"]:
prs_by_branch_base[branch_base_name] = pr
return prs_by_branch_base
def get_branches_with_magic_label_or_open_pr() -> Set[str]:
pr_infos: List[Dict[str, Any]] = paginate_graphql(
GRAPHQL_NO_DELETE_BRANCH_LABEL,
{"owner": "pytorch", "repo": "pytorch"},
lambda data: False,
lambda res: res["data"]["repository"]["label"]["pullRequests"]["nodes"],
lambda res: res["data"]["repository"]["label"]["pullRequests"]["pageInfo"],
)
pr_infos.extend(
paginate_graphql(
GRAPHQL_OPEN_PRS,
{"owner": "pytorch", "repo": "pytorch"},
lambda data: False,
lambda res: res["data"]["repository"]["pullRequests"]["nodes"],
lambda res: res["data"]["repository"]["pullRequests"]["pageInfo"],
)
)
# Get the most recent PR for each branch base (group gh together)
branch_bases = set()
for pr in pr_infos:
branch_base_name = pr["headRefName"]
if x := re.match(r"(gh\/.+)\/(head|base|orig)", branch_base_name):
branch_base_name = x.group(1)
branch_bases.add(branch_base_name)
return branch_bases
def delete_branch(repo: GitRepo, branch: str) -> None:
repo._run_git("push", "origin", "-d", branch)
def delete_branches() -> None:
now = datetime.now().timestamp()
git_repo = GitRepo(str(REPO_ROOT), "origin", debug=True)
branches = get_branches(git_repo)
prs_by_branch = get_recent_prs()
keep_branches = get_branches_with_magic_label_or_open_pr()
delete = []
# Do not delete if:
# * associated PR is open, closed but updated recently, or contains the magic string
# * no associated PR and branch was updated in last 1.5 years
# * is protected
# Setting different values of PR_WINDOW will change how branches with closed
# PRs are treated depending on how old the branch is. The default value of
# 90 will allow branches with closed PRs to be deleted if the PR hasn't been
# updated in 90 days and the branch hasn't been updated in 1.5 years
for base_branch, (date, sub_branches) in branches.items():
print(f"[{base_branch}] Updated {(now - date) / SEC_IN_DAY} days ago")
if base_branch in keep_branches:
print(f"[{base_branch}] Has magic label or open PR, skipping")
continue
pr = prs_by_branch.get(base_branch)
if pr:
print(
f"[{base_branch}] Has PR {pr['number']}: {pr['state']}, updated {(now - pr['updatedAt']) / SEC_IN_DAY} days ago"
)
if (
now - pr["updatedAt"] < CLOSED_PR_RETENTION
or (now - date) < CLOSED_PR_RETENTION
):
continue
elif now - date < NO_PR_RETENTION:
continue
print(f"[{base_branch}] Checking for branch protections")
if any(is_protected(sub_branch) for sub_branch in sub_branches):
print(f"[{base_branch}] Is protected")
continue
for sub_branch in sub_branches:
print(f"[{base_branch}] Deleting {sub_branch}")
delete.append(sub_branch)
if ESTIMATED_TOKENS[0] > 400:
print("Estimated tokens exceeded, exiting")
break
print(f"To delete ({len(delete)}):")
for branch in delete:
print(f"About to delete branch {branch}")
delete_branch(git_repo, branch)
if __name__ == "__main__":
delete_branches()

View File

@ -1,139 +0,0 @@
import os
import re
import sys
from typing import Any, cast, Dict, List, NamedTuple, Tuple
import rockset # type: ignore[import]
from gitutils import _check_output
def eprint(msg: str) -> None:
print(msg, file=sys.stderr)
class WorkflowCheck(NamedTuple):
workflowName: str
name: str
jobName: str
conclusion: str
def get_latest_commits() -> List[str]:
latest_viable_commit = _check_output(
[
"git",
"log",
"-n",
"1",
"--pretty=format:%H",
"origin/viable/strict",
],
encoding="ascii",
)
commits = _check_output(
[
"git",
"rev-list",
f"{latest_viable_commit}^..HEAD",
"--remotes=*origin/main",
],
encoding="ascii",
).splitlines()
return commits
def query_commits(commits: List[str]) -> List[Dict[str, Any]]:
rs = rockset.RocksetClient(
host="api.usw2a1.rockset.com", api_key=os.environ["ROCKSET_API_KEY"]
)
params = [{"name": "shas", "type": "string", "value": ",".join(commits)}]
res = rs.QueryLambdas.execute_query_lambda(
# https://console.rockset.com/lambdas/details/commons.commit_jobs_batch_query
query_lambda="commit_jobs_batch_query",
version="19c74e10819104f9",
workspace="commons",
parameters=params,
)
return cast(List[Dict[str, Any]], res.results)
def print_commit_status(commit: str, results: Dict[str, Any]) -> None:
print(commit)
for check in results["results"]:
if check["sha"] == commit:
print(f"\t{check['conclusion']:>10}: {check['name']}")
def get_commit_results(
commit: str, results: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
workflow_checks = []
for check in results:
if check["sha"] == commit:
workflow_checks.append(
WorkflowCheck(
workflowName=check["workflowName"],
name=check["name"],
jobName=check["jobName"],
conclusion=check["conclusion"],
)._asdict()
)
return workflow_checks
def isGreen(commit: str, results: List[Dict[str, Any]]) -> Tuple[bool, str]:
workflow_checks = get_commit_results(commit, results)
regex = {
"pull": False,
"trunk": False,
"lint": False,
"linux-binary": False,
}
for check in workflow_checks:
jobName = check["jobName"]
# Ignore result from unstable job, be it success or failure
if "unstable" in jobName:
continue
workflowName = check["workflowName"]
conclusion = check["conclusion"]
for required_check in regex:
if re.match(required_check, workflowName, flags=re.IGNORECASE):
if conclusion not in ["success", "skipped"]:
return (False, workflowName + " checks were not successful")
else:
regex[required_check] = True
missing_workflows = [x for x in regex.keys() if not regex[x]]
if len(missing_workflows) > 0:
return (False, "missing required workflows: " + ", ".join(missing_workflows))
return (True, "")
def get_latest_green_commit(commits: List[str], results: List[Dict[str, Any]]) -> Any:
for commit in commits:
eprint(f"Checking {commit}")
is_green, msg = isGreen(commit, results)
if is_green:
eprint("GREEN")
return commit
else:
eprint("RED: " + msg)
return None
def main() -> None:
commits = get_latest_commits()
results = query_commits(commits)
latest_viable_commit = get_latest_green_commit(commits, results)
print(latest_viable_commit)
if __name__ == "__main__":
main()

View File

@ -62,9 +62,9 @@ SUPPORTED_PERIODICAL_MODES: Dict[str, Callable[[Optional[str]], bool]] = {
}
# The link to the published list of disabled jobs
DISABLED_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json"
DISABLED_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json?versionId=qO7aEr.Og33PtLXfNq0j0yj.bbLC7SzR"
# and unstable jobs
UNSTABLE_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/unstable-jobs.json"
UNSTABLE_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/unstable-jobs.json?versionId=7NhgpqKTtGXVUnL1C79KboTW_5qQx8y5"
# Some constants used to handle disabled and unstable jobs
JOB_NAME_SEP = "/"
@ -474,6 +474,10 @@ def get_reenabled_issues(pr_body: str = "") -> List[str]:
return parse_reenabled_issues(pr_body) + parse_reenabled_issues(commit_messages)
def check_for_setting(labels: Set[str], body: str, setting: str) -> bool:
return setting in labels or f"[{setting}]" in body
def perform_misc_tasks(
labels: Set[str], test_matrix: Dict[str, List[Any]], job_name: str, pr_body: str
) -> None:
@ -481,7 +485,15 @@ def perform_misc_tasks(
In addition to apply the filter logic, the script also does the following
misc tasks to set keep-going and is-unstable variables
"""
set_output("keep-going", "keep-going" in labels)
set_output("keep-going", check_for_setting(labels, pr_body, "keep-going"))
set_output(
"ci-verbose-test-logs",
check_for_setting(labels, pr_body, "ci-verbose-test-logs"),
)
set_output(
"ci-no-test-timeout", check_for_setting(labels, pr_body, "ci-no-test-timeout")
)
set_output("ci-no-td", check_for_setting(labels, pr_body, "ci-no-td"))
# Obviously, if the job name includes unstable, then this is an unstable job
is_unstable = job_name and IssueType.UNSTABLE.value in job_name
@ -577,7 +589,7 @@ def main() -> None:
labels=labels,
test_matrix=filtered_test_matrix,
job_name=args.job_name,
pr_body=pr_body,
pr_body=pr_body if pr_body else "",
)
# Set the filtered test matrix as the output

View File

@ -22,7 +22,7 @@ CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1"}
CUDA_ARCHES_CUDNN_VERSION = {"11.8": "8", "12.1": "8"}
ROCM_ARCHES = ["5.6", "5.7"]
ROCM_ARCHES = ["5.7", "6.0"]
CPU_CXX11_ABI_ARCH = ["cpu-cxx11-abi"]
@ -42,7 +42,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu11==2.19.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"12.1": (
@ -55,7 +55,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.19.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
}

View File

@ -274,42 +274,6 @@ WINDOWS_BINARY_SMOKE_WORKFLOWS = [
]
MACOS_BINARY_BUILD_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.MACOS,
package_type="wheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.MACOS
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},
isolated_workflow=True,
),
),
BinaryBuildWorkflow(
os=OperatingSystem.MACOS,
package_type="conda",
build_configs=generate_binary_build_matrix.generate_conda_matrix(
OperatingSystem.MACOS
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_CONDA},
isolated_workflow=True,
),
),
BinaryBuildWorkflow(
os=OperatingSystem.MACOS,
package_type="libtorch",
abi_version=generate_binary_build_matrix.CXX11_ABI,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.MACOS,
generate_binary_build_matrix.CXX11_ABI,
libtorch_variants=["shared-with-deps"],
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH},
isolated_workflow=True,
),
),
BinaryBuildWorkflow(
os=OperatingSystem.MACOS_ARM64,
package_type="libtorch",
@ -342,7 +306,8 @@ MACOS_BINARY_BUILD_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.MACOS_ARM64,
package_type="conda",
cross_compile_arm64=True,
cross_compile_arm64=False,
macos_runner="macos-13-xlarge",
build_configs=generate_binary_build_matrix.generate_conda_matrix(
OperatingSystem.MACOS_ARM64
),

View File

@ -4,7 +4,7 @@
Will output a condensed version of the matrix. Will include fllowing:
* CUDA version short
* CUDA full verison
* CUDA full version
* CUDNN version short
* Image type either runtime or devel
* Platform linux/arm64,linux/amd64

13
.github/scripts/get_aws_session_tokens.py vendored Executable file
View File

@ -0,0 +1,13 @@
#!/usr/bin/env python3
import boto3 # type: ignore[import]
def main() -> None:
creds_dict = boto3.Session().get_credentials().get_frozen_credentials()._asdict()
print(f"export AWS_ACCESS_KEY_ID={creds_dict['access_key']}")
print(f"export AWS_SECRET_ACCESS_KEY={creds_dict['secret_key']}")
print(f"export AWS_SESSION_TOKEN={creds_dict['token']}")
if __name__ == "__main__":
main()

View File

@ -119,6 +119,19 @@ def gh_fetch_json_dict(
return cast(Dict[str, Any], _gh_fetch_json_any(url, params, data))
def gh_graphql(query: str, **kwargs: Any) -> Dict[str, Any]:
rc = gh_fetch_url(
"https://api.github.com/graphql",
data={"query": query, "variables": kwargs},
reader=json.load,
)
if "errors" in rc:
raise RuntimeError(
f"GraphQL query {query}, args {kwargs} failed: {rc['errors']}"
)
return cast(Dict[str, Any], rc)
def _gh_post_comment(
url: str, comment: str, dry_run: bool = False
) -> List[Dict[str, Any]]:

View File

@ -155,12 +155,19 @@ class GitRepo:
)
return [x.strip() for x in rc.split("\n") if x.strip()] if len(rc) > 0 else []
def current_branch(self) -> str:
return self._run_git("symbolic-ref", "--short", "HEAD").strip()
def current_branch(self) -> Optional[str]:
try:
return self._run_git("symbolic-ref", "--short", "HEAD").strip()
except RuntimeError:
# we are in detached HEAD state
return None
def checkout(self, branch: str) -> None:
self._run_git("checkout", branch)
def create_branch_and_checkout(self, branch: str) -> None:
self._run_git("checkout", "-b", branch)
def fetch(self, ref: Optional[str] = None, branch: Optional[str] = None) -> None:
if branch is None and ref is None:
self._run_git("fetch", self.remote)
@ -273,6 +280,7 @@ class GitRepo:
def cherry_pick_commits(self, from_branch: str, to_branch: str) -> None:
orig_branch = self.current_branch()
assert orig_branch is not None, "Must be on a branch"
self.checkout(to_branch)
from_commits, to_commits = self.compute_branch_diffs(from_branch, to_branch)
if len(from_commits) == 0:
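Because `current_branch` can now return `None` in detached-HEAD state, callers are expected to handle that case explicitly; a minimal sketch (repository construction follows the idiom used elsewhere in these scripts, shown here only for illustration):

```python
# Sketch of handling the new Optional[str] return from GitRepo.current_branch().
from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo

repo = GitRepo(get_git_repo_dir(), get_git_remote_name())
branch = repo.current_branch()
if branch is None:
    # Detached HEAD: `git symbolic-ref --short HEAD` failed, so there is no branch name.
    print("Not on a branch (detached HEAD)")
else:
    print(f"On branch {branch}")
```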

Binary file not shown.

View File

@ -74,15 +74,23 @@ def gh_get_labels(org: str, repo: str) -> List[str]:
def gh_add_labels(
org: str, repo: str, pr_num: int, labels: Union[str, List[str]]
org: str, repo: str, pr_num: int, labels: Union[str, List[str]], dry_run: bool
) -> None:
if dry_run:
print(f"Dryrun: Adding labels {labels} to PR {pr_num}")
return
gh_fetch_url_and_headers(
url=f"https://api.github.com/repos/{org}/{repo}/issues/{pr_num}/labels",
data={"labels": labels},
)
def gh_remove_label(org: str, repo: str, pr_num: int, label: str) -> None:
def gh_remove_label(
org: str, repo: str, pr_num: int, label: str, dry_run: bool
) -> None:
if dry_run:
print(f"Dryrun: Removing {label} from PR {pr_num}")
return
gh_fetch_url_and_headers(
url=f"https://api.github.com/repos/{org}/{repo}/issues/{pr_num}/labels/{label}",
method="DELETE",

44 .github/scripts/lintrunner.sh vendored Executable file
View File

@ -0,0 +1,44 @@
#!/usr/bin/env bash
set -ex
# The generic Linux job chooses to use base env, not the one setup by the image
CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
eval "$(command conda 'shell.bash' 'hook' 2> /dev/null)"
conda activate "${CONDA_ENV}"
CACHE_DIRECTORY="/tmp/.lintbin"
# Try to recover the cached binaries
if [[ -d "${CACHE_DIRECTORY}" ]]; then
# It's ok to fail this as lintrunner init would download these binaries
# again if they do not exist
cp -r "${CACHE_DIRECTORY}" . || true
fi
# This has already been cached in the docker image
lintrunner init 2> /dev/null
# Do build steps necessary for linters
if [[ "${CLANG}" == "1" ]]; then
python3 -m tools.linter.clang_tidy.generate_build_files
fi
python3 -m tools.generate_torch_version --is_debug=false
python3 -m tools.pyi.gen_pyi \
--native-functions-path aten/src/ATen/native/native_functions.yaml \
--tags-path aten/src/ATen/native/tags.yaml \
--deprecated-functions-path "tools/autograd/deprecated.yaml"
RC=0
# Run lintrunner on all files
if ! lintrunner --force-color --all-files --tee-json=lint.json ${ADDITIONAL_LINTRUNNER_ARGS} 2> /dev/null; then
echo ""
echo -e "\e[1m\e[36mYou can reproduce these results locally by using \`lintrunner -m origin/main\`. (If you don't get the same results, run \'lintrunner init\' to update your local linter)\e[0m"
echo -e "\e[1m\e[36mSee https://github.com/pytorch/pytorch/wiki/lintrunner for setup instructions.\e[0m"
RC=1
fi
# Use jq to massage the JSON lint output into GitHub Actions workflow commands.
jq --raw-output \
'"::\(if .severity == "advice" or .severity == "disabled" then "warning" else .severity end) file=\(.path),line=\(.line),col=\(.char),title=\(.code) \(.name)::" + (.description | gsub("\\n"; "%0A"))' \
lint.json || true
exit $RC
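To make the jq transformation above concrete, here is how one lint.json record maps to a GitHub Actions workflow command; the record contents are made up for the example:

```python
# Illustrative: how one lint.json record becomes a GitHub Actions workflow command.
# The record below is made up; severities "advice" and "disabled" are downgraded to "warning".
record = {
    "path": "torch/_tensor.py",
    "line": 42,
    "char": 7,
    "code": "FLAKE8",
    "name": "E501 line too long",
    "severity": "error",
    "description": "line too long (130 > 120 characters)",
}
severity = "warning" if record["severity"] in ("advice", "disabled") else record["severity"]
command = (
    f"::{severity} file={record['path']},line={record['line']},"
    f"col={record['char']},title={record['code']} {record['name']}::"
    + record["description"].replace("\n", "%0A")
)
print(command)
# ::error file=torch/_tensor.py,line=42,col=7,title=FLAKE8 E501 line too long::line too long (130 > 120 characters)
```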

51 .github/scripts/s390x-ci/README.md vendored Normal file
View File

@ -0,0 +1,51 @@
# Configuring the builder.
## Install prerequisites.
```
$ sudo dnf install docker
```
## Add services.
```
$ sudo cp self-hosted-builder/*.service /etc/systemd/system/
$ sudo systemctl daemon-reload
```
## Download qemu-user-static image
```
# sudo docker pull docker.io/iiilinuxibmcom/qemu-user-static:6.1.0-1
```
## Autostart the x86_64 emulation support.
```
$ sudo systemctl enable --now qemu-user-static
```
## Rebuild the image
In order to build or update the `iiilinuxibmcom/actions-runner` image, e.g. to get the
latest OS security fixes, use the following commands:
```
$ cd self-hosted-builder
$ sudo docker build \
--build-arg repo=<owner>/<name> \
--build-arg token=<***> \
--pull \
-f actions-runner.Dockerfile \
-t iiilinuxibmcom/actions-runner \
.
```
If it fails, ensure that SELinux doesn't prevent it from working.
In the worst case, SELinux can be disabled with `setenforce 0`.
## Autostart the runner.
```
$ sudo systemctl enable --now actions-runner@$NAME
```

View File

@ -0,0 +1,66 @@
# Self-Hosted IBM Z Github Actions Runner.
# Temporary image: amd64 dependencies.
FROM docker.io/amd64/ubuntu:22.04 as ld-prefix
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get -y install ca-certificates libicu70 libssl3
# Main image.
FROM docker.io/s390x/ubuntu:22.04
# Packages for pytorch building and testing.
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get -y install \
cmake \
curl \
gcc \
git \
jq \
libxml2-dev \
libxslt-dev \
ninja-build \
python-is-python3 \
python3 \
python3-dev \
python3-pip \
pybind11-dev \
python3-numpy \
libopenblas-dev \
liblapack-dev \
libgloo-dev \
python3-yaml \
python3-scipy \
virtualenv
# amd64 dependencies.
COPY --from=ld-prefix / /usr/x86_64-linux-gnu/
RUN ln -fs ../lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 /usr/x86_64-linux-gnu/lib64/
RUN ln -fs /etc/resolv.conf /usr/x86_64-linux-gnu/etc/
ENV QEMU_LD_PREFIX=/usr/x86_64-linux-gnu
# Scripts.
COPY fs/ /
RUN chmod +x /usr/bin/actions-runner /usr/bin/entrypoint
# amd64 Github Actions Runner.
RUN useradd -m actions-runner
USER actions-runner
WORKDIR /home/actions-runner
RUN curl -L https://github.com/actions/runner/releases/download/v2.309.0/actions-runner-linux-x64-2.309.0.tar.gz | tar -xz
# repository
ARG repo
# repository token
ARG token
RUN ./config.sh \
--unattended \
--url "https://github.com/${repo}" \
--token "${token}" \
--no-default-labels \
--labels self-hosted,linux.s390x
ENTRYPOINT ["/usr/bin/entrypoint"]
CMD ["/usr/bin/actions-runner"]

View File

@ -0,0 +1,22 @@
[Unit]
Description=Self-Hosted IBM Z Github Actions Runner
Wants=qemu-user-static
After=qemu-user-static
StartLimitIntervalSec=0
[Service]
Type=simple
Restart=always
ExecStartPre=-/usr/bin/docker rm --force actions-runner.%i
ExecStart=/usr/bin/docker run \
--init \
--interactive \
--name=actions-runner.%i \
--rm \
iiilinuxibmcom/actions-runner
ExecStop=/bin/sh -c "docker exec actions-runner.%i kill -INT -- -1"
ExecStop=/bin/sh -c "docker wait actions-runner.%i"
ExecStop=/bin/sh -c "docker rm actions-runner.%i"
[Install]
WantedBy=multi-user.target

View File

@ -0,0 +1,6 @@
#!/usr/bin/env bash
set -e -u
# Run one job.
./run.sh --once

View File

@ -0,0 +1,30 @@
#!/usr/bin/env bash
#
# Container entrypoint that waits for all spawned processes.
#
set -e -u
# Create a FIFO and start reading from its read end.
tempdir=$(mktemp -d "/tmp/done.XXXXXXXXXX")
trap 'rm -r "$tempdir"' EXIT
done="$tempdir/pipe"
mkfifo "$done"
cat "$done" & waiter=$!
# Start the workload. Its descendants will inherit the FIFO's write end.
status=0
if [ "$#" -eq 0 ]; then
bash 9>"$done" || status=$?
else
"$@" 9>"$done" || status=$?
fi
# When the workload and all of its descendants exit, the FIFO's write end will
# be closed and `cat "$done"` will exit. Wait until it happens. This is needed
# in order to handle SelfUpdater, which the workload may start in background
# before exiting.
wait "$waiter"
exit "$status"

View File

@ -0,0 +1,11 @@
[Unit]
Description=Support for transparent execution of non-native binaries with QEMU user emulation
[Service]
Type=oneshot
# The source code for iiilinuxibmcom/qemu-user-static is at https://github.com/iii-i/qemu-user-static/tree/v6.1.0-1
# TODO: replace it with multiarch/qemu-user-static once version >6.1 is available
ExecStart=/usr/bin/docker run --rm --interactive --privileged docker.io/iiilinuxibmcom/qemu-user-static:6.1.0-1 --reset -p yes
[Install]
WantedBy=multi-user.target

View File

@ -1,148 +0,0 @@
from typing import Any, Dict, List
from unittest import main, mock, TestCase
from fetch_latest_green_commit import isGreen, WorkflowCheck
workflowNames = [
"pull",
"trunk",
"Lint",
"linux-binary-libtorch-pre-cxx11",
"android-tests",
"windows-binary-wheel",
"periodic",
"docker-release-builds",
"nightly",
"pr-labels",
"Close stale pull requests",
"Update S3 HTML indices for download.pytorch.org",
"Create Release",
]
def set_workflow_job_status(
workflow: List[Dict[str, Any]], name: str, status: str
) -> List[Dict[str, Any]]:
for check in workflow:
if check["workflowName"] == name:
check["conclusion"] = status
return workflow
class TestChecks:
def make_test_checks(self) -> List[Dict[str, Any]]:
workflow_checks = []
for i in range(len(workflowNames)):
workflow_checks.append(
WorkflowCheck(
workflowName=workflowNames[i],
name="test/job",
jobName="job",
conclusion="success",
)._asdict()
)
return workflow_checks
class TestPrintCommits(TestCase):
@mock.patch(
"fetch_latest_green_commit.get_commit_results",
return_value=TestChecks().make_test_checks(),
)
def test_all_successful(self, mock_get_commit_results: Any) -> None:
"Test with workflows are successful"
workflow_checks = mock_get_commit_results()
self.assertTrue(isGreen("sha", workflow_checks)[0])
@mock.patch(
"fetch_latest_green_commit.get_commit_results",
return_value=TestChecks().make_test_checks(),
)
def test_necessary_successful(self, mock_get_commit_results: Any) -> None:
"Test with necessary workflows are successful"
workflow_checks = mock_get_commit_results()
workflow_checks = set_workflow_job_status(
workflow_checks, workflowNames[8], "failed"
)
workflow_checks = set_workflow_job_status(
workflow_checks, workflowNames[9], "failed"
)
workflow_checks = set_workflow_job_status(
workflow_checks, workflowNames[10], "failed"
)
workflow_checks = set_workflow_job_status(
workflow_checks, workflowNames[11], "failed"
)
workflow_checks = set_workflow_job_status(
workflow_checks, workflowNames[12], "failed"
)
self.assertTrue(isGreen("sha", workflow_checks)[0])
@mock.patch(
"fetch_latest_green_commit.get_commit_results",
return_value=TestChecks().make_test_checks(),
)
def test_necessary_skipped(self, mock_get_commit_results: Any) -> None:
"Test with necessary job (ex: pull) skipped"
workflow_checks = mock_get_commit_results()
workflow_checks = set_workflow_job_status(workflow_checks, "pull", "skipped")
result = isGreen("sha", workflow_checks)
self.assertTrue(result[0])
@mock.patch(
"fetch_latest_green_commit.get_commit_results",
return_value=TestChecks().make_test_checks(),
)
def test_skippable_skipped(self, mock_get_commit_results: Any) -> None:
"Test with skippable jobs (periodic and docker-release-builds skipped"
workflow_checks = mock_get_commit_results()
workflow_checks = set_workflow_job_status(
workflow_checks, "periodic", "skipped"
)
workflow_checks = set_workflow_job_status(
workflow_checks, "docker-release-builds", "skipped"
)
self.assertTrue(isGreen("sha", workflow_checks))
@mock.patch(
"fetch_latest_green_commit.get_commit_results",
return_value=TestChecks().make_test_checks(),
)
def test_necessary_failed(self, mock_get_commit_results: Any) -> None:
"Test with necessary job (ex: Lint) failed"
workflow_checks = mock_get_commit_results()
workflow_checks = set_workflow_job_status(workflow_checks, "Lint", "failed")
result = isGreen("sha", workflow_checks)
self.assertFalse(result[0])
self.assertEqual(result[1], "Lint checks were not successful")
@mock.patch(
"fetch_latest_green_commit.get_commit_results",
return_value=TestChecks().make_test_checks(),
)
def test_skippable_failed(self, mock_get_commit_results: Any) -> None:
"Test with failing skippable jobs (ex: docker-release-builds) should pass"
workflow_checks = mock_get_commit_results()
workflow_checks = set_workflow_job_status(
workflow_checks, "periodic", "skipped"
)
workflow_checks = set_workflow_job_status(
workflow_checks, "docker-release-builds", "failed"
)
result = isGreen("sha", workflow_checks)
self.assertTrue(result[0])
@mock.patch("fetch_latest_green_commit.get_commit_results", return_value={})
def test_no_workflows(self, mock_get_commit_results: Any) -> None:
"Test with missing workflows"
workflow_checks = mock_get_commit_results()
result = isGreen("sha", workflow_checks)
self.assertFalse(result[0])
self.assertEqual(
result[1],
"missing required workflows: pull, trunk, lint, linux-binary",
)
if __name__ == "__main__":
main()

View File

@ -636,55 +636,108 @@ class TestConfigFilter(TestCase):
@mock.patch("subprocess.check_output")
def test_perform_misc_tasks(self, mocked_subprocess: Any) -> None:
def _gen_expected_string(
keep_going: bool = False,
ci_verbose_test_logs: bool = False,
ci_no_test_timeout: bool = False,
ci_no_td: bool = False,
is_unstable: bool = False,
reenabled_issues: str = "",
) -> str:
return (
f"keep-going={keep_going}\n"
f"ci-verbose-test-logs={ci_verbose_test_logs}\n"
f"ci-no-test-timeout={ci_no_test_timeout}\n"
f"ci-no-td={ci_no_td}\n"
f"is-unstable={is_unstable}\n"
f"reenabled-issues={reenabled_issues}\n"
)
mocked_subprocess.return_value = b""
testcases: List[Dict[str, Any]] = [
{
"labels": {},
"test_matrix": '{include: [{config: "default"}]}',
"job_name": "A job name",
"expected": "keep-going=False\nis-unstable=False\nreenabled-issues=\n",
"expected": _gen_expected_string(),
"description": "No keep-going, no is-unstable",
},
{
"labels": {"keep-going"},
"test_matrix": '{include: [{config: "default"}]}',
"job_name": "A job name",
"expected": "keep-going=True\nis-unstable=False\nreenabled-issues=\n",
"expected": _gen_expected_string(keep_going=True),
"description": "Has keep-going, no is-unstable",
},
{
"labels": {},
"test_matrix": '{include: [{config: "default"}]}',
"job_name": "A job name",
"pr_body": "[keep-going]",
"expected": _gen_expected_string(keep_going=True),
"description": "Keep-going in PR body",
},
{
"labels": {"ci-verbose-test-logs"},
"test_matrix": '{include: [{config: "default"}]}',
"job_name": "A job name",
"pr_body": "[ci-no-test-timeout]",
"expected": _gen_expected_string(
ci_verbose_test_logs=True, ci_no_test_timeout=True
),
"description": "No pipe logs label and no test timeout in PR body",
},
{
"labels": {"ci-no-test-timeout"},
"test_matrix": '{include: [{config: "default"}]}',
"job_name": "A job name",
"pr_body": "[ci-verbose-test-logs]",
"expected": _gen_expected_string(
ci_verbose_test_logs=True, ci_no_test_timeout=True
),
"description": "No pipe logs in PR body and no test timeout in label (same as the above but swapped)",
},
{
"labels": {"ci-no-td"},
"test_matrix": '{include: [{config: "default"}]}',
"job_name": "A job name",
"pr_body": "",
"expected": _gen_expected_string(ci_no_td=True),
"description": "No pipe logs in PR body and no test timeout in label (same as the above but swapped)",
},
{
"labels": {},
"test_matrix": '{include: [{config: "default"}]}',
"job_name": None,
"expected": "keep-going=False\nis-unstable=False\nreenabled-issues=\n",
"expected": _gen_expected_string(),
"description": "No job name",
},
{
"labels": {},
"test_matrix": '{include: [{config: "default"}]}',
"job_name": "macos-12-py3-arm64 / test (default, 1, 3, macos-m1-12, unstable)",
"expected": "keep-going=False\nis-unstable=True\nreenabled-issues=\n",
"job_name": "macos-12-py3-arm64 / test (default, 1, 3, macos-m1-stable, unstable)",
"expected": _gen_expected_string(is_unstable=True),
"description": "Unstable job",
},
{
"labels": {},
"test_matrix": '{include: [{config: "default"}]}',
"job_name": "macos-12-py3-arm64 / test (default, 1, 3, macos-m1-12, unstable)",
"expected": "keep-going=False\nis-unstable=True\nreenabled-issues=\n",
"job_name": "macos-12-py3-arm64 / test (default, 1, 3, macos-m1-stable, unstable)",
"expected": _gen_expected_string(is_unstable=True),
"description": "Unstable job",
},
{
"labels": {},
"test_matrix": '{include: [{config: "1", unstable: "unstable"}, {config: "2", unstable: "unstable"}]}',
"job_name": "macos-12-py3-arm64 / build",
"expected": "keep-going=False\nis-unstable=True\nreenabled-issues=\n",
"expected": _gen_expected_string(is_unstable=True),
"description": "All configs are unstable",
},
{
"labels": {},
"test_matrix": '{include: [{config: "1", unstable: "unstable"}, {config: "2"}]}',
"job_name": "macos-12-py3-arm64 / build",
"expected": "keep-going=False\nis-unstable=False\nreenabled-issues=\n",
"expected": _gen_expected_string(is_unstable=False),
"description": "Only mark some configs as unstable",
},
{
@ -692,7 +745,7 @@ class TestConfigFilter(TestCase):
"test_matrix": '{include: [{config: "default"}]}',
"job_name": "A job name",
"pr_body": "resolves #123 fixes #234",
"expected": "keep-going=False\nis-unstable=False\nreenabled-issues=123,234\n",
"expected": _gen_expected_string(reenabled_issues="123,234"),
"description": "Reenable some issues",
},
]
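For reference, the `_gen_expected_string` helper above expands to the full six-line GitHub output string; for example, `_gen_expected_string(keep_going=True)` evaluates to:

```python
# What _gen_expected_string(keep_going=True) evaluates to, given the f-string above.
expected = (
    "keep-going=True\n"
    "ci-verbose-test-logs=False\n"
    "ci-no-test-timeout=False\n"
    "ci-no-td=False\n"
    "is-unstable=False\n"
    "reenabled-issues=\n"
)
```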

View File

@ -16,6 +16,8 @@ from typing import Any, Dict, List, Optional
from unittest import main, mock, skip, TestCase
from urllib.error import HTTPError
from github_utils import gh_graphql
from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo
from trymerge import (
@ -26,7 +28,6 @@ from trymerge import (
get_drci_classifications,
get_rockset_results,
gh_get_team_members,
gh_graphql,
GitHubPR,
JobCheckState,
main as trymerge_main,
@ -140,11 +141,14 @@ def mock_parse_args(revert: bool = False, force: bool = False) -> Any:
self.comment_id = 0
self.reason = "this is for testing"
self.ignore_current = False
self.check_mergeability = False
return Object()
def mock_remove_label(org: str, repo: str, pr_num: str, label: str) -> None:
def mock_remove_label(
org: str, repo: str, pr_num: str, label: str, dry_run: bool
) -> None:
pass
@ -431,6 +435,13 @@ class TestTryMerge(TestCase):
assert pr._reviews is not None # to pacify mypy
self.assertGreater(len(pr._reviews), 100)
def get_co_authors(self, *args: Any) -> None:
"""Tests that co-authors are recognized"""
pr = GitHubPR("pytorch", "pytorch", 118347)
authors = pr.get_authors()
self.assertIn("kit1980", authors)
self.assertIn("Co-authored-by:", pr.gen_commit_message())
def test_get_checkruns_many_runs(self, *args: Any) -> None:
"""Tests that all checkruns can be fetched"""
pr = GitHubPR("pytorch", "pytorch", 105260)

View File

@ -39,6 +39,7 @@ from github_utils import (
gh_fetch_json_list,
gh_fetch_merge_base,
gh_fetch_url,
gh_graphql,
gh_post_commit_comment,
gh_post_pr_comment,
gh_update_pr_state,
@ -152,12 +153,14 @@ GH_COMMIT_AUTHORS_FRAGMENT = """
fragment CommitAuthors on PullRequestCommitConnection {
nodes {
commit {
author {
user {
login
authors(first: 2) {
nodes {
user {
login
}
email
name
}
email
name
}
oid
}
@ -458,19 +461,6 @@ HAS_NO_CONNECTED_DIFF_TITLE = (
IGNORABLE_FAILED_CHECKS_THESHOLD = 10
def gh_graphql(query: str, **kwargs: Any) -> Dict[str, Any]:
rc = gh_fetch_url(
"https://api.github.com/graphql",
data={"query": query, "variables": kwargs},
reader=json.load,
)
if "errors" in rc:
raise RuntimeError(
f"GraphQL query {query}, args {kwargs} failed: {rc['errors']}"
)
return cast(Dict[str, Any], rc)
def gh_get_pr_info(org: str, proj: str, pr_no: int) -> Any:
rc = gh_graphql(GH_GET_PR_INFO_QUERY, name=proj, owner=org, number=pr_no)
return rc["data"]["repository"]["pullRequest"]
@ -608,6 +598,7 @@ def parse_args() -> Any:
parser.add_argument("--revert", action="store_true")
parser.add_argument("--force", action="store_true")
parser.add_argument("--ignore-current", action="store_true")
parser.add_argument("--check-mergeability", action="store_true")
parser.add_argument("--comment-id", type=int)
parser.add_argument("--reason", type=str)
parser.add_argument("pr_num", type=int)
@ -745,7 +736,7 @@ class GitHubPR:
# work for ghstack where the base is the custom branch, i.e. gh/USER/ID/base,
# so let's just use main instead
self.merge_base = gh_fetch_merge_base(
self.org, self.project, last_commit_oid, "main"
self.org, self.project, last_commit_oid, self.default_branch()
)
# Fallback to baseRefOid if the API call fails, i.e. rate limit. Note that baseRefOid
@ -845,14 +836,14 @@ class GitHubPR:
def add_authors(info: Dict[str, Any]) -> None:
for node in info["commits_with_authors"]["nodes"]:
author_node = node["commit"]["author"]
user_node = author_node["user"]
author = f"{author_node['name']} <{author_node['email']}>"
if user_node is None:
# If author is not github user, user node will be null
authors.append(("", author))
else:
authors.append((cast(str, user_node["login"]), author))
for author_node in node["commit"]["authors"]["nodes"]:
user_node = author_node["user"]
author = f"{author_node['name']} <{author_node['email']}>"
if user_node is None:
# If author is not github user, user node will be null
authors.append(("", author))
else:
authors.append((cast(str, user_node["login"]), author))
info = self.info
for _ in range(100):
@ -948,11 +939,6 @@ class GitHubPR:
def get_authors(self) -> Dict[str, str]:
rc = {}
# TODO: replace with `self.get_commit_count()` when GraphQL pagination can be used
# to fetch all commits, see https://gist.github.com/malfet/4f35321b0c9315bcd7116c7b54d83372
# and https://support.github.com/ticket/enterprise/1642/1659119
if self.get_commit_count() <= 250:
assert len(self._fetch_authors()) == self.get_commit_count()
for idx in range(len(self._fetch_authors())):
rc[self.get_committer_login(idx)] = self.get_committer_author(idx)
@ -1068,6 +1054,7 @@ class GitHubPR:
repo: GitRepo,
skip_mandatory_checks: bool,
comment_id: Optional[int] = None,
skip_all_rule_checks: bool = False,
) -> List["GitHubPR"]:
assert self.is_ghstack_pr()
ghstack_prs = get_ghstack_prs(
@ -1082,7 +1069,7 @@ class GitHubPR:
commit_msg = pr.gen_commit_message(
filter_ghstack=True, ghstack_deps=pr_dependencies
)
if pr.pr_num != self.pr_num:
if pr.pr_num != self.pr_num and not skip_all_rule_checks:
# Raises exception if matching rule is not found
find_matching_merge_rule(
pr,
@ -1113,13 +1100,19 @@ class GitHubPR:
msg_body = re.sub(RE_GHSTACK_DESC, "", msg_body)
msg = self.get_title() + f" (#{self.pr_num})\n\n"
msg += msg_body
# Mention PR co-authors
for author_login, author_name in self.get_authors().items():
if author_login != self.get_pr_creator_login():
msg += f"\nCo-authored-by: {author_name}"
msg += f"\nPull Request resolved: {self.get_pr_url()}\n"
msg += f"Approved by: {approved_by_urls}\n"
if ghstack_deps:
msg += f"ghstack dependencies: {', '.join([f'#{pr.pr_num}' for pr in ghstack_deps])}\n"
return msg
def add_numbered_label(self, label_base: str) -> None:
def add_numbered_label(self, label_base: str, dry_run: bool) -> None:
labels = self.get_labels() if self.labels is not None else []
full_label = label_base
count = 0
@ -1127,7 +1120,7 @@ class GitHubPR:
if label_base in label:
count += 1
full_label = f"{label_base}X{count}"
gh_add_labels(self.org, self.project, self.pr_num, [full_label])
gh_add_labels(self.org, self.project, self.pr_num, [full_label], dry_run)
def merge_into(
self,
@ -1157,9 +1150,9 @@ class GitHubPR:
repo.push(self.default_branch(), dry_run)
if not dry_run:
self.add_numbered_label(MERGE_COMPLETE_LABEL)
self.add_numbered_label(MERGE_COMPLETE_LABEL, dry_run)
for pr in additional_merged_prs:
pr.add_numbered_label(MERGE_COMPLETE_LABEL)
pr.add_numbered_label(MERGE_COMPLETE_LABEL, dry_run)
if comment_id and self.pr_num:
# When the merge process reaches this part, we can assume that the commit
@ -1199,7 +1192,11 @@ class GitHubPR:
skip_mandatory_checks: bool = False,
comment_id: Optional[int] = None,
branch: Optional[str] = None,
skip_all_rule_checks: bool = False,
) -> List["GitHubPR"]:
"""
:param skip_all_rule_checks: If true, skips all rule checks, useful for dry-running merge locally
"""
branch_to_merge_into = self.default_branch() if branch is None else branch
if repo.current_branch() != branch_to_merge_into:
repo.checkout(branch_to_merge_into)
@ -1215,6 +1212,7 @@ class GitHubPR:
repo,
skip_mandatory_checks,
comment_id=comment_id,
skip_all_rule_checks=skip_all_rule_checks,
)
@ -1669,7 +1667,19 @@ def get_classifications(
# going forward. It's preferable to try calling Dr.CI API directly first
# to get the latest results as well as update Dr.CI PR comment
drci_classifications = get_drci_classifications(pr_num=pr_num, project=project)
print(f"From Dr.CI API: {json.dumps(drci_classifications)}")
def get_readable_drci_results(drci_classifications: Any) -> str:
try:
s = f"From Dr.CI API ({pr_num}):\n"
for classification, jobs in drci_classifications.items():
s += f" {classification}: \n"
for job in jobs:
s += f" {job['id']} {job['name']}\n"
return s
except Exception:
return f"From Dr.CI API: {json.dumps(drci_classifications)}"
print(get_readable_drci_results(drci_classifications))
# NB: if the latest results from Dr.CI is not available, i.e. when calling from
# SandCastle, we fallback to any results we can find on Dr.CI check run summary
@ -1882,8 +1892,8 @@ def do_revert_prs(
pr.org, pr.project, pr.pr_num, revert_message, dry_run=dry_run
)
pr.add_numbered_label("reverted", dry_run)
if not dry_run:
pr.add_numbered_label("reverted")
gh_post_commit_comment(pr.org, pr.project, commit_sha, revert_msg)
gh_update_pr_state(pr.org, pr.project, pr.pr_num)
@ -2053,7 +2063,7 @@ def merge(
print(f"Attempting merge of {initial_commit_sha} ({pr_link})")
if MERGE_IN_PROGRESS_LABEL not in pr.get_labels():
gh_add_labels(pr.org, pr.project, pr.pr_num, [MERGE_IN_PROGRESS_LABEL])
gh_add_labels(pr.org, pr.project, pr.pr_num, [MERGE_IN_PROGRESS_LABEL], dry_run)
explainer = TryMergeExplainer(
skip_mandatory_checks,
@ -2073,8 +2083,7 @@ def merge(
check_for_sev(pr.org, pr.project, skip_mandatory_checks)
if skip_mandatory_checks or can_skip_internal_checks(pr, comment_id):
# do not wait for any pending signals if PR is closed as part of co-development process
if skip_mandatory_checks:
gh_post_pr_comment(
pr.org,
pr.project,
@ -2201,8 +2210,7 @@ def merge(
# Finally report timeout back
msg = f"Merged timed out after {timeout_minutes} minutes. Please contact the pytorch_dev_infra team."
msg += f"The last exception was: {last_exception}"
if not dry_run:
gh_add_labels(pr.org, pr.project, pr.pr_num, ["land-failed"])
gh_add_labels(pr.org, pr.project, pr.pr_num, ["land-failed"], dry_run)
raise RuntimeError(msg)
@ -2281,6 +2289,16 @@ def main() -> None:
)
return
if args.check_mergeability:
if pr.is_ghstack_pr():
get_ghstack_prs(repo, pr) # raises error if out of sync
pr.merge_changes(
repo,
skip_mandatory_checks=True,
skip_all_rule_checks=True,
)
return
if not args.force and pr.has_invalid_submodule_updates():
message = (
f"This PR updates submodules {', '.join(pr.get_changed_submodules())}\n"
@ -2329,7 +2347,10 @@ def main() -> None:
else:
print("Missing comment ID or PR number, couldn't upload to Rockset")
finally:
gh_remove_label(org, project, args.pr_num, MERGE_IN_PROGRESS_LABEL)
if not args.check_mergeability:
gh_remove_label(
org, project, args.pr_num, MERGE_IN_PROGRESS_LABEL, args.dry_run
)
if __name__ == "__main__":
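As a small worked example of the numbering scheme used by `add_numbered_label` above, a standalone sketch that mirrors the same counting logic (an illustration, not the method itself):

```python
# Standalone sketch mirroring GitHubPR.add_numbered_label's counting logic (illustration only).
from typing import List

def numbered_label(label_base: str, existing_labels: List[str]) -> str:
    full_label = label_base
    count = 0
    for label in existing_labels:
        if label_base in label:
            count += 1
            full_label = f"{label_base}X{count}"
    return full_label

assert numbered_label("merged", []) == "merged"
assert numbered_label("merged", ["merged"]) == "mergedX1"
assert numbered_label("merged", ["merged", "mergedX1"]) == "mergedX2"
# With dry_run=True, the real method then calls gh_add_labels(..., dry_run=True),
# which only prints instead of labeling the PR.
```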

View File

@ -1,171 +0,0 @@
import json
import os
import subprocess
from argparse import ArgumentParser
from typing import Any, Dict
import requests
UPDATEBOT_TOKEN = os.environ["UPDATEBOT_TOKEN"]
PYTORCHBOT_TOKEN = os.environ["PYTORCHBOT_TOKEN"]
OWNER, REPO = "pytorch", "pytorch"
def git_api(
url: str, params: Dict[str, str], type: str = "get", token: str = UPDATEBOT_TOKEN
) -> Any:
headers = {
"Accept": "application/vnd.github.v3+json",
"Authorization": f"token {token}",
}
if type == "post":
return requests.post(
f"https://api.github.com{url}",
data=json.dumps(params),
headers=headers,
).json()
elif type == "patch":
return requests.patch(
f"https://api.github.com{url}",
data=json.dumps(params),
headers=headers,
).json()
else:
return requests.get(
f"https://api.github.com{url}",
params=params,
headers=headers,
).json()
def parse_args() -> Any:
parser = ArgumentParser("Rebase PR into branch")
parser.add_argument("--repo-name", type=str)
parser.add_argument("--branch", type=str)
parser.add_argument("--pin-folder", type=str)
return parser.parse_args()
def make_pr(repo_name: str, branch_name: str) -> Any:
params = {
"title": f"[{repo_name} hash update] update the pinned {repo_name} hash",
"head": branch_name,
"base": "main",
"body": "This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/"
+ f".github/workflows/_update-commit-hash.yml).\nUpdate the pinned {repo_name} hash.",
}
response = git_api(f"/repos/{OWNER}/{REPO}/pulls", params, type="post")
print(f"made pr {response['html_url']}")
return response["number"]
def approve_pr(pr_number: str) -> None:
params = {"event": "APPROVE"}
# use pytorchbot to approve the pr
git_api(
f"/repos/{OWNER}/{REPO}/pulls/{pr_number}/reviews",
params,
type="post",
token=PYTORCHBOT_TOKEN,
)
def make_comment(pr_number: str, msg: str) -> None:
params = {"body": msg}
# comment with pytorchbot because pytorchmergebot gets ignored
git_api(
f"/repos/{OWNER}/{REPO}/issues/{pr_number}/comments",
params,
type="post",
token=PYTORCHBOT_TOKEN,
)
def close_pr(pr_number: str) -> None:
params = {"state": "closed"}
git_api(
f"/repos/{OWNER}/{REPO}/pulls/{pr_number}",
params,
type="patch",
)
def is_newer_hash(new_hash: str, old_hash: str, repo_name: str) -> bool:
def _get_date(hash: str) -> int:
# this git command prints the unix timestamp of the hash
return int(
subprocess.run(
f"git show --no-patch --no-notes --pretty=%ct {hash}".split(),
capture_output=True,
cwd=f"{repo_name}",
)
.stdout.decode("utf-8")
.strip()
)
return _get_date(new_hash) > _get_date(old_hash)
def main() -> None:
args = parse_args()
branch_name = os.environ["NEW_BRANCH_NAME"]
pr_num = None
# query to see if a pr already exists
params = {
"q": f"is:pr is:open in:title author:pytorchupdatebot repo:{OWNER}/{REPO} {args.repo_name} hash update",
"sort": "created",
}
response = git_api("/search/issues", params)
if response["total_count"] != 0:
# pr does exist
pr_num = response["items"][0]["number"]
link = response["items"][0]["html_url"]
response = git_api(f"/repos/{OWNER}/{REPO}/pulls/{pr_num}", {})
branch_name = response["head"]["ref"]
print(
f"pr does exist, number is {pr_num}, branch name is {branch_name}, link is {link}"
)
hash = (
subprocess.run(
f"git rev-parse {args.branch}".split(),
capture_output=True,
cwd=f"{args.repo_name}",
)
.stdout.decode("utf-8")
.strip()
)
with open(f"{args.pin_folder}/{args.repo_name}.txt", "r+") as f:
old_hash = f.read().strip()
subprocess.run(f"git checkout {old_hash}".split(), cwd=args.repo_name)
f.seek(0)
f.truncate()
f.write(f"{hash}\n")
if is_newer_hash(hash, old_hash, args.repo_name):
# if there was an update, push to branch
subprocess.run(f"git checkout -b {branch_name}".split())
subprocess.run(f"git add {args.pin_folder}/{args.repo_name}.txt".split())
subprocess.run(
"git commit -m".split() + [f"update {args.repo_name} commit hash"]
)
subprocess.run(f"git push --set-upstream origin {branch_name} -f".split())
print(f"changes pushed to branch {branch_name}")
if pr_num is None:
# no existing pr, so make a new one and approve it
pr_num = make_pr(args.repo_name, branch_name)
approve_pr(pr_num)
make_comment(pr_num, "@pytorchbot merge")
else:
print(
f"tried to update from old hash: {old_hash} to new hash: {hash} but the old hash seems to be newer, not creating pr"
)
if pr_num is not None:
make_comment(pr_num, "closing pr as the current hash seems up to date")
close_pr(pr_num)
print(f"closing PR {pr_num}")
if __name__ == "__main__":
main()

View File

@ -8,7 +8,7 @@
# NOTE: If testing pytorch/builder changes you can change this variable to change what pytorch/builder reference
# the binary builds will check out
{%- set builder_repo = "pytorch/builder" -%}
{%- set builder_branch = "main" -%}
{%- set builder_branch = "release/2.3" -%}
{%- macro concurrency(build_environment) -%}
concurrency:

View File

@ -7,6 +7,7 @@
name: !{{ build_environment }}
{%- endblock %}
on:
push:
{%- if branches == "nightly" %}
@ -99,8 +100,8 @@ jobs:
with:
name: !{{ config["build_name"] }}
path: "${{ runner.temp }}/artifacts/"
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch, checkout_pr_head=False) }}
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"

View File

@ -81,8 +81,8 @@ jobs:
elif [ -d "/Applications/Xcode_13.3.1.app" ]; then
echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}"
fi
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch, checkout_pr_head=False) }}
- name: Install sccache (only for non-forked PRs, and pushes to trunk)
uses: nick-fields/retry@v2.8.2
if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }}

View File

@ -53,6 +53,9 @@
{%- macro upload_binaries(config, is_windows=False, has_test=True, use_s3=True) -%}
!{{ config["build_name"] }}-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
{%- if has_test %}
needs: !{{ config["build_name"] }}-test
{%- else %}
@ -65,8 +68,6 @@
{%- endif %}
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
aws-pytorch-uploader-access-key-id: ${{ secrets.AWS_PYTORCH_UPLOADER_ACCESS_KEY_ID }}
aws-pytorch-uploader-secret-access-key: ${{ secrets.AWS_PYTORCH_UPLOADER_SECRET_ACCESS_KEY }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml

View File

@ -65,8 +65,8 @@ jobs:
steps:
!{{ common.setup_ec2_windows() }}
!{{ set_runner_specific_vars() }}
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch, checkout_pr_head=False) }}
- name: Populate binary env
shell: bash
run: |
@ -105,8 +105,8 @@ jobs:
with:
name: !{{ config["build_name"] }}
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch, checkout_pr_head=False) }}
- name: Populate binary env
shell: bash
run: |

View File

@ -37,7 +37,7 @@ jobs:
keep-going: ${{ steps.filter.outputs.keep-going }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
with:
fetch-depth: 1
submodules: false
@ -59,25 +59,25 @@ jobs:
runs-on: ${{ matrix.runner }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.3
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.3
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.3
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -131,7 +131,7 @@ jobs:
export COMMAND
# shellcheck disable=SC2016
COMMAND='(echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh" | docker exec -u jenkins -e BUILD_LITE_INTERPRETER -e GRADLE_OFFLINE=1 -i "$id" bash) 2>&1'
COMMAND='(echo "sudo chown -R jenkins workspace && cd workspace && ./scripts/build_android_gradle.sh" | docker exec -u jenkins -e BUILD_LITE_INTERPRETER -e GRADLE_OFFLINE=1 -i "$id" bash) 2>&1'
echo "${COMMAND}" > ./command.sh && bash ./command.sh
# Skip docker push as this job is purely for size analysis purpose.
# Result binaries are already in `/home/circleci/project/` as it's mounted instead of copied.
@ -141,5 +141,5 @@ jobs:
if: always()
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.3
if: always()

View File

@ -37,7 +37,7 @@ jobs:
keep-going: ${{ steps.filter.outputs.keep-going }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
with:
fetch-depth: 1
submodules: false
@ -59,25 +59,25 @@ jobs:
runs-on: ${{ matrix.runner }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.3
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.3
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.3
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -157,7 +157,7 @@ jobs:
docker cp "${GITHUB_WORKSPACE}/build_android_install_x86_32" "${ID_X86_32}:/var/lib/jenkins/workspace/build_android_install_x86_32"
# run gradle buildRelease
(echo "./.circleci/scripts/build_android_gradle.sh" | docker exec \
(echo "./scripts/build_android_gradle.sh" | docker exec \
-e BUILD_ENVIRONMENT="pytorch-linux-focal-py3-clang9-android-ndk-r21e-gradle-build" \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e AWS_DEFAULT_REGION \
@ -186,5 +186,5 @@ jobs:
if: always()
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.3
if: always()

View File

@ -42,7 +42,7 @@ jobs:
reenabled-issues: ${{ steps.filter.outputs.reenabled-issues }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
with:
fetch-depth: 1
submodules: false
@ -64,30 +64,30 @@ jobs:
runs-on: ${{ matrix.runner }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.3
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.3
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.3
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
uses: pytorch/test-infra/.github/actions/setup-nvidia@release/2.3
if: ${{ inputs.cuda-version != 'cpu' }}
- name: Output disk space left
@ -196,5 +196,5 @@ jobs:
file-suffix: bazel-${{ github.job }}_${{ steps.get-job-id.outputs.job-id }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.3
if: always()

View File

@ -78,7 +78,7 @@ on:
jobs:
build:
runs-on: ${{ inputs.runs_on }}
timeout-minutes: 180
timeout-minutes: 210
env:
PYTORCH_ROOT: ${{ inputs.PYTORCH_ROOT }}
BUILDER_ROOT: ${{ inputs.BUILDER_ROOT }}
@ -139,13 +139,13 @@ jobs:
run: env
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.3
continue-on-error: true
with:
github-secret: ${{ secrets.github-token }}
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
with:
no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' }}
@ -173,7 +173,6 @@ jobs:
- name: Checkout PyTorch to pytorch dir
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
@ -187,7 +186,7 @@ jobs:
- name: Checkout pytorch/builder to builder dir
uses: malfet/checkout@silent-checkout
with:
ref: main
ref: release/2.3
submodules: recursive
repository: pytorch/builder
path: builder
@ -213,7 +212,7 @@ jobs:
- name: Pull Docker image
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' }}
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.3
with:
docker-image: ${{ inputs.DOCKER_IMAGE }}
@ -270,7 +269,7 @@ jobs:
- name: Teardown Linux
if: always()
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.3
- name: Chown workspace
if: always()

View File

@ -127,14 +127,14 @@ jobs:
} >> "${GITHUB_ENV} }}"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.3
continue-on-error: true
with:
github-secret: ${{ secrets.github-token }}
# Setup the environment
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
with:
no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' }}
@ -155,7 +155,6 @@ jobs:
- name: Checkout PyTorch to pytorch dir
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
@ -168,7 +167,7 @@ jobs:
- name: Checkout pytorch/builder to builder dir
uses: malfet/checkout@silent-checkout
with:
ref: main
ref: release/2.3
submodules: recursive
repository: pytorch/builder
path: builder
@ -199,12 +198,12 @@ jobs:
path: "${{ runner.temp }}/artifacts/"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
uses: pytorch/test-infra/.github/actions/setup-nvidia@release/2.3
if: ${{ inputs.GPU_ARCH_TYPE == 'cuda' && steps.filter.outputs.is-test-matrix-empty == 'False' }}
- name: Pull Docker image
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' }}
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.3
with:
docker-image: ${{ inputs.DOCKER_IMAGE }}
@ -214,7 +213,7 @@ jobs:
- name: Teardown Linux
if: always()
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.3
- name: Chown workspace
if: always()

View File

@ -59,18 +59,13 @@ on:
github-token:
required: true
description: Github Token
aws-pytorch-uploader-access-key-id:
required: true
description: AWS access key id
aws-pytorch-uploader-secret-access-key:
required: true
description: AWS secret access key
conda-pytorchbot-token:
required: true
description: Conda PyTorchBot token
conda-pytorchbot-token-test:
required: true
description: Conda PyTorchBot token
jobs:
upload:
runs-on: ubuntu-22.04
@ -100,10 +95,24 @@ jobs:
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
with:
no-sudo: true
- name: Configure AWS credentials(PyTorch account) for nightly
if: ${{ github.event_name == 'push' && github.event.ref == 'refs/heads/nightly' }}
uses: aws-actions/configure-aws-credentials@v3
with:
role-to-assume: arn:aws:iam::749337293305:role/gha_workflow_nightly_build_wheels
aws-region: us-east-1
- name: Configure AWS credentials(PyTorch account) for RC builds
if: ${{ github.event_name == 'push' && (startsWith(github.event.ref, 'refs/tags/') && !startsWith(github.event.ref, 'refs/tags/ciflow/')) }}
uses: aws-actions/configure-aws-credentials@v3
with:
role-to-assume: arn:aws:iam::749337293305:role/gha_workflow_test_build_wheels
aws-region: us-east-1
- name: Download Build Artifacts
id: download-artifacts
# NB: When the previous build job is skipped, there won't be any artifacts and
@ -135,8 +144,6 @@ jobs:
PKG_DIR: "${{ runner.temp }}/artifacts"
UPLOAD_SUBFOLDER: "${{ env.DESIRED_CUDA }}"
# When running these on pull_request events these should be blank
AWS_ACCESS_KEY_ID: ${{ secrets.aws-pytorch-uploader-access-key-id }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.aws-pytorch-uploader-secret-access-key }}
CONDA_PYTORCHBOT_TOKEN: ${{ secrets.conda-pytorchbot-token }}
CONDA_PYTORCHBOT_TOKEN_TEST: ${{ secrets.conda-pytorchbot-token-test }}
BUILD_NAME: ${{ inputs.build_name }}

View File

@ -23,7 +23,7 @@ jobs:
keep-going: ${{ steps.filter.outputs.keep-going }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
with:
fetch-depth: 1
submodules: false
@ -44,7 +44,7 @@ jobs:
runs-on: ${{ matrix.runner }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
- name: Set up JDK 8
uses: actions/setup-java@v3
@ -53,7 +53,7 @@ jobs:
distribution: 'temurin'
- name: Setup miniconda
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
uses: pytorch/test-infra/.github/actions/setup-miniconda@release/2.3
with:
python-version: 3.8
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}

View File

@ -66,7 +66,7 @@ jobs:
name: build-docs-${{ matrix.docs_type }}-${{ inputs.push }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.3
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
instructions: |
@ -77,19 +77,19 @@ jobs:
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.3
with:
docker-image-name: ${{ inputs.docker-image }}
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.3
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -187,5 +187,5 @@ jobs:
s3-prefix: pytorch/pytorch/${{ github.event.pull_request.number }}/functorchdocs
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.3
if: always()

View File

@ -46,7 +46,7 @@ jobs:
keep-going: ${{ steps.filter.outputs.keep-going }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
with:
fetch-depth: 1
submodules: false
@ -80,7 +80,7 @@ jobs:
steps:
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
- name: Populate CI build options
shell: bash
@ -102,7 +102,7 @@ jobs:
brew install libtool
- name: Setup miniconda for iOS
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
uses: pytorch/test-infra/.github/actions/setup-miniconda@release/2.3
with:
python-version: "3.9"
environment-file: .github/requirements/conda-env-iOS.txt

View File

@ -73,7 +73,7 @@ jobs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.3
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
@ -82,14 +82,14 @@ jobs:
# checkout because when we run this action we don't *have* a local
# checkout. In other cases you should prefer a local checkout.
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.3
with:
docker-image-name: ${{ inputs.docker-image-name }}
@ -103,7 +103,7 @@ jobs:
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.3
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -209,5 +209,5 @@ jobs:
path: sccache-stats-*.json
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.3
if: always()

View File

@ -57,7 +57,7 @@ jobs:
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.3
if: ${{ !contains(matrix.runner, 'gcp.a100') }}
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
@ -66,14 +66,14 @@ jobs:
docker exec -it $(docker container ps --format '{{.ID}}') bash
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.3
with:
docker-image-name: ${{ inputs.docker-image }}
@ -87,13 +87,13 @@ jobs:
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.3
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
uses: pytorch/test-infra/.github/actions/setup-nvidia@release/2.3
if: contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu')
- name: Lock NVIDIA A100 40GB Frequency
@ -117,6 +117,10 @@ jobs:
with:
name: ${{ inputs.build-environment }}
- name: Download TD artifacts
continue-on-error: true
uses: ./.github/actions/download-td-artifacts
- name: Parse ref
id: parse-ref
run: .github/scripts/parse_ref.py
@ -169,6 +173,9 @@ jobs:
NUM_TEST_SHARDS: ${{ matrix.num_shards }}
REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}
NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}
@ -218,6 +225,9 @@ jobs:
-e NUM_TEST_SHARDS \
-e REENABLED_ISSUES \
-e CONTINUE_THROUGH_ERROR \
-e VERBOSE_TEST_LOGS \
-e NO_TEST_TIMEOUT \
-e NO_TD \
-e PR_LABELS \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e SCCACHE_BUCKET \
@ -297,7 +307,7 @@ jobs:
path: ./**/core.[1-9]*
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.3
if: always()
# NB: We are currently having an intermittent GPU-related issue on G5 runners with

View File

@ -71,11 +71,11 @@ jobs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
steps:
- name: Clean up disk space before running MacOS workflow
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.3
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
- name: Set xcode version
env:
@ -87,7 +87,7 @@ jobs:
- name: Setup miniconda
if: inputs.environment-file == ''
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
uses: pytorch/test-infra/.github/actions/setup-miniconda@release/2.3
with:
python-version: ${{ inputs.python-version }}
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
@ -97,7 +97,7 @@ jobs:
# environment even though the arch is x86-64
- name: Setup miniconda using the provided environment file
if: inputs.environment-file != ''
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
uses: pytorch/test-infra/.github/actions/setup-miniconda@release/2.3
with:
python-version: ${{ inputs.python-version }}
environment-file: ${{ inputs.environment-file }}
@ -207,4 +207,4 @@ jobs:
- name: Clean up disk space
if: always()
continue-on-error: true
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.3

View File

@ -34,12 +34,14 @@ jobs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
is-test-matrix-empty: ${{ steps.filter.outputs.is-test-matrix-empty }}
keep-going: ${{ steps.filter.outputs.keep-going }}
ci-verbose-test-logs: ${{ steps.filter.outputs.ci-verbose-test-logs }}
ci-no-test-timeout: ${{ steps.filter.outputs.ci-no-test-timeout }}
ci-no-td: ${{ steps.filter.outputs.ci-no-td }}
reenabled-issues: ${{ steps.filter.outputs.reenabled-issues }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
with:
fetch-depth: 1
submodules: false
- name: Select all requested test configurations
@ -79,7 +81,7 @@ jobs:
use-gha: true
- name: Setup miniconda
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
uses: pytorch/test-infra/.github/actions/setup-miniconda@release/2.3
with:
python-version: ${{ inputs.python-version }}
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
@ -95,6 +97,9 @@ jobs:
PY_VERS: 3.9
PR_BODY: ${{ github.event.pull_request.body }}
CONTINUE_THROUGH_ERROR: ${{ needs.filter.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ needs.filter.outputs.ci-verbose-test-logs }}
NO_TEST_TIMEOUT: ${{ needs.filter.outputs.ci-no-test-timeout }}
NO_TD: ${{ needs.filter.outputs.ci-no-td }}
PIP_REQUIREMENTS_FILE: .github/requirements/pip-requirements-${{ runner.os }}.txt
REENABLED_ISSUES: ${{ needs.filter.outputs.reenabled-issues }}
run: |
@ -154,4 +159,4 @@ jobs:
- name: Clean up disk space
if: always()
continue-on-error: true
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.3

View File

@ -79,11 +79,11 @@ jobs:
done
- name: Clean up disk space before running MacOS workflow
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.3
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.3
- name: Download build artifacts
uses: ./.github/actions/download-build-artifacts
@ -91,8 +91,14 @@ jobs:
name: ${{ inputs.build-environment }}
use-gha: true
- name: Download TD artifacts
continue-on-error: true
uses: ./.github/actions/download-td-artifacts
with:
use-gha: true
- name: Setup miniconda
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
uses: pytorch/test-infra/.github/actions/setup-miniconda@release/2.3
with:
python-version: ${{ inputs.python-version }}
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
@ -148,6 +154,9 @@ jobs:
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: ${{ matrix.mem_leak_check && '1' || '0' }}
PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }}
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}
NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
PIP_REQUIREMENTS_FILE: .github/requirements/pip-requirements-${{ runner.os }}.txt
GITHUB_REPOSITORY: ${{ github.repository }}
GITHUB_WORKFLOW: ${{ github.workflow }}
@ -218,4 +227,4 @@ jobs:
- name: Clean up disk space
if: always()
continue-on-error: true
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.3

Some files were not shown because too many files have changed in this diff.