Compare commits

...

4488 Commits

Author SHA1 Message Date
e0c728c545 Changes for release 2.0 only (#94934)
* Changes for release 2.0 only

* Delete the refs during pytorch checkout

* Bug fix and add xla r2.0 hash
2023-02-15 18:08:38 -05:00
dbcd11f3a7 try to fix OSS CI error (#94785) (#94936)
Differential Revision: D43259005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94785
Approved by: https://github.com/weiwangmeta, https://github.com/digantdesai

Co-authored-by: Cuiqing Li <cuiqingli123@meta.com>
2023-02-15 17:47:26 -05:00
3ace14eb8b [Bug fix] sparse_mask: wrong intersection on CUDA (#94829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94829
Approved by: https://github.com/cpuhrsch
2023-02-15 13:22:39 +00:00
0c3ba78568 [FSDP] Fix clip_grad_norm_() when rank has no local gradients (#94835)
`functools.reduce()` requires non-empty input. We need to add a case for `len(grads) == 0`.
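
A minimal sketch (not the actual FSDP code; names like `local_norm` are illustrative) of the failure mode and the guard described above:

```python
import functools
import torch

grads = []  # this rank owns no local gradients

# functools.reduce() with no initial value raises TypeError on an empty sequence,
# so the norm computation must special-case len(grads) == 0.
if len(grads) == 0:
    local_norm = torch.tensor(0.0)
else:
    local_norm = functools.reduce(
        torch.maximum, [g.detach().abs().max() for g in grads]
    )
print(local_norm)  # tensor(0.)
```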

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94835
Approved by: https://github.com/zhaojuanmao
2023-02-15 12:28:03 +00:00
8da776e3a7 [FSDP] Fix "use-after-free" in reshard logic (#94859)
**Overview**
This PR switches the order of freeing the unsharded `FlatParameter` (`self._free_unsharded_flat_param()`) and switching to use the sharded `FlatParameter` (`self._use_sharded_flat_param()`). This is to prevent "use-after-free"-type bugs where, for `param.data = new_data`, `param` has its metadata intact but not its storage, causing an illegal memory access for any instrumentation that depends on its storage. (`param` is an original parameter and `new_data` is either a view into the sharded `FlatParameter` or `torch.empty(0)` depending on the sharding and rank.)

**Details**
To see why simply switching the order of the two calls is safe, let us examine the calls themselves:
652457b1b7/torch/distributed/fsdp/flat_param.py (L1312-L1339)

652457b1b7/torch/distributed/fsdp/flat_param.py (L1298-L1310)

- `_free_unsharded_flat_param()` does not make any assumption that `self.flat_param`'s data is the sharded `FlatParameter` (i.e. `_local_shard`).
- The sharded `FlatParameter` (i.e. `_local_shard`) is always present in memory, which means that FSDP can use sharded views at any time, including before freeing the unsharded data.
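
A schematic sketch of the reordering (hypothetical wrapper for illustration only; the real logic lives in `flat_param.py`):

```python
# Point original parameters at the (always-resident) sharded views first, then
# free the unsharded storage, so no parameter is ever left with its metadata
# intact but its storage freed.
def _reshard(handle):
    handle._use_sharded_flat_param()      # param.data -> views into _local_shard
    handle._free_unsharded_flat_param()   # safe now: nothing references the unsharded data
```
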
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94859
Approved by: https://github.com/zhaojuanmao, https://github.com/fegin
2023-02-15 12:16:20 +00:00
5a54537918 Add further info to masked_scatter and masked_scatter_ documention (#94545)
Fixes #94353

This PR adds examples and further info to the in-place and out-of-place masked scatter functions' documentation, according to what was proposed in the linked issue. Looking forward to any suggested changes you may have as I continue to familiarize myself with PyTorch 🙂
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94545
Approved by: https://github.com/lezcano
2023-02-15 07:50:47 +00:00
5705199fb1 Update smoke test threshold (#94888)
https://github.com/pytorch/pytorch/pull/94249 touched upon what values we should set. It turns out 1.17 is too high, as seemingly innocent commits are failing to yield 1.17x. They yielded ~1.168x.
https://github.com/pytorch/pytorch/actions/runs/4180998255/jobs/7242758816
<img width="881" alt="image" src="https://user-images.githubusercontent.com/109318740/218951536-476d3764-1aa6-481b-bd92-f55d1c50e385.png">

Setting it to 1.165x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94888
Approved by: https://github.com/ngimel
2023-02-15 07:29:41 +00:00
77d1135566 [ROCm] Pyt 2.0 rocm staging (#94660)
Add triton support for ROCm builds of PyTorch.

* Enables inductor and dynamo when rocm is detected
* Adds support for pytorch-triton-mlir backend
* Adds check_rocm support for verify_dynamo.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94660
Approved by: https://github.com/malfet
2023-02-15 06:15:18 +00:00
71ec2617d2 [MPS] Block uint8 data type for unary and binary ops on macOS 12 (#94876)
Blocks uint8 data type for unary and binary ops on macOS 12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94876
Approved by: https://github.com/kulinseth
2023-02-15 06:09:56 +00:00
8261c600b7 Update ideep to add primitive cache for ARM (#94719)
### Description
This PR is to update ideep to add primitive cache in order to speed up ARM's PyTorch workloads.
Fixes #94264.

### Performance test
Use TorchBench test in ICX with 40 cores
Intel OpenMP & jemalloc were preloaded
![image](https://user-images.githubusercontent.com/61222868/218937895-c97f5a5f-644b-4113-a3f5-7fe11fad7516.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94719
Approved by: https://github.com/jgong5
2023-02-15 05:46:39 +00:00
c10acb834d Revert "Temporarily disable inductor torchbench test (#94873)"
This reverts commit 79b7c697a48128265162f6112b4ef534683d2ce1.

Reverted https://github.com/pytorch/pytorch/pull/94873 on behalf of https://github.com/kit1980 due to The tests should pass now
2023-02-15 04:22:06 +00:00
e0a954f531 call zero_grad in foreach/fused optimizers tests (#94724)
the tests calling this method haven't failed because `iter` is a built-in function's name
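
A hypothetical illustration (not the actual test code) of why a stale reference to `iter` fails silently instead of raising a `NameError`:

```python
# Because `iter` is a built-in, this condition never raises NameError even though
# no loop variable named `iter` exists -- it just compares a function to 0, which
# is always False, so zero_grad() is silently never exercised.
def run_steps(optimizer, num_steps):
    for step in range(num_steps):
        if iter == 0:              # bug: meant `step == 0`
            optimizer.zero_grad()
        # ... forward / backward / optimizer.step() ...
```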

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94724
Approved by: https://github.com/Skylion007
2023-02-15 04:14:34 +00:00
afadc3697a [ONNX] Fix assert in cat (#94870)
The assert statement blocks tensors with unknown ranks. This change unblocks those cases. Needed for https://github.com/pytorch/vision/pull/7056

Verified against https://github.com/pytorch/vision/pull/7056
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94870
Approved by: https://github.com/BowenBao
2023-02-15 04:09:59 +00:00
3d5f4dcc4d Update vision commit pin (#94874)
To 0bdd01a79a that removes usage of `torch._six`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94874
Approved by: https://github.com/kit1980
2023-02-15 03:27:48 +00:00
117fafc260 [CI] Install pytorch-cuda for conda testing (#94852)
Also, install it from the nightly channel, if `TORCH_CONDA_BUILD_FOLDER` is set to nightly

Discovered after doing a bit more GPU smoke testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94852
Approved by: https://github.com/atalman, https://github.com/Skylion007
2023-02-15 03:14:32 +00:00
79b7c697a4 Temporarily disable inductor torchbench test (#94873)
The test is failing with "ModuleNotFoundError: No module named 'torchbenchmark.models.fb'" because of some updates of torchbench deps.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94873
Approved by: https://github.com/malfet
2023-02-15 02:07:08 +00:00
abf59f5703 Make _simplified kwarg private (#94782)
CR on https://github.com/pytorch/pytorch/pull/94404

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94782
Approved by: https://github.com/voznesenskym
2023-02-15 01:52:16 +00:00
ae57bd6630 PT2/TorchScript interoperability fix (#94678)
Allows torch.compile() to inline into ScriptFunction

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94678
Approved by: https://github.com/ezyang
2023-02-15 01:21:10 +00:00
b6443fca86 [ONNX] Wrap op validation inputs and add export_options.py and function_dispatcher.py (#94721)
1. `_validate_op_between_ort_torch` inputs were not wrapped (preprocessed) properly.
2. Introduce function_dispatcher.py to store the decomposition table (aten/prim) and ATenLib
3. Introduce ~~export_options.py~~ options.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94721
Approved by: https://github.com/BowenBao
2023-02-15 00:59:59 +00:00
5bc72bd019 sym_int simplification for integer args, attempt 3 (#94799)
Per title, now propagates to inductor codegen.
Where should I put the test, and what should the test look like?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94799
Approved by: https://github.com/ezyang
2023-02-15 00:31:19 +00:00
65b998325c [inductor] Disable developer warnings for "2.0.0" version (#94845)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94845
Approved by: https://github.com/wconstab
2023-02-15 00:09:26 +00:00
7f7f91e36f add reproducibility notes to nn.UnpoolND operations (#94629)
In response to some comments here: #80827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94629
Approved by: https://github.com/albanD
2023-02-15 00:06:48 +00:00
7c44823a4e Fix layout/device checks in sparse-dense addmm (#94843)
Resolves #94684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94843
Approved by: https://github.com/cpuhrsch
2023-02-14 23:23:26 +00:00
40cb494b1a Switch Docker release to CUDA 11.7 (#94818)
Switch Docker release to CUDA 11.7
Remove `ptxas` installation logic as Triton is now bundled with ptxas
Successful run: https://github.com/pytorch/pytorch/actions/runs/4176843201/jobs/7233661196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94818
Approved by: https://github.com/malfet
2023-02-14 23:10:57 +00:00
98012e4a59 [ROCm] hipGraph support for pytorch mainline (#88202)
With the release of ROCm 5.3, HIP now supports a hipGraph implementation.

All necessary backend work and hipification is done to support the same functionality as cudaGraph.

Unit tests are modified to support a new TEST_GRAPH feature, which allows us to create a single check for graph support instead of attempting to gather the CUDA level in annotations for every graph test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88202
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2023-02-14 22:18:56 +00:00
79783a51da [torchgen] Loosen the restriction for only allowing 2 nested namespaces for kernels (#94834)
As titled. We still want to have some restriction to avoid misuse, but for internal use cases we want to change the limit from 2 to 3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94834
Approved by: https://github.com/SS-JIA
2023-02-14 21:50:12 +00:00
7ef76ce6c3 Preloads more nvidia pypi library for multi arch distributions (#94355)
Following the same logic of preloading cudnn and cublas from the pypi folder in multi-arch distributions, where Pure-lib vs Plat-lib matters, this PR adds the logic for the rest of the cuda pypi libraries that were integrated.

I have tested this PR by running the code block locally and installing/uninstalling nvidia pypi libraries:

```
import sys
import os

def _preload_cuda_deps():
    """Preloads cudnn/cublas deps if they could not be found otherwise."""
    # Should only be called on Linux if default path resolution have failed

    cuda_libs = {
        'cublas': 'libcublas.so.11',
        'cudnn': 'libcudnn.so.8',
        'cuda_nvrtc': 'libnvrtc.so.11.2',
        'cuda_runtime': 'libcudart.so.11.0',
        'cuda_cupti': 'libcupti.so.11.7',
        'cufft': 'libcufft.so.10',
        'curand': 'libcurand.so.10',
        'cusolver': 'libcusolver.so.11',
        'cusparse': 'libcusparse.so.11',
        'nccl': 'libnccl.so.2',
        'nvtx': 'libnvToolsExt.so.1',
    }
    cuda_libs_paths = {lib_folder: None for lib_folder in cuda_libs.keys()}

    for path in sys.path:
        nvidia_path = os.path.join(path, 'nvidia')
        if not os.path.exists(nvidia_path):
            continue
        for lib_folder, lib_name in cuda_libs.items():
            candidate_path = os.path.join(nvidia_path, lib_folder, 'lib', lib_name)
            if os.path.exists(candidate_path) and not cuda_libs_paths[lib_folder]:
                cuda_libs_paths[lib_folder] = candidate_path
        if all(cuda_libs_paths.values()):
            break
    if not all(cuda_libs_paths.values()):
        none_libs = [lib for lib in cuda_libs_paths if not cuda_libs_paths[lib]]
        raise ValueError(f"{', '.join(none_libs)} not found in the system path {sys.path}")

_preload_cuda_deps()
```

I don't have access to a multi-arch environment, so if somebody could verify a wheel with this patch on a multi-arch distribution, that would be great!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94355
Approved by: https://github.com/atalman
2023-02-14 21:47:33 +00:00
97510c6d50 Convert operator.not_ to torch.logical_not (#94626)
If the input to operator.not_ is a tensor, I want to convert the operator to a torch.logical_not. This allows the following test case to pass. Beforehand it resulted in the error `NotImplementedError("local_scalar_dense/item NYI for torch.bool")`

```
    def test_export_tensor_bool_not(self):
        def true_fn(x, y):
            return x + y

        def false_fn(x, y):
            return x - y

        def f(x, y):
            return cond(not torch.any(x), true_fn, false_fn, [x, y])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94626
Approved by: https://github.com/voznesenskym
2023-02-14 21:45:48 +00:00
69bcefceec [ROCm] Added MIOpen header files to installation package for ROCm. (#92969)
Added MIOpen header files to the installation package for building PyTorch extensions that require MIOpen as a dependency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92969
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2023-02-14 21:43:31 +00:00
989299802c Use s3 for some test infra files (#94642)
companion to https://github.com/pytorch/test-infra/pull/2756
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94642
Approved by: https://github.com/huydhn
2023-02-14 19:45:41 +00:00
63bf7674fa add backwards for gelu and relu on nested tensors. (#94776)
Fixes #94701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94776
Approved by: https://github.com/cpuhrsch
2023-02-14 18:42:06 +00:00
b7e1477e9b Improve leaky relu doc (#94090)
Fixes #83821

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94090
Approved by: https://github.com/jbschlosser
2023-02-14 17:58:51 +00:00
33f13fc959 Fix XNNPACK missing symbol from post-operation.c (#94768)
Summary: Fix RL team XNNPACK xnn_mutex.h issue.

Test Plan: buck2 test

Reviewed By: kirklandsign

Differential Revision: D43243129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94768
Approved by: https://github.com/kirklandsign, https://github.com/digantdesai
2023-02-14 17:17:39 +00:00
4a5ce921a0 Add HPU to compatible shallow copy list and remove lazy HPU changes (#94673)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94673
Approved by: https://github.com/wconstab
2023-02-14 17:15:25 +00:00
5c64d2141f [ONNX] Add ExportOptions and op_level_debug mode (#94720)
Add op_level_debug to turn on/off op-level validation with ORT during export. Also, integrate all export setting parameters into an ExportOptions class to avoid the complexity of passing parameters around among functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94720
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2023-02-14 16:39:34 +00:00
3fc4bc115f [functorch] jacrev, jacfwd error for complex input or output (#94805)
Related: https://github.com/pytorch/pytorch/issues/94397, https://github.com/pytorch/pytorch/issues/94397#issuecomment-1428452756
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94805
Approved by: https://github.com/lezcano
2023-02-14 16:13:37 +00:00
18d93cdc5d [CI] Use prebuilt triton from nightly repo (#94732)
No point in building from source if it was prebuilt already

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94732
Approved by: https://github.com/DanilBaibak, https://github.com/atalman, https://github.com/huydhn, https://github.com/jansel
2023-02-14 15:51:23 +00:00
57b22bc6d8 [Dynamo] Backend registration with `entry_points` (#93873)
Fixes #91824

This PR adds a new dynamo backend registration mechanism through ``entry_points``. The ``entry_points`` of a package provide a way for the package to register a plugin for another one.

The docs of the new mechanism:
![image](https://user-images.githubusercontent.com/23381083/216133221-18cf18e2-6ad6-4cf7-8da2-9b9b883389c8.png)
(the typo '...named "my_backend" that has been..." has been fixed to '...named "my_compiler" that has been...')

# Discussion

## About the test
I did not add a test for this PR, as it is hard to either install a fake package during a test or manually hack the entry points function by replacing it with a fake one. I have tested this PR offline with the hidet compiler and it works fine. Please let me know if you have any good idea on how to test this PR.

## About the dependency of ``importlib_metadata``
This PR will add a dependency on ``importlib_metadata`` for Python < 3.10 because the modern usage of ``importlib`` became stable in this Python version (see the documentation of the importlib package [here](https://docs.python.org/3/library/importlib.html)). For Python < 3.10, the package ``importlib_metadata`` implements the same features as ``importlib``. The current PR will hint the user to install ``importlib_metadata`` if their Python version is < 3.10.

## About the name and docs
Please let me know what you think about ``torch_dynamo_backend`` as the entry point group name and about the documentation of this registration mechanism.
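
For illustration, a minimal `setup.py` sketch (package and function names are made up) of how a third-party compiler could register itself under the proposed group name:

```python
from setuptools import setup

setup(
    name="my_compiler_pkg",
    packages=["my_compiler_pkg"],
    entry_points={
        # group name proposed in this PR
        "torch_dynamo_backend": [
            "my_compiler = my_compiler_pkg.backend:my_compiler_fn",
        ],
    },
)
```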

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93873
Approved by: https://github.com/malfet, https://github.com/jansel
2023-02-14 15:44:25 +00:00
94f0808629 [MPS] Add fmod op. (#94722)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94722
Approved by: https://github.com/DenisVieriu97
2023-02-14 14:55:26 +00:00
d1d5d16df3 dynamo: handle straight-line graph breaks for autocast context manager with constant args (#94137)
Fixes https://github.com/pytorch/pytorch/issues/93890

We do the following:
1. fix the `__init__` constructor for `AutocastModeVariable` with an existing `mode` while copying
2. make `resume_execution` aware of constant args (`target_values`) by storing said args in `ReenterWith`. To propagate between subgraphs (in straight-line code), we also store the constant args in the downstream's `code_options["co_consts"]` if not already present (a minimal sketch of the targeted pattern follows below).
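
A minimal sketch of the kind of straight-line pattern this targets (assumed repro shape, not the test from this PR):

```python
import torch

def fn(x, y):
    with torch.autocast("cpu", dtype=torch.bfloat16):
        a = torch.mm(x, y)
        print("graph break")   # forces a graph break inside the context manager
        # the resumed subgraph must re-enter autocast with the same constant args
        b = torch.mm(a, y)
    return b

opt_fn = torch.compile(fn)
```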

---

Future work:
1. handle instantiating context manager in non-inlineable functions. Simultaneously fix nested grad mode bug.
2. generalize to general `ContextManager`s
3. generalize to variable arguments passed to context manager, with guards around the variable.

---

Actually, if we look at the repro: 74592a43d0/test/dynamo/test_repros.py (L1249), we can see that the method in this PR doesn't work for graph breaks in function calls, in particular, in function calls that don't get inlined.

Why inlining functions with graph breaks is hard:
- When we handle graph breaks, we create a new code object for the remainder of the code. It's hard to imagine doing this when you are inside a function, because then we would need a frame stack, and we just want to deal with the current frame as a sequence of straight-line code.

Why propagating context manager information is hard:
- If we do not inline the function, the frame does not contain any information about the parent `block_stack` or `co_consts`. So we cannot store it on local objects like the eval frame. It has to be a global object in the output_graph.

---

Anyway, I'm starting to see clearly that dynamo must indeed be optimized for torch use-case. Supporting more general cases tends to run into endless corner-cases and caveats.

One direction that I see as viable to handle function calls which have graph breaks and `has_tensor_in_frame` is to stick with not inlining them, while installing a global `ContextManagerManager`, similar to the `CleanupManager` (which cleans up global variables). We can know which context managers are active at any given point, so that we can install their setup/teardown code on those functions and their fragments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94137
Approved by: https://github.com/yanboliang
2023-02-14 14:00:37 +00:00
73ee4964d3 Add new checks in CI system to verify the built linux pip wheel with cpu-cxx11-abi (#79409)
We added the linux pip wheel with cpu-cxx11-abi in pytorch/builder, see: https://github.com/pytorch/builder/pull/990 and https://github.com/pytorch/builder/pull/1023

The purpose of this PR is to add new checks in pytorch CI system to verify the linux pip wheel with cpu-cxx11-abi.

Co-authored-by: Zhu Hong <hong.zhu@intel.com>
Co-authored-by: Guo Yejun <yejun.guo@intel.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79409
Approved by: https://github.com/malfet
2023-02-14 12:59:03 +00:00
22e2fd554c OpInfo for aten.exponential, Add check for dtype, parameter in decomp ref (#92709)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92709
Approved by: https://github.com/lezcano
2023-02-14 10:11:07 +00:00
1dbaa5c290 Use decompositions for some fallbacks introduced in #94039 (#94206)
In some cases, implements required inductor primitives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94206
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-02-14 09:31:30 +00:00
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2 and 3 compatibility library [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future) and `torch._six`. We only support Python 3.8+ now. It's time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
39511697d4 [PT-D][BE] Update 2D parallelism API name and docs (#94771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94771
Approved by: https://github.com/wanchaol
2023-02-14 08:13:15 +00:00
53062e1fe4 inductor: fix size and stride comparison (#94481)
We met a case where `old.get_stride()` is a `tuple`: `(1, 16)` while `new.get_stride()` is a `list`: `[1, 16]`.
`old.get_stride() == new.get_stride()` returns `False` even though they're actually equal.
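
A quick illustration of the comparison pitfall being fixed:

```python
old_stride = (1, 16)   # tuple
new_stride = [1, 16]   # list

print(old_stride == new_stride)                # False, despite equal elements
print(tuple(old_stride) == tuple(new_stride))  # True once both are normalized
```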

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94481
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire, https://github.com/jansel
2023-02-14 07:14:20 +00:00
28ed0bdb37 Revert "[tp] additional doc fixes (#94786)"
This reverts commit 7522ca55f19e8646f3e5cb59d2673fb0b46696c7.

Reverted https://github.com/pytorch/pytorch/pull/94786 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but the doc failure looks related and they are also failing in trunk 7522ca55f1
2023-02-14 05:43:37 +00:00
bafc4e377b [vision hash update] update the pinned vision hash (#94784)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94784
Approved by: https://github.com/pytorchbot
2023-02-14 05:30:55 +00:00
5cd2b65816 [inductor] fix sympy.core.numbers.Expr (#94780)
Summary: Fix sympy.core.numbers.Expr, sympy.core has no module 'numbers'

Test Plan: sandcastle

Differential Revision: D43254644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94780
Approved by: https://github.com/bertmaher
2023-02-14 05:18:49 +00:00
7522ca55f1 [tp] additional doc fixes (#94786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94786
Approved by: https://github.com/fduwjj
2023-02-14 04:52:04 +00:00
1f06a71797 [MPS] Error out for square int64 input (#94766)
- add checks for whether macOS is greater than 13.2
- remove square from block list
- throw error messages if power int64 is called before macOS 13.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94766
Approved by: https://github.com/kulinseth
2023-02-14 04:45:41 +00:00
d567df9f36 [dynamo 3.11] remap dup/rotate to copy/swap (#93988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93988
Approved by: https://github.com/jansel, https://github.com/albanD, https://github.com/mlazos
2023-02-14 04:25:14 +00:00
751bab094a [dynamo 3.11] support new binary ops (#93987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93987
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/albanD
2023-02-14 04:25:14 +00:00
d4d13d99e4 [dynamo 3.11] support new jump opcodes (#93986)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93986
Approved by: https://github.com/jansel, https://github.com/albanD, https://github.com/malfet, https://github.com/voznesenskym
2023-02-14 04:25:14 +00:00
3faa636196 Clarify the instructions for setting up dev environment [skip ci] (#94155)
The `requirements.txt` file is in the PyTorch directory. The instructions to `clone` and `cd` into the PyTorch directory are in the later section under Get the PyTorch Source. So, the instructions as written give an error that requirements.txt is not found.
```ERROR: Could not open requirements file: .. No such file or directory: 'requirements.txt' ```

This PR clarifies the usage of the command.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94155
Approved by: https://github.com/malfet
2023-02-14 03:56:11 +00:00
055dc72dba [ONNX] Bump onnx to 1.13.1, onnxruntime to 1.14.0 (#94767)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94767
Approved by: https://github.com/abock
2023-02-14 03:53:05 +00:00
7e3f79914c Support functionalization for torch.map (#94558)
We restrict:
* Output of each map iteration aliasing the input
* In-place mutation on the list element or inputs given to the map function
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94558
Approved by: https://github.com/tugsbayasgalan
2023-02-14 02:40:38 +00:00
3ea59b68af [c10d] Enhance broadcastUniqueNCCLID error reporting (#94752)
When this error is hit, usually it is because rank 0 has hit an error
and crashed before setting the unique ID on rank 0. However, in many job
scheduling tools the rank 0 error is not clearly reported and user must look
for it, so add a small log reminding users to do so.

Differential Revision: [D43245190](https://our.internmc.facebook.com/intern/diff/D43245190/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D43245190/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94752
Approved by: https://github.com/H-Huang
2023-02-14 02:00:58 +00:00
ce474bc643 fix view + detach graph case for inductor (#94744)
fixes https://github.com/pytorch/pytorch/issues/94175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94744
Approved by: https://github.com/ezyang
2023-02-14 01:35:23 +00:00
9fb9219478 Make DDPOptimizer work with torch._dynamo.explain() (#94749)
GraphModules that were created during DDPOptimizer graph breaking
lacked `compile_subgraph_reason`, which caused an exception when
running .explain().

Now the reason is provided and users can use .explain() to find out
that DDPOptimizer is causing graph breaks.

Fixes #94579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94749
Approved by: https://github.com/voznesenskym
2023-02-14 01:33:47 +00:00
fb55f12cb0 [cpu][inductor] improve cpu vec implementations of cos & sin (#94577)
The current Torchinductor's `cos` & `sin` implementations will call `sleef` functions in `aten::Vec` which show worse performance than Aten's `cos` & `sin` implementations that invoke `MKL` functions. The reason is that the `sleef` algorithms sacrifice performance in order to have a higher precision. This PR changes Torchinductor's `cos` & `sin` implementations from the `sleef` functions with `1.0` ULP error bound to the ones with `3.5` ULP error bound.

**Performance data for eager v.s. inductor:**
(suite: huggingface)

| op | improved_ratio | speedup_old | RSD(3) | speedup_new | RSD(3) |
| -- | -- | -- | -- | -- | -- |
| cos | 62.12% | 0.653826147 | 4.48% | 1.059999006 | 3.38% |
| sin | 38.12% | 0.745482927 | 0.72% | 1.029642026 | 5.33% |

**Accuracy data for eager v.s. inductor:**
Each tol has been tested for 1000 times.
| error_bound | tol=1e-7 | tol=1e-8 |
| -- | -- | -- |
| 1.0 ULP | PASS | FAIL |
| 3.5 ULP | PASS | FAIL |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94577
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/Chillee, https://github.com/desertfire, https://github.com/jansel
2023-02-14 01:33:13 +00:00
cedb7e3d77 [MPS] Fix remainder op for integral dtypes (#94757)
Map remainder op to the same template as div (integral dtypes will be cast to float)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94757
Approved by: https://github.com/kulinseth
2023-02-14 01:06:49 +00:00
84a5aec8c6 [ONNX] Add bloom ops (#94761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94761
Approved by: https://github.com/justinchuby
2023-02-14 00:40:13 +00:00
5ed7c701a3 [ONNX] Remove the deprecated monkey patches to torch.Graph (#94747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94747
Approved by: https://github.com/BowenBao, https://github.com/Skylion007
2023-02-14 00:08:09 +00:00
92f3feabaa fix torch.var backward when n==correction (#94546)
Fixes #94184

This PR, as discussed in [comment ](https://github.com/pytorch/pytorch/issues/94184#issuecomment-1422128166), returns `x.grad` of the same shape as `x`, filled with `NaN`, when the gradient of `torch.var(unbiased=True)` is `NaN`. The gradient of the unbiased variance is `NaN` (undefined: division by zero in the denominator `N - 1`, where `N` is the number of samples) when `N` is 1 (i.e., there is only one sample: the product of the dims is 1, such as `[1]` or `[1,...,1]`).
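
A small repro sketch of the single-sample case described above (expected behavior after this change):

```python
import torch

x = torch.randn(1, requires_grad=True)
torch.var(x, unbiased=True).backward()  # variance of one sample: divides by N - 1 = 0
print(x.grad)                           # tensor([nan]) -- same shape as x, filled with NaN
```
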
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94546
Approved by: https://github.com/soulitzer
2023-02-13 23:38:38 +00:00
86240898de Improve profiling and stack traces for SymNode method calls (#94410)
This restructures the magic methods so that there is a stub `add` that calls the metaprogrammed `_add`. With this change, `SymNode.add` can now show up in stack traces, which is a huge benefit for profiling.
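
A schematic sketch (not the actual PyTorch code) of the stub-plus-metaprogrammed-method pattern described above:

```python
class SymNode:
    def add(self, other):
        # Thin, explicitly-defined stub: `SymNode.add` now appears in stack
        # traces and profiles instead of an anonymous generated wrapper.
        return self._add(other)

# `_add` (and `_mul`, `_lt`, ...) are still installed by metaprogramming, e.g.:
def _make_node_magic(name):
    def _impl(self, other):
        ...  # build the new symbolic expression for `name`
    setattr(SymNode, f"_{name}", _impl)

for op in ("add", "mul", "lt"):
    _make_node_magic(op)
```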

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94410
Approved by: https://github.com/Chillee
2023-02-13 23:36:21 +00:00
f1f26fe8ec Streamlining guard expect tests (#94404)
Changes:
* Add `simplified` kwarg to let you only render guards that are nontrivial (excludes duck sizing)
* Make a list of strings valid for sources, if you just have some variable names you want to bind to
* Add test helper `show_guards` using these facilities, switch a few tests to it

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94404
Approved by: https://github.com/Chillee
2023-02-13 23:36:21 +00:00
9d5fcd37a2 sym_max/sym_min introduce guard if hinted (#94400)
This patch started with only the change in `torch/_prims_common/__init__.py`. Unfortunately, this change by itself fails tests. The reason it fails tests is sym_max produces sympy.Max expression, which impedes our ability to actually reason symbolically about the resulting expressions. We much prefer to insert a guard on `l > 1`  and get a Sympy expression without Max in it, if we can. In the upcoming unbacked SymInts PR, we can't necessarily do this, but without unbacked SymInts, we always can.

To do this, we introduce `alternate_impl_if_hinted_methods`. The idea is that if all of the arguments into max/min have hints, we will just go ahead and introduce a guard and then return one argument or the other, depending on the result. This is done by rewrapping the SymNode into SymInt/SymFloat and then running builtins.min/max, but we also could have just manually done the guarding (see also https://github.com/pytorch/pytorch/pull/94365 )
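
A toy model (heavily simplified; not the real SymNode implementation) of the "alternate impl if hinted" behavior described above:

```python
import sympy

class Node:
    def __init__(self, expr, hint=None):
        self.expr, self.hint = expr, hint

def sym_max(a, b, guards):
    if a.hint is not None and b.hint is not None:
        # Both sides are hinted: record a guard on the comparison and return one
        # of the existing nodes -- no sympy.Max enters the expression.
        if a.hint >= b.hint:
            guards.append(a.expr >= b.expr)
            return a
        guards.append(a.expr < b.expr)
        return b
    # Unhinted (e.g. unbacked) case: fall back to a symbolic Max.
    return Node(sympy.Max(a.expr, b.expr))

guards = []
l = Node(sympy.Symbol("l"), hint=5)
one = Node(sympy.Integer(1), hint=1)
print(sym_max(l, one, guards).expr, guards)  # l [l >= 1]
```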

However, a very subtle problem emerges when you do this. When we do builtins.min/max, we return the argument SymNode directly, without actually allocating a fresh SymNode. Suppose we do a min-max with a constant (as is the case in `sym_max(l, 1)`). This means that we can return a constant SymNode as the result of the computation. Constant SymNodes get transformed into regular integers, which then subsequently trigger the assert at https://github.com/pytorch/pytorch/pull/94400/files#diff-03557db7303b8540f095b4f0d9cd2280e1f42f534f67d8695f756ec6c02d3ec7L620

After thinking about this a bit, I think the assert is wrong. It should be OK for SymNode methods to return constants. The reason the assert was originally added was that ProxyTensorMode cannot trace a constant return. But this is fine: if you return a constant, no tracing is necessary; you know you have enough guards that it is guaranteed to be a constant no matter what the input arguments are, so you can burn it in. You might also be wondering why a change to SymNode method affects the assert from the dispatch mode dispatch: the call stack typically looks like SymNode.binary_magic_impl -> SymProxyTensorMode -> SymNode.binary_magic_impl again; so you hit the binary_magic_impl twice!

No new tests, the use of sym_max breaks preexisting tests and then the rest of the PR makes the tests pass again.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94400
Approved by: https://github.com/Chillee
2023-02-13 23:36:21 +00:00
4acdc446b2 [MPS] Fix batch norm for NHWC (#94760)
Fixes `test_modules.py` batch norm NHWC testcases:
- `test_memory_format_nn_BatchNorm2d_eval_mode_mps_float32`
- `test_memory_format_nn_BatchNorm2d_eval_mode_mps_float32`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94760
Approved by: https://github.com/kulinseth
2023-02-13 23:31:10 +00:00
840fb74ec8 86990 range mps support (#91075)
Fixes #86990

- Added range_mps_out to RangeFactories.mm
- Updated native_functions.yaml
- Added tests in test_mps.py

I did observe that despite [the documentation for torch.range](https://pytorch.org/docs/stable/generated/torch.range.html), the existing implementations do not adjust their return type based off the arguments passed to them. The MPS implementation provided here behaves the same way as the existing CPU and CUDA implementations in this regard, hence the conversion to float32 in the test cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91075
Approved by: https://github.com/kulinseth, https://github.com/DenisVieriu97
2023-02-13 23:19:10 +00:00
f2aee8b8d5 small fixes for mlir backend (#94717)
Fixes for skipped tests with mlir triton backend (will unskip once #94249 lands)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94717
Approved by: https://github.com/malfet, https://github.com/atalman
2023-02-13 22:42:53 +00:00
a0d1dbc446 Fix pytest arguments when --save-xml is not passed (#94589)
The expression `argv + [f'--junit-xml-reruns={test_report_path}'] if TEST_SAVE_XML else []` evaluates to the empty list when `TEST_SAVE_XML` is false and would need parentheses.
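
A quick illustration of the precedence issue (values are made up):

```python
argv = ["test_foo.py"]
extra = ["--junit-xml-reruns=report.xml"]
TEST_SAVE_XML = False

# The conditional expression binds looser than `+`, so the whole sum is the
# "if" branch and the "else" branch is just [] -- argv gets dropped entirely.
broken = argv + extra if TEST_SAVE_XML else []
fixed = argv + (extra if TEST_SAVE_XML else [])

print(broken)  # []
print(fixed)   # ['test_foo.py']
```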

Instead simplify the code by appending the argument when required directly where `test_report_path` is set.
Note that `.append()` may not be used as that would modify `argv` and in turn `UNITTEST_ARGS` which might have undesired side effects.

Without this patch, `pytest.main()` would be called with no arguments, which tries to discover all tests in the current working directory and ultimately leads to (many) failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94589
Approved by: https://github.com/clee2000, https://github.com/Neilblaze
2023-02-13 22:19:51 +00:00
e743d316e2 Revert "fix some MKL detection issues of CMake (#94402)"
This reverts commit 7ef46d40a1208a39d785b1ad772c10d4c6e0af0d.

Reverted https://github.com/pytorch/pytorch/pull/94402 on behalf of https://github.com/malfet due to Broke binary builds, see https://github.com/pytorch/pytorch/issues/94751#issuecomment-1428562517
2023-02-13 22:09:40 +00:00
2db12e3844 [tp] minor update to TP docs (#94748)
minor update to TP docs for beta release
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94748
Approved by: https://github.com/fduwjj
2023-02-13 21:54:19 +00:00
8b3e3f937d Update documentation init_process_group optional backend (#94543)
Update documentation for `init_process_group()` to mention the `backend` argument is optional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94543
Approved by: https://github.com/kwen2501
2023-02-13 21:45:38 +00:00
25820b69f6 Revert "[BE] Use data() method when possible as it's safer and more readable (#92755)"
This reverts commit 582485bf0f880de75c7eb36a466562f77e6c64db.

Reverted https://github.com/pytorch/pytorch/pull/92755 on behalf of https://github.com/ezyang due to could have forward fixed but not going to
2023-02-13 21:44:30 +00:00
5ee230face [FSDP][1/N] Refactor module materialization (#94196)
**Overview**
This refactors module materialization (i.e. meta device or `torchdistX` deferred initialization) to compute the parameter and buffer names as needed instead of pre-computing them. These are needed to reacquire references to the states (e.g. `module.get_parameter(param_name)`) after materialization since the materialization may create new variables.

This refactor simplifies `_get_fully_sharded_module_to_states()` (the core function for "pseudo auto wrapping") to better enable lowest common ancestor (LCA) module computation for shared parameters, for which tracking parameter and buffer names may complicate the already non-obvious implementation.

**Discussion**
The tradeoff is a worst case quadratic traversal over modules if materializing all of them. However, since (1) the number of modules is relatively small, (2) the computation per module in the quadratic traversal is negligible, (3) this runs only once per training session, and (4) module materialization targets truly large models, I think this tradeoff is tolerable.

**For Reviewers**
- `_init_param_handle_from_module()` initializes _one_ `FlatParamHandle` from a fully sharded module and represents the module wrapper code path. For this code path, there is no need to reacquire references to the parameters/buffers for now since the managed parameters are only computed after materialization. This works because the managed parameters have a simple definition: any parameter in the local root module's tree excluding those already marked as flattened by FSDP. Similarly, FSDP marks buffers to indicate that they have already been processed (synced if `sync_module_states`).
- `_init_param_handles_from_module()` initializes _all_ `FlatParamHandle`s from a fully sharded module and represents the composable code path. For this code path, we must reacquire references to parameters/buffers because each logical wrapping is specified as a list of parameters/buffers to group together by those variables and because materialization may create new variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94196
Approved by: https://github.com/rohan-varma
2023-02-13 21:43:00 +00:00
6cef200af9 [ONNX] Wrap symbolic method calls with graph context (#94746)
This should address #93370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94746
Approved by: https://github.com/BowenBao
2023-02-13 21:29:39 +00:00
a6a433aecd Add stack emptiness checks inside interpreter.cpp (#94298)
Hi!

I've been fuzzing different pytorch modules, and found a few crashes inside one of them.

Specifically, I'm talking about the module for interpreting JIT code and a function called `InterpreterState::run()`. Running this function with the provided crash file results in a crash, which occurs while calling `dim()` on a `stack` with 0 elements ([line-686](abc54f9314/torch/csrc/jit/runtime/interpreter.cpp (L686))). The crash itself occurs later, when std::move is called with an incorrect value of type `IValue`.

The second crash is similar and occurs on [line 328](abc54f9314/torch/csrc/jit/runtime/interpreter.cpp (LL328C15-L328C48)), where `reg(inst.X + i - 1) = pop(stack);` is executed. The error here is the same: `Stack stack` might not contain enough elements.

The third crash occurs on [line 681](abc54f9314/torch/csrc/jit/runtime/interpreter.cpp (L681)). The problem here is the same as for the previous crashes: there are not enough elements in the stack.

In addition to these places, there are many others (in the same function) where bounds checking is also missing. I am not sure what the best way to fix these problems is; however, I suggest adding a bounds check inside each of these case statements.

All tests were performed on this pytorch version: [abc54f93145830b502400faa92bec86e05422fbd](abc54f9314)

### How to reproduce

1. To reproduce the crash, use provided docker: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch)

2. Build the container: `docker build -t oss-sydr-fuzz-pytorch-reproduce .`

3. Copy these crash files to the current directory:

    - [crash-4f18c5128c9a5a94343fcbbd543d7d6b02964471.zip](https://github.com/pytorch/pytorch/files/10674143/crash-4f18c5128c9a5a94343fcbbd543d7d6b02964471.zip)
    - [crash-55384dd7c9689ed7b94ac6697cc43db4e0dd905a.zip](https://github.com/pytorch/pytorch/files/10674147/crash-55384dd7c9689ed7b94ac6697cc43db4e0dd905a.zip)
    - [crash-06b6125d01c5f91fae112a1aa7dcc76d71b66576.zip](https://github.com/pytorch/pytorch/files/10674152/crash-06b6125d01c5f91fae112a1aa7dcc76d71b66576.zip)

4. Run the container: ``docker run --privileged --network host -v `pwd`:/homedir --rm -it oss-sydr-fuzz-pytorch-reproduce /bin/bash``

5. And execute the binary: `/jit_differential_fuzz /homedir/crash-4f18c5128c9a5a94343fcbbd543d7d6b02964471`

After execution completes you will see this stacktrace:

```asan
=36==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6060001657f8 at pc 0x00000060bc91 bp 0x7fff00b33380 sp 0x7fff00b33378
READ of size 4 at 0x6060001657f8 thread T0
    #0 0x60bc90 in c10::IValue::IValue(c10::IValue&&) /pytorch_fuzz/torch/include/ATen/core/ivalue.h:214:43
    #1 0xc20e7cd in torch::jit::pop(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/aten/src/ATen/core/stack.h:102:12
    #2 0xc20e7cd in torch::jit::dim(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/torch/csrc/jit/mobile/promoted_prim_ops.cpp:119:20
    #3 0xc893060 in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/torch/csrc/jit/runtime/interpreter.cpp:686:13
    #4 0xc85c47b in torch::jit::InterpreterStateImpl::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/torch/csrc/jit/runtime/interpreter.cpp:1010:9
    #5 0x600598 in runGraph(std::shared_ptr<torch::jit::Graph>, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) /jit_differential_fuzz.cc:66:38
    #6 0x601d99 in LLVMFuzzerTestOneInput /jit_differential_fuzz.cc:107:25
    #7 0x52ccf1 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
    #8 0x516c0c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
    #9 0x51c95b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
    #10 0x545ef2 in main /llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
    #11 0x7f9ec069a082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
    #12 0x51152d in _start (/jit_differential_fuzz+0x51152d)

0x6060001657f8 is located 8 bytes to the left of 64-byte region [0x606000165800,0x606000165840)
allocated by thread T0 here:
    #0 0x5fd42d in operator new(unsigned long) /llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
    #1 0xa16ab5 in void std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_realloc_insert<c10::IValue&>(__gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue, std::allocator<c10::IValue> > >, c10::IValue&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:440:33
    #2 0xa168f1 in c10::IValue& std::vector<c10::IValue, std::allocator<c10::IValue> >::emplace_back<c10::IValue&>(c10::IValue&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:121:4
    #3 0xc89b53c in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/torch/csrc/jit/runtime/interpreter.cpp:344:19
    #4 0xc85c47b in torch::jit::InterpreterStateImpl::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/torch/csrc/jit/runtime/interpreter.cpp:1010:9
    #5 0x600598 in runGraph(std::shared_ptr<torch::jit::Graph>, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) /jit_differential_fuzz.cc:66:38
    #6 0x601d99 in LLVMFuzzerTestOneInput /jit_differential_fuzz.cc:107:25
    #7 0x52ccf1 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
    #8 0x516c0c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
    #9 0x51c95b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
    #10 0x545ef2 in main /llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
    #11 0x7f9ec069a082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)

SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch_fuzz/torch/include/ATen/core/ivalue.h:214:43 in c10::IValue::IValue(c10::IValue&&)
Shadow bytes around the buggy address:
  0x0c0c80024aa0: fd fd fd fd fd fd fd fa fa fa fa fa 00 00 00 00
  0x0c0c80024ab0: 00 00 00 fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c0c80024ac0: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa
  0x0c0c80024ad0: fd fd fd fd fd fd fd fd fa fa fa fa fd fd fd fd
  0x0c0c80024ae0: fd fd fd fd fa fa fa fa 00 00 00 00 00 00 00 00
=>0x0c0c80024af0: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa[fa]
  0x0c0c80024b00: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
  0x0c0c80024b10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0c80024b20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0c80024b30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0c80024b40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==36==ABORTING
```

6. Executing the remaining crashes gives similar crash reports
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94298
Approved by: https://github.com/davidberard98
2023-02-13 21:00:00 +00:00
c0e7077674 Fix link in docs (#94686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94686
Approved by: https://github.com/kit1980
2023-02-13 20:42:24 +00:00
d82c2b14c7 jit trace will fail for parameter check if it contains param whose kind is _ParameterKind.VAR_KEYWORD (#94032)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94032
Approved by: https://github.com/qihqi, https://github.com/davidberard98
2023-02-13 20:33:30 +00:00
4d6a4401f8 Raise warning if torch.compile options change without reset (#94680)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94680
Approved by: https://github.com/wconstab, https://github.com/malfet
2023-02-13 20:21:04 +00:00
7c3fc2c7f0 Revert "Issue-88098: extract utils from check labels (#94597)"
This reverts commit 2c76838d7ff96cc7aa3a30cae54fded70e0bccc5.

Reverted https://github.com/pytorch/pytorch/pull/94597 on behalf of https://github.com/jeanschmidt due to internal breakages https://fburl.com/sandcastle/3ukij9xp
2023-02-13 20:19:50 +00:00
1f7448eeda Add missing super().setUp() to test_freezing and test_tensorboard (#94553)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94553
Approved by: https://github.com/kit1980, https://github.com/huydhn
2023-02-13 19:56:12 +00:00
bdf9963e57 Cache linter S3 dependencies (#94745)
Fixes https://github.com/pytorch/pytorch/issues/94716
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94745
Approved by: https://github.com/seemethere
2023-02-13 19:44:23 +00:00
36dfbb08f3 Revert "Update Cutlass to v2.11 (#94188)"
This reverts commit a0f9abdcb651bb948d2d6e9f7d3ce947e2c53659.

Reverted https://github.com/pytorch/pytorch/pull/94188 on behalf of https://github.com/ezyang due to bouncing this to derisk branch cut
2023-02-13 19:03:36 +00:00
f70ba23415 [inductor] enable test_upsample_cat_conv_dynamic_shapes (#94715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94715
Approved by: https://github.com/ezyang
2023-02-13 18:29:21 +00:00
0444a6c90a [BE] Remove deprecated logging warn method (#94708)
Swaps all logging.warn calls to logging.warning since the former is deprecated and even raises a deprecation warning now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94708
Approved by: https://github.com/ezyang
2023-02-13 18:24:52 +00:00
ae7a628b03 Dynamic shapes CI updates (#94690)
Data from https://github.com/pytorch/pytorch/pull/94683

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94690
Approved by: https://github.com/cpuhrsch
2023-02-13 18:20:12 +00:00
e355a5c1d6 inductor: fix the CPP issue of flag_to_float (#94730)
Fix https://github.com/pytorch/pytorch/issues/94725.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94730
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/jansel
2023-02-13 18:13:28 +00:00
b57e6fdb50 [MPS] Enable Memory Leak Detection for test_mps.py (#94646)
- To check for Memory Leaks in `test_mps.py`, set the env-variable `PYTORCH_TEST_MPS_MEM_LEAK_CHECK=1` when running test_mps.py (used CUDA code as reference).
- Added support for the following new python interfaces in MPS module:
`torch.mps.[empty_cache(), set_per_process_memory_fraction(), current_allocated_memory(), driver_allocated_memory()]`
- Renamed `_is_mps_on_macos_13_or_newer()` to `_mps_is_on_macos_13_or_newer()`, and `_is_mps_available()` to `_mps_is_available()` to be consistent in naming with prefix `_mps`.
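
A small usage sketch of the new interfaces listed above (requires an MPS-capable macOS build):

```python
import torch

if torch.backends.mps.is_available():
    x = torch.randn(1024, 1024, device="mps")
    print(torch.mps.current_allocated_memory())  # bytes currently held by tensors
    print(torch.mps.driver_allocated_memory())   # total bytes allocated by the driver
    del x
    torch.mps.empty_cache()                      # release cached blocks
```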

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94646
Approved by: https://github.com/malfet
2023-02-13 17:56:24 +00:00
ceb0f1576b turn functionalization on in aot_autograd inference (#92857)
still waiting for CI fallout
fixes #90759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92857
Approved by: https://github.com/ezyang
2023-02-13 17:48:00 +00:00
5ce1fad711 Add rnn.unpad_sequence and rnn.unpack_sequence to documentation (#94316)
Fix #76064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94316
Approved by: https://github.com/jbschlosser
2023-02-13 17:47:10 +00:00
701412a4ec Update gradcheck docs to mention non-differentiability (#94618)
Fixes https://github.com/pytorch/pytorch/issues/94204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94618
Approved by: https://github.com/albanD
2023-02-13 17:14:14 +00:00
a064ce1939 Pin setup-buildx-action version. Fix Docker build (#94734)
This pins setup-buildx-action version.
Our Docker builds were fixed by https://github.com/pytorch/pytorch/pull/92702 on Jan 25-26.
However, a setup-buildx-action update on Jan 27 broke these builds again.
This PR pins the version of setup-buildx-action and fixes Docker builds for nightly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94734
Approved by: https://github.com/jeanschmidt
2023-02-13 16:58:44 +00:00
216f88d084 ao migration: remove package test as this behavior is tested by other things (#94422)
Summary:

We have tests testing package level migration correctness for torch AO migration.
After reading the code, I noticed that these tests are not testing anything
additional on top of the function level tests we already have.

An upcoming user warning PR will break this test, and it doesn't seem worth fixing.
As long as the function level tests pass, 100% of user functionality will
be tested.  Removing this in a separate PR to keep PRs small.

Test plan:

```
python test/test_quantization.py -k AOMigration
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94422
Approved by: https://github.com/jcaip
2023-02-13 16:33:40 +00:00
f6adbf4d97 ao migration: delete unused test class (#94420)
Summary:

This test case is dead code.  A newer version of this code
exists in `test/quantization/ao_migration/test_quantization.py`. I
think this class must have been mistakenly left during a refactor.
Deleting it.

Test plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94420
Approved by: https://github.com/jerryzh168
2023-02-13 16:33:40 +00:00
2acac8a83a Logcumsumexp for CUDA (build-time optimized) (#94310)
Hopefully fixes #89205.
This is another version of #90847 where it was reverted because it increases the compile-time significantly.
From my discussion with @ngimel in https://github.com/pytorch/pytorch/pull/93153#issuecomment-1409051528, it seems the option of jiterator would be very tricky if not impossible.
So what I did was to optimize the compile-time in my computer.

To optimize the build time, first I compile the pytorch as a whole, then only change the `LogcumsumexpKernel.cu` file to see how it changes the compile time.
Here are my results for the compilation time of only the `LogcumsumexpKernel.cu` file in my computer:

- Original version (without any complex implementations): 56s (about 1 minute)
- The previous PR (#90847): 13m 57s (about 14 minutes)
- This PR: 3m 35s (about 3.5 minutes)

If the previous PR increased the build time by 30 mins on PyTorch's machines, then this PR reduces the build-time increase to about 6 mins. Hopefully this is an acceptable level of build-time increase.

What I did (sorted from most to least significant build-time reduction):

- Substituting `log(x)` with `log1p(x - 1)`. This is applied in the infinite case, so we don't really care about precision (a quick numerical check of this identity follows below).
- Implementing complex exponential manually
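
A quick numerical check of that substitution:

```python
import math

# log(x) == log1p(x - 1) since log1p(y) = log(1 + y); the rewrite only changes
# which special-function path gets instantiated, not the math.
for x in (1e-6, 0.5, 1.0, 3.0, 1e12):
    assert math.isclose(math.log(x), math.log1p(x - 1), rel_tol=1e-9)
```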

tag: @malfet, @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94310
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-02-13 16:00:52 +00:00
4869929f32 Update Triton hash (#94249)
That includes MLIR + latest packaging changes (that also download ptxas from CUDA-12)
Tweak CI to install gcc-9 to build triton

Disable a few tests to make everything be correct

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94249
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/weiwangmeta
2023-02-13 13:17:36 +00:00
e61d5b9588 Revert "Dynamo Export use fake tensor (#94276)"
This reverts commit 54fa9801868ae71565b3b237bc2bbcce90e42017.

Reverted https://github.com/pytorch/pytorch/pull/94276 on behalf of https://github.com/jeanschmidt due to break several internal build/test jobs: https://fburl.com/phabricator/1tik7ggb
2023-02-13 09:36:41 +00:00
641dc0b844 Revert "[quant] Add quantize and dequantize operators to decomposition table (#93312)"
This reverts commit 782e4f5c02abaf5b9cdba4eaa827bc70a310bca8.

Reverted https://github.com/pytorch/pytorch/pull/93312 on behalf of https://github.com/jeanschmidt due to this commits breaks internal builds: https://fburl.com/sandcastle/dw0rqcbv
2023-02-13 09:20:37 +00:00
2628901033 [Executorch][Quant] Add Choose_qparams_symmetric (#94685)
Summary: needed for symmetric dynamic quant flow

Test Plan: todo

Reviewed By: jerryzh168

Differential Revision: D43134117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94685
Approved by: https://github.com/larryliu0820
2023-02-13 07:27:48 +00:00
ab261ff514 Tweak config for mode=max-autotune/reduce-overhead (#94659)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94659
Approved by: https://github.com/Chillee
2023-02-13 04:32:25 +00:00
e7e51b3a5c Fix NVML visible device parsing (#92315)
`CUDA_VISIBLE_DEVICES` can contain either ordinals or UUIDs. Extend the logic to be able to parse it by UUID.

Added a unit test to validate that the parser and matcher behavior matches that of the 525.60.13 driver.

Skip MIG- device parsing

Fixes https://github.com/pytorch/pytorch/issues/90543

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92315
Approved by: https://github.com/ngimel
2023-02-13 04:25:04 +00:00
6fadd5e94a Checkout torchbench with only needed models (#94578)
Addresses https://github.com/pytorch/pytorch/pull/93395#issuecomment-1414231011. The perf smoke test is supposed to take around one minute, but the torchbench checkout process is taking more than 15 minutes. This PR explores a way to check out torchbench with only the needed models, which are later used for the perf smoke test and the memory compression ratio check.

Torchbench installation supports "python install.py models model1 model2 model3" to install just model1, model2, and model3; not providing "models model1 model2 model3" installs all models by default.

Before this PR, inductor job takes about 27 minutes (21 minutes spent in testing phase) https://github.com/pytorch/pytorch/actions/runs/4149154553/jobs/7178024253
After this PR, inductor job takes about 19 minutes (12 minutes spent in testing phase), pytorch checkout and docker image pull takes about 5 - 6 minutes total.  https://github.com/pytorch/pytorch/actions/runs/4149155814/jobs/7178735494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94578
Approved by: https://github.com/orionr, https://github.com/malfet, https://github.com/desertfire
2023-02-13 04:02:18 +00:00
18587cb31f [MPS] Add sort and argSort Op. (#94697)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94697
Approved by: https://github.com/DenisVieriu97
2023-02-13 01:03:22 +00:00
046e88a291 [BE] [3/3] Rewrite super() calls in test (#94592)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-12 22:20:53 +00:00
bdd8f518d7 [MPS] Add Python Module Bindings for the MPS backend (#94417)
- This PR is a prerequisite for the upcoming Memory Leak Detection PR.
- Enable global manual seeding via `torch.manual_seed()` + test case
- Add `torch.mps.synchronize()` to wait for MPS stream to finish + test case
- Enable the following python interfaces for MPS:
  `torch.mps.[get_rng_state(), set_rng_state(), synchronize(), manual_seed(), seed()]`
- Added some test cases in test_mps.py
- Added `mps.rst` to document the `torch.mps` module.
- Fixed the failure with `test_public_bindings.py`
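
A minimal usage sketch of the interfaces listed above (a sketch only; it assumes an MPS-capable machine and that the bindings behave as described in this PR):

```python
import torch

if torch.backends.mps.is_available():
    torch.mps.manual_seed(0)           # global seeding also flows through torch.manual_seed()
    state = torch.mps.get_rng_state()  # capture the MPS RNG state
    x = torch.randn(4, device="mps")
    torch.mps.set_rng_state(state)     # restore the captured state
    y = torch.randn(4, device="mps")   # expected to reproduce x
    torch.mps.synchronize()            # wait for the MPS stream to finish
```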

Description of new files added:
- `torch/csrc/mps/Module.cpp`: implements `torch._C` module functions for `torch.mps` and `torch.backends.mps`.
- `torch/mps/__init__.py`: implements Python bindings for `torch.mps` module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94417
Approved by: https://github.com/albanD
2023-02-12 21:22:30 +00:00
a0f9abdcb6 Update Cutlass to v2.11 (#94188)
Now that we are on CUDA 11+ exclusively, we can update Nvidia's Cutlass to the next version. We also had to remove the CUDA build flag "-D__CUDA_NO_HALF_CONVERSIONS__", since Cutlass no longer builds with it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94188
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-02-12 20:45:03 +00:00
cyy
7ef46d40a1 fix some MKL detection issues of CMake (#94402)
This PR rewrites some logic of FindMKL.cmake and FindOpenMP.cmake to better detect the corresponding libraries and to fix the infinite recursion between them. It also contains some other fixes that do not change the CMake interface.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94402
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-12 19:19:10 +00:00
a8fdfb4ba8 [inductor] Persistent reductions (#92267)
This one may need to wait for the new MLIR Triton to land as it triggers some Triton crashes.

Before:
```
$ pytest test/inductor/test_torchinductor.py -vsk test_softmax_one_kernel_loop_cuda
...
@reduction(
    size_hints=[16, 32],
    reduction_hint=ReductionHint.INNER,
    filename=__file__,
    meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32', 3: 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': [], 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2, 3), equal_to_1=())]}
)
@triton.jit
def triton_(in_ptr0, out_ptr2, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr):
    xnumel = 16
    rnumel = 32
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rbase = tl.arange(0, RBLOCK)[None, :]
    x0 = xindex
    _tmp1 = tl.zeros([XBLOCK, RBLOCK], tl.float32) + float("-inf")
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r1 = rindex
        tmp0 = tl.load(in_ptr0 + (r1 + (32*x0)), rmask & xmask, eviction_policy='evict_last')
        _tmp1 = tl.where(xmask & rmask & (_tmp1 < tmp0), tmp0, _tmp1)
    tmp1 = tl.max(_tmp1, 1)[:, None]
    _tmp5 = tl.zeros([XBLOCK, RBLOCK], tl.float32) + 0
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r1 = rindex
        tmp2 = tl.load(in_ptr0 + (r1 + (32*x0)), rmask & xmask, eviction_policy='evict_last')
        tmp3 = tmp2 - tmp1
        tmp4 = tl.exp(tmp3)
        _tmp5 = tl.where(xmask & rmask, _tmp5 + tmp4, _tmp5)
    tmp5 = tl.sum(_tmp5, 1)[:, None]
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r1 = rindex
        tmp6 = tl.load(in_ptr0 + (r1 + (32*x0)), rmask & xmask, eviction_policy='evict_last')
        tmp7 = tmp6 - tmp1
        tmp8 = tl.exp(tmp7)
        tmp9 = tmp8 / tmp5
        tl.store(out_ptr2 + (r1 + (32*x0) + tl.zeros([XBLOCK, RBLOCK], tl.int32)), tmp9, rmask & xmask)
```

After
```
$ pytest test/inductor/test_torchinductor.py -vsk test_softmax_one_kernel_persist_cuda
...
@persistent_reduction(
    size_hints=[16, 32],
    reduction_hint=ReductionHint.INNER,
    filename=__file__,
    meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32', 3: 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': [], 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2, 3), equal_to_1=())]}
)
@triton.jit
def triton_(in_ptr0, out_ptr2, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr):
    xnumel = 16
    rnumel = 32
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[None, :]
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (32*x0)), rmask & xmask)
    tmp2 = tl.where(xmask & rmask, tmp0, float("-inf"))
    tmp3 = tl.max(tmp2, 1)[:, None]
    tmp4 = tmp0 - tmp3
    tmp5 = tl.exp(tmp4)
    tmp7 = tl.where(xmask & rmask, tmp5, 0)
    tmp8 = tl.sum(tmp7, 1)[:, None]
    tmp9 = tmp5 / tmp8
    tl.store(out_ptr2 + (r1 + (32*x0) + tl.zeros([XBLOCK, RBLOCK], tl.int32)), tmp9, rmask & xmask)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92267
Approved by: https://github.com/Chillee
2023-02-12 17:39:25 +00:00
eb81e7ec22 [FSDP] Avoid printing incorrect warning for _get_param_to_fqns (#94494)
There exists a hack for `_get_param_to_fqns` and `_apply_to_modules`. The condition for the hack's warning is incorrect and results in overwhelming messages for users. This PR fixes the issue.

The original hack is not removed. It will be removed once the support of DMP + FSDP is deprecated.

Differential Revision: [D43135611](https://our.internmc.facebook.com/intern/diff/D43135611/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94494
Approved by: https://github.com/rohan-varma
2023-02-12 17:09:30 +00:00
963d8f547e [FSDP][state_dict] Return tensors instead of FlatParameters to avoid pickling errors (#94637)
After https://github.com/pytorch/pytorch/pull/88913, user-defined parameter states will be pickled. For a FlatParameter, this means `_local_shard` will also be pickled. Since state_dict and load_state_dict only require the tensor, returning the full FlatParameter does not give us any extra benefit. This PR changes the behavior to simply return a view of the FlatParameter.

Differential Revision: [D43205127](https://our.internmc.facebook.com/intern/diff/D43205127/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94637
Approved by: https://github.com/rohan-varma
2023-02-12 16:04:17 +00:00
2c76838d7f Issue-88098: extract utils from check labels (#94597)
Fixes #88098

This is a mirror of the same PR (https://github.com/Goldspear/pytorch/pull/2) that has been reviewed in my fork (since it is a stacked PR).

======================
## Context

This is the 2nd of the 3 PRs to address issue-88098.

## What Changed
1. Extract comment related utils from trymerge.py to github_utils.py
2. Extract label related utils from trymerge.py and check_labels.py to label_utils.py

## Tests
* pytorch-dummy repo [trymerge run ](https://github.com/Goldspear/pytorch-dummy/actions/runs/4118944174)merged the test PR [OK](https://github.com/Goldspear/pytorch-dummy/pull/2).

## Note to Reviewers
Due to the higher degree of complexity involved in extracting the GitHubPR class, it's worth having a separate issue to handle that part of the refactoring. This PR only focuses on refactoring where necessary to ship the functional diff.

* 1st PR: https://github.com/pytorch/pytorch/pull/94179
* 2nd PR: this one
* 3rd PR: https://github.com/Goldspear/pytorch/pull/3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94597
Approved by: https://github.com/ZainRizvi
2023-02-12 12:18:53 +00:00
d04fd6b808 inductor: fix customer op _convolution_pointwise_.binary functional error at AOTAutograd (#94581)
This is another try (the first was https://github.com/pytorch/pytorch/pull/94172) to fix the warning message when running the inductor CPU path:

```
l.  Known situations this can occur are inference mode only compilation involving resize_ or prims (!schema.hasAnyAliasInfo() INTERNAL ASSERT FAILED); if your situation looks different please file a bug to PyTorch.
Traceback (most recent call last):
  File "/home/xiaobing/pytorch-offical/torch/_functorch/aot_autograd.py", line 1377, in aot_wrapper_dedupe
    fw_metadata, _out = run_functionalized_fw_and_collect_metadata(flat_fn)(
  File "/home/xiaobing/pytorch-offical/torch/_functorch/aot_autograd.py", line 578, in inner
    flat_f_outs = f(*flat_f_args)
  File "/home/xiaobing/pytorch-offical/torch/_functorch/aot_autograd.py", line 2455, in functional_call
    out = Interpreter(mod).run(*args[params_len:], **kwargs)
  File "/home/xiaobing/pytorch-offical/torch/fx/interpreter.py", line 136, in run
    self.env[node] = self.run_node(node)
  File "/home/xiaobing/pytorch-offical/torch/fx/interpreter.py", line 177, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
  File "/home/xiaobing/pytorch-offical/torch/fx/interpreter.py", line 294, in call_module
    return submod(*args, **kwargs)
  File "/home/xiaobing/pytorch-offical/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xiaobing/pytorch-offical/torch/_inductor/mkldnn.py", line 344, in forward
    return self._conv_forward(input, other, self.weight, self.bias)
  File "/home/xiaobing/pytorch-offical/torch/_inductor/mkldnn.py", line 327, in _conv_forward
    return torch.ops.mkldnn._convolution_pointwise_(
  File "/home/xiaobing/pytorch-offical/torch/_ops.py", line 499, in __call__
    return self._op(*args, **kwargs or {})
  File "/home/xiaobing/pytorch-offical/torch/_inductor/overrides.py", line 38, in __torch_function__
    return func(*args, **kwargs)
  File "/home/xiaobing/pytorch-offical/torch/_ops.py", line 499, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: !schema.hasAnyAliasInfo() INTERNAL ASSERT FAILED at "/home/xiaobing/pytorch-offical/aten/src/ATen/FunctionalizeFallbackKernel.cpp":32, please report a bug to PyTorch. mutating and aliasing ops should all have codegen'd kernels

While executing %self_layer2_0_downsample_0 : [#users=2] = call_module[target=self_layer2_0_downsample_0](args = (%self_layer1_1_conv2, %self_layer2_0_conv2), kwargs = {})
Original traceback:
  File "/home/xiaobing/vision/torchvision/models/resnet.py", line 100, in forward
    identity = self.downsample(x)
 |   File "/home/xiaobing/vision/torchvision/models/resnet.py", line 274, in _forward_impl
    x = self.layer2(x)
 |   File "/home/xiaobing/vision/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94581
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-12 10:48:01 +00:00
fe0c7fbcf8 [MPS] Add repeat_interleave to MPS (#88649)
Fixes #87219

Implements the new ``repeat_interleave`` function in ``aten/src/ATen/native/mps/operations/Repeat.mm``
Adds it to ``aten/src/ATen/native/native_functions.yaml``
Adds a new test ``test_repeat_interleave`` to ``test/test_mps.py``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88649
Approved by: https://github.com/kulinseth
2023-02-12 08:43:55 +00:00
b794fd19c5 [MPS] Add scatter gather kernels (support up to 5 dimensions) (#94663)
Add scatter gather kernels (support up to 5 dimensions)
- Fixes int64 issues for `mH`, `mT`, `T`, `H` on Monterey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94663
Approved by: https://github.com/kulinseth
2023-02-12 08:17:26 +00:00
e3c4cea668 [functorch] Add support on CUDA keys for control flow ops. (#94465)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94465
Approved by: https://github.com/tugsbayasgalan
2023-02-12 06:45:53 +00:00
989fb7c921 [vision hash update] update the pinned vision hash (#94557)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94557
Approved by: https://github.com/pytorchbot
2023-02-12 05:35:13 +00:00
67d9790985 [BE] Apply almost all remaining flake8-comprehension checks (#94676)
Applies the remaining flake8-comprehension fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, more performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving them into just the set call.
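
A couple of examples of the kinds of rewrites this check enforces (the variable names are illustrative, not taken from the changed files):

```python
words = ["a", "b", "a"]

unique = set(w for w in words)         # before: a useless generator
unique = set(words)                    # after: the generator adds nothing

lengths = list(len(w) for w in words)  # before: generator wrapped in list()
lengths = [len(w) for w in words]      # after: a plain list comprehension
```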

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
2023-02-12 01:01:25 +00:00
54c0f37646 [MPS] Add support for TopK k>16 (#94639)
Fixes: https://github.com/pytorch/pytorch/issues/78915

* Add the topk>16 support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94639
Approved by: https://github.com/DenisVieriu97
2023-02-12 00:57:53 +00:00
ed54a5d06b enable bf16 emb (#94163)
Merge https://github.com/pytorch/pytorch/pull/89199 and https://github.com/pytorch/pytorch/pull/91949 into one PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94163
Approved by: https://github.com/jianyuh, https://github.com/malfet, https://github.com/jgong5
2023-02-12 00:05:09 +00:00
020a0fbf62 [MPS] Perf update to convolutions. (#94661)
Map forward conv to depthwise for num_groups == input_channels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94661
Approved by: https://github.com/DenisVieriu97
2023-02-11 22:09:55 +00:00
4a762cb622 [MPS] Fix channels last copies in ELU,ReLU and Hardswish (#94664)
Fixes test_modules.py tests:
```
test_memory_format_nn_Hardswish_mps_float32
test_non_contiguous_tensors_nn_Hardswish_mps_float32
test_memory_format_nn_ReLU_mps_float32
```
Fixes elu when run with the `ChannelsLast` memory format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94664
Approved by: https://github.com/kulinseth
2023-02-11 22:05:21 +00:00
371f587c92 Dockerize lint jobs (#94255)
This is to minimize network flakiness when running lint jobs. I created a new Docker image for the linter and installed all linter dependencies there. After that, all linter jobs are converted to use the Nova generic Linux job https://github.com/pytorch/test-infra/blob/main/.github/workflows/linux_job.yml with the new image.

For a future task: I encountered this issue with the current mypy version we are using and Python 3.11 https://github.com/python/mypy/issues/13627. Fixing it requires upgrading mypy to a newer version, but that can be done separately (it requires formatting/fixing `*.py` files with the newer mypy version).

The `collect_env` linter job is currently not included here as it needs an older Python version (3.5). It could also be converted to use the same mechanism (with another Docker image, probably). This one rarely fails though.

### Testing

BEFORE
https://github.com/pytorch/pytorch/actions/runs/4130366955 took a total of ~14m

AFTER
https://github.com/pytorch/pytorch/actions/runs/4130712385 also takes a total of ~14m
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94255
Approved by: https://github.com/ZainRizvi
2023-02-11 21:56:19 +00:00
abfd293c39 functionalization: fix x.is_contiguous(channels_last) (#94195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94195
Approved by: https://github.com/ezyang
2023-02-11 21:07:08 +00:00
aba4fb9a16 fix functionalization resize stride compute (#94018)
uncovered from an OpInfo in inductor, when I turned on functionalization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94018
Approved by: https://github.com/ezyang
2023-02-11 21:07:08 +00:00
2b36d35b9c add torch.autograd._unsafe_set_version_counter API (#92924)
better description coming soon (but this is meant to fix https://github.com/pytorch/pytorch/issues/91093)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92924
Approved by: https://github.com/ezyang, https://github.com/alanwaketan, https://github.com/albanD
2023-02-11 21:07:08 +00:00
c74f438c01 [MPS] Fix the cat op for NHWC case (#94662)
* add unit test cat with non-contiguous

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94662
Approved by: https://github.com/DenisVieriu97
2023-02-11 19:43:33 +00:00
8ad10eab4d [Dynamo] Fix bug of calling super from class extended from metaclass (#94547)
Fixes #94299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94547
Approved by: https://github.com/jansel
2023-02-11 18:53:17 +00:00
d09cd15216 [Profiler] Defer recording startup python events (take 2) (#91684)
This is my commandeer of https://github.com/pytorch/pytorch/pull/82154 with a couple extra fixes.

The high level idea is that when we start profiling we see python frames which are currently executing, but we don't know what system TID created them. So instead we defer the TID assignment, and then during post processing we peer into the future and use the system TID *of the next* call on that Python TID.

As an aside, it turns out that CPython does some bookkeeping (ee821dcd39/Include/cpython/pystate.h (L159-L165), thanks @dzhulgakov for the pointer), but you'd have to do some extra work at runtime to know how to map their TID to ours so for now I'm going to stick to what I can glean from post processing alone.

As we start observing more threads it becomes more important to be principled about how we start up and shut down. (Since threads may die while the profiler is running.) #82154 had various troubles with segfaults that wound up being related to accessing Python thread pointers which were no longer alive. I've tweaked the startup and shutdown interaction with the CPython interpreter and it should be safer now.

Differential Revision: [D42336292](https://our.internmc.facebook.com/intern/diff/D42336292/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91684
Approved by: https://github.com/chaekit
2023-02-11 18:44:00 +00:00
8d45f555d7 [BE] [1/3] Rewrite super() calls in caffe2 and benchmarks (#94587)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94587
Approved by: https://github.com/ezyang
2023-02-11 18:19:48 +00:00
aa6f0ace2f Remove API declarations in Ops.hpp (#94532)
In #91257, we removed direct calls to the methods in ops.cpp, so this change updates things to also remove ops.hpp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94532
Approved by: https://github.com/kwen2501
2023-02-11 18:13:09 +00:00
a27bd42bb9 [ONNX] Use onnxruntime to run fx tests (#94638)
- Enable the mnist test
- Removed `max_pool2d` in the test because we don't have the op yet.
- Add aten::convolution
- Bump onnxscript version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94638
Approved by: https://github.com/BowenBao, https://github.com/wschin, https://github.com/titaiwangms
2023-02-11 15:32:03 +00:00
9dd7e83676 update xnnpack to newer version and update API usage in pytorch (#94330)
Summary:
Update XNNPACK to 51a987591a6fc9f0fc0707077f53d763ac132cbf (51a987591a)

Update the corresponding CMake and BUCK rules, as well as the generate_wrapper.py for the new version.

XNNPACK has changed a lot upstream, and we need to update it at this time for several reasons. The developer community has refactored the code, including API changes, which is evident from the changes in their CMakeLists.txt. Keeping up with upstream is crucial for our future development, and many projects rely on newer versions of XNNPACK, so updating the third-party library now makes sense. Because of the upstream API changes, we also need to update our usage of XNNPACK accordingly. In addition, we update the target build files and generate_wrapper.py to make this process more automatic. The original target files were missing some files, so we added them to the buck2 build files so that XNNPACK builds and its tests pass.

Test Plan:
buck2 build //xplat/third-party/XNNPACK:operators
buck2 build //xplat/third-party/XNNPACK:XNNPACK
buck2 test fbcode//caffe2/test:xnnpack_integration

Reviewed By: digantdesai

Differential Revision: D43092938

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94330
Approved by: https://github.com/digantdesai, https://github.com/albanD
2023-02-11 08:59:35 +00:00
e7a8af9376 don't warn on explicit fallback in inductor (#94643)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94643
Approved by: https://github.com/Chillee
2023-02-11 07:29:10 +00:00
4fe365774a Revert "[MPS] Add Python Module Bindings for the MPS backend (#94417)"
This reverts commit beb4f5bf396ec2d53defa73c81aac48c38360544.

Reverted https://github.com/pytorch/pytorch/pull/94417 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it seems to break MacOS test in trunk bae397ec63
2023-02-11 05:24:45 +00:00
77d9e36b0a [ONNX] Reduce 'find_mismatch' memory footprint by promptly freeing past sessions. (#94648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94648
Approved by: https://github.com/justinchuby
2023-02-11 05:06:12 +00:00
7f068b7978 [MPS] Add APIs to query current and driver allocated memory in MPSAllocator (#94649)
- Fixed the formatting in MPSAllocator.mm
- Added `getCurrentAllocatedMemory()` and `getDriverAllocatedMemory()` to query the memory allocations required for Memory Leak Detection in test_mps.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94649
Approved by: https://github.com/DenisVieriu97, https://github.com/kulinseth
2023-02-11 03:18:52 +00:00
6d1a9d7323 Revert "Mark ROCm trunk job as unstable (#94550)" (#94631)
This reverts commit 79ed6b246c768230aa1bf14eed804c8156a3f87f.

Repo.radeon.com issue is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94631
Approved by: https://github.com/huydhn, https://github.com/jithunnair-amd
2023-02-11 03:08:41 +00:00
50bc25baa0 Move ValueRanges into its own module (#94528)
I am going to use it in ShapeEnv shortly.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94528
Approved by: https://github.com/eellison
2023-02-11 02:54:49 +00:00
bae397ec63 Add filelock to MacOS dependencies (#94647)
This started to fail on trunk out of nowhere. Adding the filelock dependency to forward-fix the issue from d0cff06bcb
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94647
Approved by: https://github.com/clee2000
2023-02-11 02:14:41 +00:00
07cdea7cda inductor: fix guard_equals (#94506)
Fixes https://github.com/pytorch/pytorch/issues/94268.

In the code before https://github.com/pytorch/pytorch/pull/92609, there was an assertion in the `guard_equals` function.
```python
assert self.size_hint(expr) == 0, (expr, self.size_hint(expr))
```

In https://github.com/pytorch/pytorch/pull/92609, `guard_equals` has been changed to
```python
def guard_equals(self, left: Expr, right: Expr) -> Expr:
    self.shape_env.evaluate_expr(sympy.Eq(left, right))
    return left
```
Consider the case where `left` and `right` are both concrete values, for example `left = 10` and `right = 20`. In the current code, `self.shape_env.evaluate_expr(sympy.Eq(left, right))` will directly return `False`:
a81cf49d97/torch/fx/experimental/symbolic_shapes.py (L1380-L1385)

This returned value is not used anywhere and the `guard_equals` function will still `return left` in this case even though `left != right`.
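
As a rough sketch (not necessarily the exact change in this PR), restoring the original guarantee amounts to acting on that result instead of discarding it:

```python
def guard_equals(self, left: Expr, right: Expr) -> Expr:
    # Hypothetical fix: fail loudly when the equality does not hold,
    # rather than silently dropping the result of evaluate_expr.
    assert self.shape_env.evaluate_expr(sympy.Eq(left, right)), (left, right)
    return left
```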

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94506
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel, https://github.com/Chillee
2023-02-11 01:55:18 +00:00
c1c7eaf52b Prevent sym_int from showing up in FX graph (#94595)
Apply the optimization to floor instead of sym_int

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94595
Approved by: https://github.com/ngimel, https://github.com/bdhirsh
2023-02-11 01:43:05 +00:00
030209088f [MPS] Fix the regression with test_index_select_scalar() (#94645)
PR #94347 caused a regression in test_mps, which this patch fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94645
Approved by: https://github.com/DenisVieriu97
2023-02-11 01:36:51 +00:00
ceab30775b [Inductor] Enable fusion of mutation ops in narrow cases (#94110)
Currently we don't enable fusion of mutation ops in any case (we introduce a `StarDep` to prevent fusion with any upstream readers, to ensure the kernel mutating the buffer executes after them).

This results in cases like [this](https://gist.github.com/mlazos/3dcfd416033b3459ffea43cb91c117c9) where even though all of the other readers have been fused into a single kernel, the `copy_` is left by itself.

This PR introduces `WeakDep` and, after each fusion, a pass that checks whether other dependencies on the upstream fused node already guarantee that this kernel is fused after the prior readers. If they do, the `WeakDep` is pruned and the kernel performing the mutation can be fused with the upstream kernel. This allows Inductor to fuse epilogue `copy_`s introduced by functionalization on inference graphs.

[before code](https://gist.github.com/mlazos/3369a11dfd1b5cf5bb255313b710ef5b)
[after code](https://gist.github.com/mlazos/1005d8aeeba56e3a3e1b70cd77773c53)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94110
Approved by: https://github.com/jansel
2023-02-11 01:24:06 +00:00
7ce785b50b [MPS] Fix gelu forward and backward ops (#94529)
Forward pass:
```
fix gelu_out_mps key
add calculation for gelu with tanh
remove gelu from blocklist
```
Backward pass:
```
fix gelu_backward_out_mps key
uniform format
add calculation for tanh approximate backward pass
unblock grad test from blocklist
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94529
Approved by: https://github.com/razarmehr, https://github.com/kulinseth
2023-02-11 00:24:30 +00:00
507b8c3423 [MPS] Native implementation for addr (#94538)
```
addr_out_mps to perform res = beta * input + alpha * (vec1 ⊗ vec2)
move addr f16 to low precision list
move addr non-float types to unsupported list
add test_addr tests
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94538
Approved by: https://github.com/razarmehr
2023-02-11 00:16:50 +00:00
d51ca38ef0 Run test_serialization serially (for 2xlarge runners) (#94613)
Fixes https://github.com/pytorch/pytorch/issues/92746
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94613
Approved by: https://github.com/clee2000
2023-02-11 00:15:10 +00:00
680fc84e7b [dtensor] group public APIs together (#94524)
This PR groups distribute_tensor/module into api.py

It also renames some to non-public (ToTensor/FromTensor).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94524
Approved by: https://github.com/XilunWu
2023-02-10 23:40:34 +00:00
3d82d8d0ed [BE] Enable more flake8-comprehensions checks (#94601)
I applied some flake8 fixes and enabled checking for them in the linter. I also enabled some checks for my previous comprehensions PR.

This is a follow up to #94323 where I enable the flake8 checkers for the fixes I made and fix a few more of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94601
Approved by: https://github.com/ezyang
2023-02-10 23:40:29 +00:00
0b31ebf9e4 [MPS] Added zero check to inverse & fix for any op to avoid segfault issue (#94551)
Fixes the empty placeholder error in the inverse op. The change to the any op should also resolve previously seen segfaults.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94551
Approved by: https://github.com/kulinseth
2023-02-10 23:39:12 +00:00
45edf9a2ea Reland: [Autograd] Use in-place input accumulation fast path for dense Tensors. (#90217)
Identical to https://github.com/pytorch/pytorch/pull/88339 except with a `.has_storage()` check before `.storage()`.

Differential Revision: [D41737935](https://our.internmc.facebook.com/intern/diff/D41737935/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90217
Approved by: https://github.com/ngimel
2023-02-10 23:29:55 +00:00
beb4f5bf39 [MPS] Add Python Module Bindings for the MPS backend (#94417)
- This PR is a prerequisite for the upcoming Memory Leak Detection PR.
- Enable global manual seeding via `torch.manual_seed()` + test case
- Add `torch.mps.synchronize()` to wait for MPS stream to finish + test case
- Enable the following python interfaces for MPS:
  `torch.mps.[get_rng_state(), set_rng_state(), synchronize(), manual_seed(), seed()]`
- Added some test cases in test_mps.py
- Added `mps.rst` to document the `torch.mps` module.
- Fixed the failure with `test_public_bindings.py`

Description of new files added:
- `torch/csrc/mps/Module.cpp`: implements `torch._C` module functions for `torch.mps` and `torch.backends.mps`.
- `torch/mps/__init__.py`: implements Python bindings for `torch.mps` module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94417
Approved by: https://github.com/albanD
2023-02-10 23:18:41 +00:00
d0cff06bcb Call MPSAllocator callbacks when allocation fails. (#94133)
Fixes #87374

@kulinseth and @albanD This makes the MPSAllocator call the MPSAllocatorCallbacks when getting a free buffer and the first allocation attempt fails. Users can register callbacks that might free a few buffers, and the allocation will then be retried.

The reason why we need the `recursive_mutex` is that, since callbacks are supposed to free memory, they will eventually call free_buffer(), which locks the same `mutex` that's used for allocation. This approach is similar to what's used with the `FreeMemoryCallback` in the `CUDACachingAllocator`.

This PR tries to be as minimal as possible, but there could be some additional improvements/cleanups, like:

- In current main, there's no way callbacks can be called, so we could probably rename the callback registry to something that reflects the same naming in the CudaAllocator:

996cc1c0d0/c10/cuda/CUDACachingAllocator.h (L14-L24)

- Review the EventTypes here:

996cc1c0d0/aten/src/ATen/mps/MPSAllocator.h (L18-L23)

- And IMHO a nice improvement would be if callbacks could be aware of AllocParams, so they can decide to be more aggressive or not depending on how much memory is requested. So I'd pass AllocParams in the signature of the executeCallback instance:

996cc1c0d0/aten/src/ATen/mps/MPSAllocator.h (L25)

Let me know if you think we could sneak those changes into this PR or if it's better to propose them in other, smaller PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94133
Approved by: https://github.com/kulinseth, https://github.com/razarmehr, https://github.com/albanD
2023-02-10 23:09:21 +00:00
948cd61afc add fallthrough kernel for AutogradMeta key (#94603)
The other `Autograd[Backend]` keys all have fallthrough kernels registered to them, but `AutogradMeta` was missing the fallthrough kernel.

This is a problem for custom ops that don't have autograd support, if you try to run them with meta tensors. If you have a custom op, and register a CPU and a Meta kernel, then:

(1) if you run the op with cpu tensors, it will dispatch straight to the CPU kernel (as expected)

(2) if you run the op with meta tensors, you will error - because we don't have a fallthrough registered to the AutogradMeta key, we will try to dispatch to the AutogradMeta key and error, since the op author hasn't provided an autograd implementation.

Here's a repro that I confirmed now works:

```
import torch
from torch._dispatch.python import enable_python_dispatcher
from torch._subclasses.fake_tensor import FakeTensorMode

lib = torch.library.Library("test", "DEF")
impl_cpu = torch.library.Library("test", "IMPL", "CPU")
impl_meta = torch.library.Library("test", "IMPL", "Meta")

def foo_impl(x):
    return x + 1

lib.define("foo(Tensor a) -> Tensor")
impl_meta.impl("foo", foo_impl)
impl_cpu.impl("foo", foo_impl)

with enable_python_dispatcher():
    a = torch.ones(2, device='meta')
    print("@@@@@")
    b = torch.ops.test.foo.default(a)
    print(b)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94603
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-10 22:44:52 +00:00
0176405c69 fix: check if double to i64 is in well-formed range (#94290)
Fixes #88951

The output shape of upsample is computed as `(i64)idim * (double)scale` and then cast back to `i64`. If the input scale is ill-formed (say, a negative number as in #88951), making `(double)(idim * scale)` fall outside the range of `i64`, the cast is undefined behaviour.

To fix it, we just check if `(double)(idim * scale)` can fit into `i64`.
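
A conceptual sketch of the check in Python (the actual fix lives in the C++ upsample code; the function name here is made up for illustration):

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1

def checked_output_size(idim: int, scale: float) -> int:
    out = idim * scale  # computed in double precision, as in the kernel
    if not (INT64_MIN <= out <= INT64_MAX):
        raise ValueError(f"scale={scale} gives output size {out}, which does not fit in int64")
    return int(out)
```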
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94290
Approved by: https://github.com/malfet
2023-02-10 22:35:22 +00:00
3fb08199f6 Remove unnecessary replace on self.expr (#94408)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94408
Approved by: https://github.com/jbschlosser
2023-02-10 22:16:31 +00:00
480e0c0198 Remove anaconda-prune yml files as these have been moved to test-infra (#94610)
Merge after https://github.com/pytorch/test-infra/pull/2691

These workflows would run from test-infra repository instead, after the PR (https://github.com/pytorch/test-infra/pull/2691) is merged.

Not deleting the anaconda-prune/ scripts because they may come in handy during a release if there is a need to delete packages (no need to go find these scripts in test-infra).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94610
Approved by: https://github.com/atalman
2023-02-10 22:04:40 +00:00
c53bd0dd30 Mitigate broken test_coalesce_reference_cycle test on dynamo (#94622)
The test has been disabled and shows up on https://github.com/pytorch/test-infra/blob/generated-stats/stats/disabled-tests-condensed.json, but then the JSON file downloaded by the runner doesn't seem to have it.

Disable it explicitly to keep trunk green while investigating.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94622
Approved by: https://github.com/weiwangmeta
2023-02-10 21:59:36 +00:00
728dfeee48 [MPS] Fix ops with bool issues in macOS Monterey (#94464)
Summary:
- Remove redundant bool casts from scatter/gather
- Make the workarounds for scatter/gather (for bool/uint8 data types) OS specific - use them only in macOS Monterey, ignore them starting with macOS Ventura
- Make all tensors ranked in scatter

Fixes following tests:
```
test_output_match_slice_scatter_cpu_bool
test_output_match_select_scatter_cpu_bool
test_output_match_diagonal_scatter_cpu_bool
test_output_match_repeat_cpu_bool
test_output_match_rot90_cpu_bool
etc..
```

Still failing on macOS Monterey (needs additional investigation):
```
test_output_match_scatter_cpu_bool
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94464
Approved by: https://github.com/kulinseth
2023-02-10 21:36:25 +00:00
5b1cedacde [BE] [2/3] Rewrite super() calls in functorch and torch (#94588)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-10 21:16:33 +00:00
d14a59b63c [MPS] Update merge rule list. (#94619)
cc. @DenisVieriu97
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94619
Approved by: https://github.com/malfet
2023-02-10 21:07:09 +00:00
25619bdeb6 [ONNX][Experimental] FX Exporter w/ ONNX Script and ATen Lib (#94566)
* Symbolic ONNX Exporter for TB Scale Models.
* Based on ONNX Script and ATen Lib.
* Produces diagnostics in Sarif.

Co-authored-by: Justin Chu <justinchu@microsoft.com>
Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>
Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94566
Approved by: https://github.com/abock
2023-02-10 20:45:01 +00:00
8d8fb7efe7 [ONNX] Update diagnostics system (#94565)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94565
Approved by: https://github.com/abock
2023-02-10 20:45:01 +00:00
88d0235b73 [ONNX] Update CI test environment; Add symbolic functions (#94564)
* CI Test environment to install onnx and onnx-script.
* Add symbolic function for `bitwise_or`, `convert_element_type` and `masked_fill_`.
* Update symbolic function for `slice` and `arange`.
* Update .pyi signature for `_jit_pass_onnx_graph_shape_type_inference`.

Co-authored-by: Wei-Sheng Chin <wschin@outlook.com>
Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94564
Approved by: https://github.com/abock
2023-02-10 20:44:59 +00:00
c5c7687b74 Allow FakeTensorProp to run on graphs traced with some None inputs (#94569)
Without this tiny change in `torch/_subclasses/fake_tensor.py`, the added test may fail with
```
TypeError: cannot create weak reference to 'NoneType' object
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94569
Approved by: https://github.com/ezyang
2023-02-10 20:38:22 +00:00
534db77e73 Autotune pointwise/reduction in max_autotune mode (#94556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94556
Approved by: https://github.com/ngimel
2023-02-10 19:41:39 +00:00
111c86bfe5 Revert "[CI] Move M1 testing to periodic (#94608)"
This reverts commit 5c16788e5ff5ed1b3eba9c8fde5fc0910c495fa8.

Reverted https://github.com/pytorch/pytorch/pull/94608 on behalf of https://github.com/malfet due to We have more runners now, let's see what will happen
2023-02-10 19:41:04 +00:00
7c4acdad4a [MPS] Fix the crash in huberloss with Float16 (#94567)
- Also fix FP16 correctness issues in several other ops by lowering their FP16 precision in the new list `FP16_LOW_PRECISION_LIST`.
- Add atol/rtol to the `AssertEqual()` of Gradient tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94567
Approved by: https://github.com/kulinseth
2023-02-10 19:20:29 +00:00
d8f4026ebf Continue support sharding pipes in tud.datapipes.iter.grouping as deprecated (#94527)
Summary:
https://github.com/pytorch/pytorch/pull/94095 moves this into `tud.datapipes.iter.sharding`. However, since this was previously a public API, that is a BC-breaking change.

As discussed in https://github.com/pytorch/data/pull/987#issuecomment-1422440049, we will keep backward-compatible support but with a deprecation warning.

Differential Revision: D43161015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94527
Approved by: https://github.com/ejguan, https://github.com/NivekT
2023-02-10 18:42:10 +00:00
5c16788e5f [CI] Move M1 testing to periodic (#94608)
To mitigate https://github.com/pytorch/pytorch/issues/94607

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94608
Approved by: https://github.com/albanD, https://github.com/ZainRizvi, https://github.com/weiwangmeta, https://github.com/huydhn
2023-02-10 18:23:05 +00:00
e116ca93e1 Run test_torchinductor*.py with implicit_fallbacks=False (#94039)
This way it errors out for ops that don't have decomps and
requires you to add explicit fallbacks to lowering.py

Turns out there are a lot, and this commit adds them as well.
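
A minimal sketch of toggling the config option named in the title (how the test files actually apply it may differ):

```python
import torch._inductor.config as inductor_config

# With implicit fallbacks disabled, an op without a decomposition or an explicit
# lowering raises instead of silently falling back, surfacing the missing entries.
inductor_config.implicit_fallbacks = False
```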

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94039
Approved by: https://github.com/lezcano, https://github.com/jansel, https://github.com/ngimel
2023-02-10 18:10:56 +00:00
e44586a78f Pass input tensor __dict__ along to placeholder nodes (#94080)
```
import torch
import torch.nn as nn

import torch._dynamo.config
import torch._inductor.config

def pre_attention_state_ops(input, mems, state):
    lc_key = state[0]
    lc_val = state[1]
    bar = []
    for i in range(0, 4):
        bar2 = []
        for j in range(0, 3):
            bar2.append(
                lc_key + lc_val + torch.tensor([0.1, 0.25, 0.4, 0.5, 0.1])
            )
        bar.append(bar2)

    return bar

mems = torch.tensor([[[1.8364, 0.2724, -1.4917, -0.4367, 0.8640]]])
state = [
    torch.tensor([[[1.0517, 0.3848, -0.6472, 0.0823, 0.9116]]]),
    torch.tensor([[[1.0517, 0.3848, -0.6472, 0.0823, 0.9116]]]),
]
i = torch.tensor(
    [
        [0.0313, -0.1487, -0.3846, -0.5321],
        [-1.7073, 1.3331, -0.0890, -1.4935],
        [-0.8314, -0.1862, -0.5935, 1.5232],
    ]
)

torch._dynamo.tag(mems, "MEMS")
torch._dynamo.tag(i, "FOO")
torch._dynamo.tag(state[0], "STATE_0")
torch._dynamo.tag(state[1], "HMMM")

exported = torch._dynamo.export(pre_attention_state_ops, i, mems, state)
out_graph = exported[0]

dynamo_result = out_graph(i, mems, state)
nodes = list(out_graph.graph.nodes)
placeholders = [node for node in nodes if node.op == "placeholder"]
for placeholder in placeholders:
    if "tags" in placeholder.meta:
        print("PLACEHOLDER TAGS?", placeholder.meta["tags"])

```

prints

PLACEHOLDER TAGS? ['STATE_0']
PLACEHOLDER TAGS? ['HMMM']

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94080
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-02-10 18:09:41 +00:00
9171f7d4cd [BE] Modernize PyTorch even more for 3.8 with pyupgrade (#94520)
Applies some more pyupgrade fixits to PyTorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94520
Approved by: https://github.com/ezyang
2023-02-10 18:02:50 +00:00
70026aaad6 [SDPA] update type hint for scaled_dot_product_attention and documentation (#94008)
# Summary
- Adds type hinting support for SDPA
- Updates the documentation adding warnings and notes on the context manager
- Adds scaled_dot_product_attention to the non-linear activation function section of nn.functional docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94008
Approved by: https://github.com/cpuhrsch
2023-02-10 18:02:43 +00:00
9bef1ebb9e Fix div by fp64 scalar issue on xla device (#94459)
This PR fixes https://github.com/pytorch/xla/issues/4574. I'll create a separate test PR in pytorch/xla repo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94459
Approved by: https://github.com/ezyang
2023-02-10 17:57:47 +00:00
joe
67513aee6d Cleaning up some logic in tools/shared/cwrap_common.py (#94475)
Noticed some code that needed some adjustment
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94475
Approved by: https://github.com/ezyang
2023-02-10 17:49:11 +00:00
51cec7bf52 add compile reason in InstructionTranslator RETURN_VALUE (#94176) (#94367)
add compile reason in InstructionTranslator RETURN_VALUE (#94176)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94367
Approved by: https://github.com/jansel
2023-02-10 17:43:45 +00:00
92d8c4b37c [MPS] Fix cumsum for integral data types (#94530)
- Make intermediate type for cumsum ScalarType::Int: fixes https://github.com/pytorch/pytorch/issues/90635
- Add support for negative dimensions in cumsum: fixes https://github.com/pytorch/pytorch/issues/92329
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94530
Approved by: https://github.com/kulinseth
2023-02-10 17:40:29 +00:00
d990ddadd5 [fx] Fix matching args (#94375)
To match nodes within the graph, the matcher currently flattens the arguments and compares them element by element. However, if it believes that a list input contains only literals, it does not flatten the list and instead compares the lists directly against each other. It decides that a list is a literal list by checking whether its first element is a node, which doesn't work in some cases (like the test cases I added).
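
An illustrative sketch of the heuristic described above (not the matcher's actual code; the helper name is made up):

```python
from torch.fx import Node

def looks_like_literal_list(arg) -> bool:
    # Old heuristic: only the first element is inspected.
    return isinstance(arg, (list, tuple)) and len(arg) > 0 and not isinstance(arg[0], Node)

# A mixed list such as [1, some_node] is then misclassified as "all literals",
# since only arg[0] is checked; this is the failure mode the PR addresses.
```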
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94375
Approved by: https://github.com/SherlockNoMad
2023-02-10 17:37:57 +00:00
db6cfff827 fix: forbid multi-index for index_select over scalar (#94347)
Fixes #88940

According to the [doc](https://pytorch.org/docs/stable/generated/torch.index_select.html):
1. "The returned tensor has the same number of dimensions as the original tensor (`input`). "
2.  "The `dim`th dimension has the same size as the length of `index`; other dimensions have the same size as in the original tensor."

These two conditions cannot be satisfied at the same time if the `input` is a scalar && `index` has multiple values: because a scalar at most holds one element (according to property 1, the output is a scalar), it is impossible to satisfy "The `dim`th dimension has the same size as the length of `index`" when `index` has multiple values.

However, currently, if we do so we either get:

1. Buffer overflow with ASAN;
2. Or (w/o ASAN) silently returns outputs that are not consistent with the doc (`x.index_select(0, torch.Tensor([0, 0, 0]).int())` returns `x`).

As a result, we should explicitly reject such cases.
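
A small repro of the now-rejected case (a sketch; the exact error type and message raised by the fix may differ):

```python
import torch

x = torch.tensor(5.0)                # 0-dim (scalar) tensor
idx = torch.tensor([0, 0, 0]).int()  # multi-element index

try:
    x.index_select(0, idx)           # previously overflowed or silently returned x
except (IndexError, RuntimeError) as e:
    print(e)
```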
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94347
Approved by: https://github.com/malfet
2023-02-10 17:17:09 +00:00
0d0ebcdfe5 feature: adding the ability to restore shapes after loading a traced model (#90744)
Adds the ability to store inputs used in tracing models when calling torch.jit.save and restore the input shapes using torch.jit.load if the appropriate variables are set.

Fixes [89185](https://github.com/pytorch/pytorch/issues/89185)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90744
Approved by: https://github.com/davidberard98
2023-02-10 17:12:52 +00:00
c7c7238976 Fix bug in unsqueeze_nested stride calculation (#88688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88688
Approved by: https://github.com/cpuhrsch
2023-02-10 17:00:04 +00:00
889a4640a0 [ONNX] Skip import test for experimental files (#94552)
`torch.onnx._internal.fx` is experimental and is not imported when `import torch`/`import torch.onnx`.
Need to skip it in this test as it depends on `onnx-script`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94552
Approved by: https://github.com/kit1980
2023-02-10 15:58:49 +00:00
c620ece726 port sparse_mm.reduce to pytorch and optimize it on CPU (#83727)
### Motivation of this PR

This patch is to migrate `spmm_reduce` from `torch-sparse` (a 3rd party dependency for PyG) to `torch`, which is a response to the initial proposal for fusion of **Gather, Apply Scatter** in Message Passing of GNN inference/training. https://github.com/pytorch/pytorch/issues/71300

**GAS** is the major step of Message Passing; its behavior can be classified into 2 kinds depending on the storage type of `EdgeIndex`, which records the connections between nodes:

* COO: the hotspot is `scatter_reduce`
* CSR: the hotspot is `spmm_reduce`

The reduce type can be chosen from: "sum", "mean", "max", "min".

Extend `torch.sparse.mm` with a `reduce` argument; it maps to `torch.sparse_mm.reduce` internally.
`sparse_mm_reduce` is registered under the TensorTypeId of `SparseCsrCPU`, and this operator requires an internal interface `_sparse_mm_reduce_impl` which has dual outputs:
* `out` - the actual output
* `arg_out` - records the output indices of the non-zero elements if the reduce type is "max" or "min"; this is only useful for training, so for inference it will not be calculated.
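
A hedged usage sketch of the extended Python API (the exact signature/keyword may differ from this):

```python
import torch

crow = torch.tensor([0, 2, 3])
col = torch.tensor([0, 1, 1])
val = torch.tensor([1.0, 2.0, 3.0])
a = torch.sparse_csr_tensor(crow, col, val, size=(2, 2))  # CSR sparse matrix
b = torch.randn(2, 4)

out = torch.sparse.mm(a, b, reduce="sum")                 # fused reduce variant
```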

### Performance

Benchmarked GCN on ogbn-products on a single Xeon socket; the workload is improved by `4.3x` with this patch.

The performance benefit for training will be bigger: the original backward impl for `sum|mean` is sequential, and the original backward impl for `max|min` is not fused.

#### before:
```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
       torch_sparse::spmm_sum        97.09%       56.086s        97.09%       56.088s        6.232s             9
                 aten::linear         0.00%      85.000us         1.38%     795.485ms      88.387ms             9
                 aten::matmul         0.00%      57.000us         1.38%     795.260ms      88.362ms             9
                     aten::mm         1.38%     795.201ms         1.38%     795.203ms      88.356ms             9
                   aten::relu         0.00%      50.000us         0.76%     440.434ms      73.406ms             6
              aten::clamp_min         0.76%     440.384ms         0.76%     440.384ms      73.397ms             6
                   aten::add_         0.57%     327.801ms         0.57%     327.801ms      36.422ms             9
            aten::log_softmax         0.00%      23.000us         0.10%      55.503ms      18.501ms             3
```

#### after:
```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
               aten::spmm_sum        87.35%       11.826s        87.36%       11.827s        1.314s             9
                 aten::linear         0.00%      92.000us         5.87%     794.451ms      88.272ms             9
                 aten::matmul         0.00%      62.000us         5.87%     794.208ms      88.245ms             9
                     aten::mm         5.87%     794.143ms         5.87%     794.146ms      88.238ms             9
                   aten::relu         0.00%      53.000us         3.35%     452.977ms      75.496ms             6
              aten::clamp_min         3.35%     452.924ms         3.35%     452.924ms      75.487ms             6
                   aten::add_         2.58%     348.663ms         2.58%     348.663ms      38.740ms             9
                 aten::argmax         0.42%      57.473ms         0.42%      57.475ms      14.369ms             4
            aten::log_softmax         0.00%      22.000us         0.39%      52.605ms      17.535ms             3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83727
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch, https://github.com/rusty1s, https://github.com/pearu
2023-02-10 15:56:40 +00:00
24ae50bcc7 Add config option to reduce warnings in inductor (#94413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94413
Approved by: https://github.com/ezyang
2023-02-10 15:44:15 +00:00
1d3980656c [MPS] Fix min/max_reduction_with_dim ops (#94386)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94386
Approved by: https://github.com/DenisVieriu97, https://github.com/razarmehr
2023-02-10 15:23:47 +00:00
0fe11589df [MPS] Add im2col and col2im to Fallback (#94491)
These are not in the hot path  as they are mostly used in Preprocessing layers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94491
Approved by: https://github.com/razarmehr
2023-02-10 15:22:59 +00:00
a21bddcc90 WelfordOps: Remove combine_t and use acc_scalar_t instead (#94522)
`combine_t` is the type used to represent the number of elements seen so far as
a floating point value (acc.nf). It is always used in calculations with other
values of type `acc_scalar_t` so there is no performance gained by making this a
separate template argument. Furthermore, when calculating the variance on CUDA
it is always set to `float` which means values are unnecessarily truncated
before being immediately promoted to `double`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94522
Approved by: https://github.com/ngimel
2023-02-10 15:19:46 +00:00
e22e323bea [decomp] Use var_mean in native_batch_norm decomposition (#94140)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94140
Approved by: https://github.com/ngimel
2023-02-10 15:19:46 +00:00
e844120b2f Fix embedding_dense_backward to not cast indiices to floats (#94572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94572
Approved by: https://github.com/ngimel
2023-02-10 12:44:03 +00:00
1770ccf6c8 Don't throw tf32 warning if no nodes in graph are matmuls + fp32 + cuda (#94561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94561
Approved by: https://github.com/ngimel, https://github.com/eellison, https://github.com/malfet
2023-02-10 12:44:03 +00:00
f152a79be9 Revert "update aten op overload to not use from to avoid compile errors (#89797)"
This reverts commit 021d2676941976d6a35a3b0e2034238889a6c872.

Reverted https://github.com/pytorch/pytorch/pull/89797 on behalf of https://github.com/jeanschmidt due to breaking internal builds - more details on https://fburl.com/sandcastle/bz8mgkil
2023-02-10 11:32:25 +00:00
a5daea69fb teach inductor to handle floor (#94341)
Per title, this happens when there is upsampling with a non-integer scale.
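
A minimal repro sketch of that pattern, assuming an illustrative shape and scale factor:

```py
import torch
import torch.nn.functional as F

def upsample(x):
    # A non-integer scale factor makes the output size a floor() expression
    # over the (possibly symbolic) input size.
    return F.interpolate(x, scale_factor=2.5, mode="nearest")

opt = torch.compile(upsample, dynamic=True)
out = opt(torch.randn(1, 3, 17, 17))
print(out.shape)  # spatial dims are floor(17 * 2.5) = 42
```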

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94341
Approved by: https://github.com/ezyang
2023-02-10 11:21:57 +00:00
02b8a7f473 inductor: don't do transpose vectorization if input ld depends on the innermost var (#94493)
Fixed https://github.com/pytorch/pytorch/issues/94269.

For the following case:

```
import torch
import torchvision
#import intel_extension_for_pytorch

import torch._dynamo
from torch._inductor import config

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        constant_pad_nd = x
        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:195, code: kv = kv.unfold(2, self.win_size, self.block_size).unfold(3, self.win_size, self.block_size)
        as_strided: f32[1, 384, 2, 20, 12] = torch.ops.aten.as_strided.default(constant_pad_nd, [1, 384, 2, 20, 12], [153600, 1, 61440, 384, 7680]);  constant_pad_nd = None
        as_strided_1: f32[1, 384, 2, 2, 12, 12] = torch.ops.aten.as_strided.default(as_strided, [1, 384, 2, 2, 12, 12], [153600, 1, 61440, 3072, 7680, 384]);  as_strided = None

        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:197, code: kv = kv.reshape(
        clone_1: f32[1, 384, 2, 2, 12, 12] = torch.ops.aten.clone.default(as_strided_1, memory_format = torch.contiguous_format);  as_strided_1 = None
        _unsafe_view_1: f32[8, 48, 4, 144] = torch.ops.aten._unsafe_view.default(clone_1, [8, 48, 4, 144]);  clone_1 = None
        permute_2: f32[8, 4, 144, 48] = torch.ops.aten.permute.default(_unsafe_view_1, [0, 2, 3, 1]);  _unsafe_view_1 = None
        # File: /home/xiaobing/miniconda3/envs/pytorch_te_binary/lib/python3.8/site-packages/timm/models/layers/halo_attn.py:202, code: k, v = torch.split(kv, [self.dim_head_qk, self.dim_head_v], dim=-1)
        split_with_sizes = torch.ops.aten.split_with_sizes.default(permute_2, [16, 32], -1);  permute_2 = None
        getitem: f32[8, 4, 144, 16] = split_with_sizes[0]
        getitem_1: f32[8, 4, 144, 32] = split_with_sizes[1];  split_with_sizes = None
        permute_3: f32[8, 4, 16, 144] = torch.ops.aten.permute.default(getitem, [0, 1, 3, 2]);  getitem = None
        expand_1: f32[8, 4, 16, 144] = torch.ops.aten.expand.default(permute_3, [8, 4, 16, 144]);  permute_3 = None
        clone_3: f32[8, 4, 16, 144] = torch.ops.aten.clone.default(expand_1, memory_format = torch.contiguous_format);  expand_1 = None
        return clone_3

model = Model().eval()
opt_model = torch._dynamo.optimize('inductor')(model)
x = torch.randn(1, 384, 20, 20).to(memory_format=torch.channels_last)

ref = model(x)

with torch.no_grad():
    for i in range(3):
        out = opt_model(x)

print(torch.equal(ref, out))
```

The generated code before this PR is:

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/ni/cniims6nap7c5wars7cmtbjr3mw6b5cxyoyxmsu7ro2l5fkrwatl.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            #pragma GCC ivdep
            for(long i1=0; i1<4; i1+=1)
            {
                #pragma GCC ivdep
                for(long i2=0; i2<1; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<9; i3+=1)
                    {
                        float tmp0[16*16] __attribute__ ((aligned (16)));
                        at::vec::transpose_mxn<float,16,16>(in_ptr0 + (16*i2) + (48*i0) + (384*((16*i3) % 12)) + (3072*(i1 % 2)) + (7680*(((4*i3) / 3))) + (61440*(i1 / 2)), ((-7680)*(i3 / 12)) + ((-384)*(i3 % 12)) + (384*((1 + i3) % 12)) + (7680*(((1 + i3) / 12))), tmp0, 16);
                        for (long i2_inner = 0; i2_inner < 16; i2_inner++)
                        {
                            auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + 16*i2_inner);
                            tmp1.store(out_ptr0 + (16*i3) + (144*i2_inner) + (2304*i1) + (2304*i2) + (9216*i0));
                        }
                    }
                    #pragma GCC ivdep
                    for(long i3=144; i3<144; i3+=1)
                    {
                        for (long i2_inner = 0; i2_inner < 16; i2_inner++)
                        {
                            auto tmp0 = in_ptr0[i2_inner + (16*i2) + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                            out_ptr0[i3 + (144*i2_inner) + (2304*i1) + (2304*i2) + (9216*i0)] = tmp0;
                        }
                    }
                }
                #pragma GCC ivdep
                for(long i2=16; i2<16; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<144; i3+=1)
                    {
                        auto tmp0 = in_ptr0[i2 + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                        out_ptr0[i3 + (144*i2) + (2304*i1) + (9216*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf0 = empty_strided((8, 4, 16, 144), (9216, 2304, 144, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg0_1
    return (buf0, )
```

After:

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/dm/cdmaihqxwe73zkb3he2zizktpq5uujetg2db26c3r4lgsmlx3b4c.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            #pragma GCC ivdep
            for(long i1=0; i1<4; i1+=1)
            {
                #pragma GCC ivdep
                for(long i2=0; i2<16; i2+=1)
                {
                    #pragma GCC ivdep
                    for(long i3=0; i3<144; i3+=1)
                    {
                        auto tmp0 = in_ptr0[i2 + (48*i0) + (384*(i3 % 12)) + (3072*(i1 % 2)) + (7680*(i3 / 12)) + (61440*(i1 / 2))];
                        out_ptr0[i3 + (144*i2) + (2304*i1) + (9216*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf0 = empty_strided((8, 4, 16, 144), (9216, 2304, 144, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg0_1
    return (buf0, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((1, 384, 20, 20), (153600, 1, 7680, 384), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1]))

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94493
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/EikanWang
2023-02-10 09:04:45 +00:00
3a12b16fb0 Renamed passes to options in torch.compile (#94500)
@jansel expressed a preference for this (as most of our options are *not* passes), and I agree. I also think that `fullgraph` could be changed, but I don't know what I'd change it to. I considered `strict`, but some folks objected to that.
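
A small illustration of the renamed keyword; the specific backend knob shown (`max_autotune`) is just an assumed example:

```py
import torch

model = torch.nn.Linear(8, 8)

# Backend knobs are now passed via `options` (previously `passes`).
compiled = torch.compile(model, backend="inductor", options={"max_autotune": True})
```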

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94500
Approved by: https://github.com/voznesenskym, https://github.com/soumith, https://github.com/jansel
2023-02-10 08:19:41 +00:00
59e8756676 [MPS] Fix the Channels last bug with GradientWithInput. (#94384)
* Fix the Channels last bug with GradientWithInput.
The bug was mentioned in :
https://github.com/pytorch/pytorch/issues/77764#issuecomment-1312241902
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94384
Approved by: https://github.com/razarmehr
2023-02-10 07:36:06 +00:00
8dbe63c99e [MPS] Casting int64 to int32 for reduction ops and raise warning. (#94484)
Currently casting it as a workaround until we have full support in the OS.
Fixes https://github.com/pytorch/pytorch/pull/88319#issuecomment-1424010624

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94484
Approved by: https://github.com/razarmehr
2023-02-10 07:34:58 +00:00
715f3733ef don't call floor for symint unless necessary (#94365)
Per @ezyang's advice, added a magic sym_int method. This works for the 1.0 * s0 optimization, but can't evaluate `a>0` for some args, and still misses some optimizations that the model rewrite achieves, so swin still fails
(the rewrite replaces `B = int(windows.shape[0] / (H * W / window_size / window_size))` with `B = (windows.shape[0] // int(H * W / window_size / window_size))`, and the model then passes).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94365
Approved by: https://github.com/ezyang
2023-02-10 07:17:11 +00:00
89df0e4253 Enable Python-3.11 binary builds across the board (#94430)
Most of the work is outside of the repositories and consists of cloning projects from https://github.com/AnacondaRecipes/ and building:
- [typing_extensions](https://github.com/AnacondaRecipes/typing_extensions-feedstock)
- [pyyaml](https://github.com/AnacondaRecipes/pyyaml-feedstock)
- [setuptools](https://github.com/AnacondaRecipes/setuptools-feedstock) v59.8.0, needed to build `numpy`. The trick here is to add `add_pip_as_python_dependency: off` to one's `.condarc`
- [cython](https://github.com/AnacondaRecipes/cython-feedstock)
- [mkl-service](https://github.com/AnacondaRecipes/mkl-service-feedstock)
- [numpy-base](https://github.com/AnacondaRecipes/numpy-feedstock) (against mkl-2021.4), i.e. add `blas_impl: "mkl"` and `mkl: ">=2021.4.0,<2022.0a0"` to one's `conda_build_config.yaml`
- [mkl_random](https://github.com/AnacondaRecipes/mkl_random-feedstock)
- [mkl_fft](https://github.com/AnacondaRecipes/mkl_fft-feedstock)
- [numpy](https://github.com/AnacondaRecipes/numpy-feedstock)
- [mpmath](https://github.com/AnacondaRecipes/mpmath-feedstock)
- [sympy](https://github.com/AnacondaRecipes/sympy-feedstock)

Anaconda build system is really modern, so in order to be able to build:
- x86 MacOS packages: one needs to install the MacOS 10.10 SDK from 2014, still available at https://github.com/phracker/MacOSX-SDKs/releases, and reference it as the conda build sysroot, as follows: `CONDA_BUILD_SYSROOT: /Library/Developer/CommandLineTools/SDKs/MacOSX10.10.sdk`
- Windows packages: "MSVC v141 - VS 2017 C++ x64/86 build tools (v14.16)" is needed, which is likely still available as a Visual Studio component

In addition, a pretty trivial tweak to the build rules was made in cf4fa8900b.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94430
Approved by: https://github.com/seemethere, https://github.com/weiwangmeta, https://github.com/albanD, https://github.com/atalman
2023-02-10 06:10:27 +00:00
a1f15fb987 [MPS] Fix batchnorm forward and backward pass (#94351)
Fixes batchnorm forward/backward pass and layer_norm:

Batchnorm Forward pass:
```
- fix batch_norm_mps_out key
- return 1/sqrt(var+epsilon) instead of var
- return empty tensor for mean and var if train is not enabled
- remove native_batch_norm from block list
```

Batchnorm Backward pass:
```
- add revert calculation for save_var used in backward path
- add backward test for native_batch_norm and _native_batch_norm_legit
```

Layer norm:
```
- remove the duplicate calculation from layer_norm_mps
- enable native_layer_norm backward test
- raise atol rtol for native_layer_norm
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94351
Approved by: https://github.com/razarmehr
2023-02-10 05:53:36 +00:00
2ad29009bf [MPS] Fix addmm calculation (#94534)
Ignore input when beta is 0, so that `nan` and `inf` will not be propagated.
Case already part of test_mps at https://github.com/pytorch/pytorch/blob/master/test/test_mps.py#L6308
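
A minimal sketch of the intended semantics (shown with generic tensors; the fix itself targets the MPS backend):

```py
import torch

inp = torch.full((2, 2), float("nan"))      # would poison the result if scaled by beta
mat1 = torch.randn(2, 3)
mat2 = torch.randn(3, 2)

out = torch.addmm(inp, mat1, mat2, beta=0)  # input is ignored when beta == 0
assert not torch.isnan(out).any()
```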
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94534
Approved by: https://github.com/kulinseth
2023-02-10 05:05:56 +00:00
10c430ba0a Revert "Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)"
This reverts commit 2a5851735ae4dc33ab4bc11c0b70d61102481f35.

Reverted https://github.com/pytorch/pytorch/pull/94363 on behalf of https://github.com/desertfire due to TIMM models start to show flaky failures after this PR, need more investigation
2023-02-10 04:40:32 +00:00
a1d210de44 Add exception handlers for stoll in jit/frontend/schema_type_parser.cpp (#94295)
Hi!

I've been fuzzing different pytorch modules, and found a few crashes.

Specifically, I'm talking about `schema_type_parser.cpp` and `irparser.cpp`. Inside these files, different standard conversion functions are used (such as `stoll`, `stoi`, `stod`, `stoull`). However, default `std` exceptions, such as `std::out_of_range`, `std::invalid_argument`, are not handled.

Some of the crash-files:

1. [crash-493db74c3426e79b2bf0ffa75bb924503cb9acdc.zip](https://github.com/pytorch/pytorch/files/10237616/crash-493db74c3426e79b2bf0ffa75bb924503cb9acdc.zip) - crash source: schema_type_parser.cpp:272

2. [crash-67bb5d34ca48235687cc056e2cdeb2476b8f4aa5.zip](https://github.com/pytorch/pytorch/files/10237618/crash-67bb5d34ca48235687cc056e2cdeb2476b8f4aa5.zip) - crash source: schema_type_parser.cpp:240

3. [crash-0157bca5c41bffe112aa01f3b0f2099ca4bcc62f.zip](https://github.com/pytorch/pytorch/files/10307970/crash-0157bca5c41bffe112aa01f3b0f2099ca4bcc62f.zip) - crash source: schema_type_parser.cpp:179

4. [crash-430da923e56adb9569362efa7fa779921371b710.zip](https://github.com/pytorch/pytorch/files/10307972/crash-430da923e56adb9569362efa7fa779921371b710.zip) - crash source: schema_type_parser.cpp:196

The provided patch adds exception handlers for `std::invalid_argument` and `std::out_of_range`, to rethrow these exceptions with `ErrorReport`.

### How to reproduce

1. To reproduce the crash, use provided docker: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/blob/master/projects/pytorch/Dockerfile)

2. Build the container: `docker build -t oss-sydr-fuzz-pytorch-reproduce .`

3. Copy the crash file to the current directory

4. Run the container: ``docker run --privileged --network host -v `pwd`:/homedir --rm -it oss-sydr-fuzz-pytorch-reproduce /bin/bash``

5. Execute the binary: `/irparser_fuzz /homedir/crash-67bb5d34ca48235687cc056e2cdeb2476b8f4aa5`

After execution completes you will see this error message:

```txt
terminate called after throwing an instance of 'std::out_of_range'
  what():  stoll
```

And this stacktrace:

```asan
==9626== ERROR: libFuzzer: deadly signal
    #0 0x5b4cf1 in __sanitizer_print_stack_trace /llvm-project/compiler-rt/lib/asan/asan_stack.cpp:87:3
    #1 0x529627 in fuzzer::PrintStackTrace() /llvm-project/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:210:5
    #2 0x50f833 in fuzzer::Fuzzer::CrashCallback() /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:233:3
    #3 0x7ffff7c3741f  (/lib/x86_64-linux-gnu/libpthread.so.0+0x1441f)
    #4 0x7ffff7a5700a in raise (/lib/x86_64-linux-gnu/libc.so.6+0x4300a)
    #5 0x7ffff7a36858 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x22858)
    #6 0x7ffff7e74910  (/lib/x86_64-linux-gnu/libstdc++.so.6+0x9e910)
    #7 0x7ffff7e8038b  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa38b)
    #8 0x7ffff7e803f6 in std::terminate() (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa3f6)
    #9 0x7ffff7e806a8 in __cxa_throw (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa6a8)
    #10 0x7ffff7e7737d in std::__throw_out_of_range(char const*) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xa137d)
    #11 0xbd0579 in long long __gnu_cxx::__stoa<long long, long long, char, int>(long long (*)(char const*, char**, int), char const*, char const*, unsigned long*, int) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/ext/string_conversions.h:86:2
    #12 0xc10f9c in std::__cxx11::stoll(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long*, int) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:6572:12
    #13 0xc10f9c in torch::jit::SchemaTypeParser::parseRefinedTensor()::$_2::operator()() const::'lambda'()::operator()() const /pytorch_fuzz/torch/csrc/jit/frontend/schema_type_parser.cpp:240:25
    #14 0xc10f9c in void c10::function_ref<void ()>::callback_fn<torch::jit::SchemaTypeParser::parseRefinedTensor()::$_2::operator()() const::'lambda'()>(long) /pytorch_fuzz/c10/util/FunctionRef.h:43:12
    #15 0xbfbb27 in torch::jit::SchemaTypeParser::parseList(int, int, int, c10::function_ref<void ()>) /pytorch_fuzz/torch/csrc/jit/frontend/schema_type_parser.cpp:424:7
    #16 0xc0ef24 in torch::jit::SchemaTypeParser::parseRefinedTensor()::$_2::operator()() const /pytorch_fuzz/torch/csrc/jit/frontend/schema_type_parser.cpp:236:9
    #17 0xc0ef24 in void c10::function_ref<void ()>::callback_fn<torch::jit::SchemaTypeParser::parseRefinedTensor()::$_2>(long) /pytorch_fuzz/c10/util/FunctionRef.h:43:12
    #18 0xbfbb27 in torch::jit::SchemaTypeParser::parseList(int, int, int, c10::function_ref<void ()>) /pytorch_fuzz/torch/csrc/jit/frontend/schema_type_parser.cpp:424:7
    #19 0xbff590 in torch::jit::SchemaTypeParser::parseRefinedTensor() /pytorch_fuzz/torch/csrc/jit/frontend/schema_type_parser.cpp:209:3
    #20 0xc02992 in torch::jit::SchemaTypeParser::parseType() /pytorch_fuzz/torch/csrc/jit/frontend/schema_type_parser.cpp:362:13
    #21 0x9445642 in torch::jit::IRParser::parseVarWithType(bool) /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:111:35
    #22 0x944ff4c in torch::jit::IRParser::parseOperatorOutputs(std::vector<torch::jit::VarWithType, std::allocator<torch::jit::VarWithType> >*)::$_0::operator()() const /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:138:21
    #23 0x944ff4c in void std::__invoke_impl<void, torch::jit::IRParser::parseOperatorOutputs(std::vector<torch::jit::VarWithType, std::allocator<torch::jit::VarWithType> >*)::$_0&>(std::__invoke_other, torch::jit::IRParser::parseOperatorOutputs(std::vector<torch::jit::VarWithType, std::allocator<torch::jit::VarWithType> >*)::$_0&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14
    #24 0x94463a7 in torch::jit::IRParser::parseList(int, int, int, std::function<void ()> const&) /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:498:7
    #25 0x94460a5 in torch::jit::IRParser::parseOperatorOutputs(std::vector<torch::jit::VarWithType, std::allocator<torch::jit::VarWithType> >*) /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:137:3
    #26 0x944c1ce in torch::jit::IRParser::parseOperator(torch::jit::Block*) /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:384:3
    #27 0x944bf56 in torch::jit::IRParser::parseOperatorsList(torch::jit::Block*) /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:362:5
    #28 0x9444f5f in torch::jit::IRParser::parse() /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:482:3
    #29 0x94448df in torch::jit::parseIR(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::Graph*, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, torch::jit::Value*, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, torch::jit::Value*> > >&) /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:94:5
    #30 0x944526e in torch::jit::parseIR(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::Graph*) /pytorch_fuzz/torch/csrc/jit/ir/irparser.cpp:99:3
    #31 0x5e3ebd in LLVMFuzzerTestOneInput /irparser_fuzz.cc:43:5
    #32 0x510d61 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
    #33 0x4fac7c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
    #34 0x5009cb in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
    #35 0x529f62 in main /llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
    #36 0x7ffff7a38082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
    #37 0x4f559d in _start (/irparser_fuzz+0x4f559d)

```

Following these steps with the remaining crashes will give you almost the same results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94295
Approved by: https://github.com/davidberard98
2023-02-10 04:37:23 +00:00
d21a7e7193 Assert TensorBox produced by lowering and add [Note: Inductor IR] (#94361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94361
Approved by: https://github.com/jansel
2023-02-10 04:29:35 +00:00
01de5ddafc add mixed data type support for LayerNorm backward on CPU (#88064)
### Motivation
Amp provides convenience methods for mixed precision. If users use amp to run bfloat16 models, torch.autocast will keep module parameters in the accumulation dtype, which leaves gamma and beta in float while input/output are in bfloat16. The same goes for backward: parameters are in float, and X & dX & dY are in bfloat16.
Mixed data type support for LayerNorm backward is also needed for model training with LayerNorm.
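
A minimal sketch of the mixed-data-type case this covers: bfloat16 activations with float parameters, as produced by autocast.

```py
import torch
import torch.nn.functional as F

x = torch.randn(8, 16, dtype=torch.bfloat16, requires_grad=True)
weight = torch.randn(16, dtype=torch.float32, requires_grad=True)  # gamma stays float
bias = torch.randn(16, dtype=torch.float32, requires_grad=True)    # beta stays float

y = F.layer_norm(x, (16,), weight, bias)  # mixed-dtype forward
y.sum().backward()                        # mixed-dtype backward added here for CPU
```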

### Testing
Single socket (icx, 32cores):
| shape | fp32 forward (ms) | bf16 forward (ms) | mix forward (ms) | fp32 backward (ms) | bf16 backward (ms) | mix backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.012 | 0.012 | 0.071 | 0.065 | 0.062 |
| (8, 8, 16) | 0.015 | 0.014 | 0.015 | 0.074 | 0.070 | 0.063 |
| (32, 8, 16) | 0.062 | 0.016 | 0.016 | 0.073 | 0.073 | 0.072 |
| (64, 128, 56, 56) | 2.467 | 0.907 | 0.0897 | 12.993 | 7.603 | 7.777 |
| (64, 128, 256, 256) | 48.904 | 25.589 | 25.472 | 343.992 | 183.133 | 188.222 |

Single core(icx):
| shape | fp32 forward (ms) | bf16 forward (ms) | mix forward (ms) | fp32 backward (ms) | bf16 backward (ms) | mix backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.012 | 0.012 | 0.050 | 0.050 | 0.050 |
| (8, 8, 16) | 0.014 | 0.014 | 0.014 | 0.052 | 0.054 | 0.053 |
| (32, 8, 16) | 0.034 | 0.019 | 0.018 | 0.059 | 0.067 | 0.066 |
| (64, 128, 56, 56) | 66.791| 17.725 | 19.799 | 119.431 | 106.123 | 107.446 |
| (64, 128, 256, 256) | 1542.477 | 402.132 | 527.044 | 3019.437 | 2336.318 | 2448.320 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88064
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-02-10 03:10:14 +00:00
54fa980186 Dynamo Export use fake tensor (#94276)
This is a prerequisite for dynamo.export() to produce graphs with fine-grained dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94276
Approved by: https://github.com/voznesenskym
2023-02-10 01:59:58 +00:00
2af89e96ec Lower libtorch build parallelization to avoid OOM (#94548)
Memory usage increased after https://github.com/pytorch/pytorch/pull/88575. Docker crashes with exit code 137, which clearly means out of memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94548
Approved by: https://github.com/seemethere
2023-02-10 01:52:09 +00:00
544c04f2df Add uint8 support for interpolate for CPU images (#90771)
Joint work with @vfdev-5

This PR introduces native uint8 support for `interpolate()`, for the `bilinear` ~and `bicubic`~ modes for CPU images (`mode=nearest[_exact]` was already supported).

On a typical torchvision training job on ImageNet, the speedup are ~4X when AVX2 is supported, comparing the uint8 native (this PR) vs torchvision's current `Resize()`:

```
AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   4X    2.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   2.1X  1.3ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   3X    2.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   4X    2.4ms vs 0.6ms

(Note: we removed bicubic support for now)
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   4X    2.9ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   5X    3.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   3X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   4X    2.8ms vs 0.7ms

```

There is still room for further speed-ups (see TODOs in the code).
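
A small usage sketch of the native path: the uint8 image goes straight through `interpolate()` instead of the float round trip above.

```py
import torch
import torch.nn.functional as F

img = torch.randint(0, 256, (1, 3, 270, 268), dtype=torch.uint8)
img = img.contiguous(memory_format=torch.channels_last)

out = F.interpolate(img, size=(224, 224), mode="bilinear",
                    align_corners=False, antialias=True)
print(out.dtype)  # torch.uint8, no intermediate float tensor needed
```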

#### More benchmark details

with AVX2 support - speedups typically range from 1.5X to 10X. A few edge-cases are slower, worth investigating why.

<details>

```
AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=1   5X    1.1ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=1   5X    1.2ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=1   2.8X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=1   7X    1.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=1   5X    1.2ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=1   12X   2.9ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=1   3X    0.8ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=1   7X    1.8ms vs 0.2ms

(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=2   2.6X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=2   2.8X  0.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=2   1.7X  0.4ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=2   1.4X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=2   2.7X  0.7ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=2   7X    1.6ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=2   1.8X  0.4ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=2   4X    1.0ms vs 0.2ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=1   4X    2.5ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=1   3.0X  1.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=1   3X    1.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=1   4X    2.3ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=1   4X    2.7ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=1   7X    4.3ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=1   3X    2.1ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=1   4X    2.6ms vs 0.6ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=2   2.7X  1.6ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=2   2.6X  1.5ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=2   2.1X  1.2ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=2   1.6X  0.9ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=2   2.8X  1.7ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=2   5X    2.8ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=2   2.3X  1.4ms vs 0.6ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=2   3X    1.9ms vs 0.6ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=1   4X    26.6ms vs 6.7ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=1   4X    23.9ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=1   2.5X  16.8ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=1   5X    33.1ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=1   4X    25.9ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=1   8X    59.6ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=1   1.9X  14.3ms vs 7.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=1   5X    35.4ms vs 7.3ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=2   2.0X  13.6ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=2   2.2X  14.8ms vs 6.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=2   1.3X  8.8ms vs 6.9ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=2   1.2X  8.4ms vs 6.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=2   1.8X  12.8ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=2   4X    32.1ms vs 7.2ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=2   1.4X  10.1ms vs 7.3ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=2   2.9X  20.9ms vs 7.3ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=1   1.4X  0.5ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=1   0.7X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=1   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=1   1.4X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=1   2.1X  0.7ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=1   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=1   1.9X  0.6ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=1   1.0X  0.3ms vs 0.3ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=2   1.0X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=2   0.6X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=2   0.8X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=2   1.4X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=2   1.4X  0.5ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=2   1.2X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=2   1.2X  0.4ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=2   0.9X  0.3ms vs 0.3ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   4X    2.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   2.1X  1.3ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   3X    2.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   4X    2.4ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   4X    2.9ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   5X    3.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   3X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   4X    2.8ms vs 0.7ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=2   1.5X  1.0ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=2   1.2X  0.8ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=2   2.3X  1.5ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=2   1.9X  1.2ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=2   1.6X  1.2ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=2   4X    2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=2   2.4X  1.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=2   2.8X  1.8ms vs 0.6ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=1   2.1X  12.8ms vs 6.1ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=1   0.6X  3.8ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=1   1.2X  7.1ms vs 6.1ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=1   1.9X  11.0ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=1   2.0X  12.6ms vs 6.4ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=1   1.0X  6.1ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=1   1.8X  11.3ms vs 6.4ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=1   0.8X  4.6ms vs 6.0ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=2   1.6X  9.3ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=2   0.3X  2.0ms vs 5.8ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=2   1.2X  7.2ms vs 6.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=2   0.3X  1.6ms vs 5.8ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=2   1.1X  7.1ms vs 6.5ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=2   0.6X  3.3ms vs 5.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=2   0.9X  5.9ms vs 6.3ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=2   0.4X  2.4ms vs 5.9ms
```

</details>

without AVX2 support - no significant speed-up, but there are various possible improvements (see TODOs)

<details>

```
AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size         output_size channels_last AA    mode       num_threads  speed-up float vs uint8 (this PR)
(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=1   0.9X  1.5ms vs 1.6ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=1   0.9X  1.5ms vs 1.6ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=1   0.8X  0.9ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=1   1.5X  1.7ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=1   0.9X  1.6ms vs 1.8ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=1   2.1X  3.9ms vs 1.9ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=1   0.8X  1.1ms vs 1.4ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=1   1.7X  2.4ms vs 1.5ms

(1, 3, 64, 64) -> (224, 224)       True    True    bilinear   num_threads=2   0.9X  0.8ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       True    False   bilinear   num_threads=2   0.9X  0.8ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       False   True    bilinear   num_threads=2   0.9X  0.5ms vs 0.6ms
(1, 3, 64, 64) -> (224, 224)       False   False   bilinear   num_threads=2   0.7X  0.5ms vs 0.7ms
(1, 3, 64, 64) -> (224, 224)       True    True    bicubic    num_threads=2   0.9X  0.9ms vs 1.0ms
(1, 3, 64, 64) -> (224, 224)       True    False   bicubic    num_threads=2   2.1X  2.0ms vs 1.0ms
(1, 3, 64, 64) -> (224, 224)       False   True    bicubic    num_threads=2   0.8X  0.6ms vs 0.8ms
(1, 3, 64, 64) -> (224, 224)       False   False   bicubic    num_threads=2   1.7X  1.3ms vs 0.8ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=1   1.0X  3.0ms vs 3.0ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=1   1.0X  2.8ms vs 2.9ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=1   1.0X  2.3ms vs 2.2ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=1   1.4X  3.3ms vs 2.3ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=1   1.0X  3.5ms vs 3.5ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=1   1.7X  6.1ms vs 3.5ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=1   0.9X  2.6ms vs 2.9ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=1   1.4X  4.2ms vs 2.9ms

(1, 3, 224, 224) -> (270, 268)     True    True    bilinear   num_threads=2   1.0X  1.7ms vs 1.7ms
(1, 3, 224, 224) -> (270, 268)     True    False   bilinear   num_threads=2   0.9X  1.6ms vs 1.8ms
(1, 3, 224, 224) -> (270, 268)     False   True    bilinear   num_threads=2   0.9X  1.3ms vs 1.4ms
(1, 3, 224, 224) -> (270, 268)     False   False   bilinear   num_threads=2   0.7X  1.1ms vs 1.6ms
(1, 3, 224, 224) -> (270, 268)     True    True    bicubic    num_threads=2   1.0X  2.0ms vs 2.0ms
(1, 3, 224, 224) -> (270, 268)     True    False   bicubic    num_threads=2   1.7X  3.2ms vs 1.9ms
(1, 3, 224, 224) -> (270, 268)     False   True    bicubic    num_threads=2   0.8X  1.5ms vs 1.9ms
(1, 3, 224, 224) -> (270, 268)     False   False   bicubic    num_threads=2   1.2X  2.3ms vs 1.9ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=1   1.1X  34.7ms vs 32.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=1   1.0X  31.2ms vs 32.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=1   1.0X  23.5ms vs 22.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=1   1.9X  42.5ms vs 22.7ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=1   0.9X  33.9ms vs 37.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=1   2.2X  84.0ms vs 37.5ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=1   1.0X  28.4ms vs 28.8ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=1   2.0X  56.7ms vs 28.8ms

(1, 3, 256, 256) -> (1024, 1024)   True    True    bilinear   num_threads=2   1.1X  17.5ms vs 16.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bilinear   num_threads=2   1.1X  17.7ms vs 16.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bilinear   num_threads=2   0.8X  8.8ms vs 11.4ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bilinear   num_threads=2   1.0X  11.1ms vs 11.4ms
(1, 3, 256, 256) -> (1024, 1024)   True    True    bicubic    num_threads=2   1.1X  19.9ms vs 18.8ms
(1, 3, 256, 256) -> (1024, 1024)   True    False   bicubic    num_threads=2   2.3X  42.5ms vs 18.7ms
(1, 3, 256, 256) -> (1024, 1024)   False   True    bicubic    num_threads=2   1.0X  14.1ms vs 14.5ms
(1, 3, 256, 256) -> (1024, 1024)   False   False   bicubic    num_threads=2   2.0X  28.4ms vs 14.5ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=1   1.0X  0.6ms vs 0.6ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=1   0.7X  0.3ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=1   0.9X  0.5ms vs 0.6ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=1   1.7X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=1   1.0X  0.8ms vs 0.8ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=1   1.1X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=1   0.9X  0.7ms vs 0.8ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=1   0.9X  0.4ms vs 0.4ms

(1, 3, 224, 224) -> (64, 64)       True    True    bilinear   num_threads=2   1.0X  0.4ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)       True    False   bilinear   num_threads=2   0.8X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bilinear   num_threads=2   0.9X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   False   bilinear   num_threads=2   1.3X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (64, 64)       True    True    bicubic    num_threads=2   1.0X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       True    False   bicubic    num_threads=2   1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)       False   True    bicubic    num_threads=2   0.9X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)       False   False   bicubic    num_threads=2   1.2X  0.3ms vs 0.3ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=1   0.8X  2.1ms vs 2.5ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=1   0.7X  1.6ms vs 2.4ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=1   1.2X  2.4ms vs 2.1ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=1   1.3X  2.6ms vs 2.0ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=1   1.1X  3.4ms vs 3.0ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=1   1.7X  4.8ms vs 2.8ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=1   1.1X  2.9ms vs 2.7ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=1   1.4X  3.5ms vs 2.4ms

(1, 3, 270, 268) -> (224, 224)     True    True    bilinear   num_threads=2   0.9X  1.2ms vs 1.3ms
(1, 3, 270, 268) -> (224, 224)     True    False   bilinear   num_threads=2   1.3X  1.6ms vs 1.2ms
(1, 3, 270, 268) -> (224, 224)     False   True    bilinear   num_threads=2   0.8X  0.9ms vs 1.1ms
(1, 3, 270, 268) -> (224, 224)     False   False   bilinear   num_threads=2   1.3X  1.3ms vs 1.0ms
(1, 3, 270, 268) -> (224, 224)     True    True    bicubic    num_threads=2   1.4X  2.2ms vs 1.6ms
(1, 3, 270, 268) -> (224, 224)     True    False   bicubic    num_threads=2   1.9X  2.8ms vs 1.5ms
(1, 3, 270, 268) -> (224, 224)     False   True    bicubic    num_threads=2   0.8X  1.1ms vs 1.4ms
(1, 3, 270, 268) -> (224, 224)     False   False   bicubic    num_threads=2   1.7X  2.1ms vs 1.3ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=1   1.0X  10.0ms vs 9.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=1   0.7X  4.6ms vs 6.2ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=1   0.9X  9.1ms vs 9.8ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=1   1.7X  9.4ms vs 5.7ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=1   1.0X  15.2ms vs 14.8ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=1   1.0X  7.6ms vs 7.5ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=1   0.9X  13.3ms vs 14.4ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=1   0.8X  5.9ms vs 7.0ms

(1, 3, 1024, 1024) -> (256, 256)   True    True    bilinear   num_threads=2   1.2X  6.0ms vs 5.2ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bilinear   num_threads=2   0.7X  2.3ms vs 3.2ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bilinear   num_threads=2   1.0X  4.8ms vs 5.0ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bilinear   num_threads=2   0.7X  1.9ms vs 2.9ms
(1, 3, 1024, 1024) -> (256, 256)   True    True    bicubic    num_threads=2   1.6X  12.3ms vs 7.5ms
(1, 3, 1024, 1024) -> (256, 256)   True    False   bicubic    num_threads=2   1.0X  3.9ms vs 3.9ms
(1, 3, 1024, 1024) -> (256, 256)   False   True    bicubic    num_threads=2   1.0X  7.0ms vs 7.3ms
(1, 3, 1024, 1024) -> (256, 256)   False   False   bicubic    num_threads=2   0.9X  3.0ms vs 3.5ms

```

</details>

Benchmark code
<details>

```py
import operator_benchmark as op_bench
import torch

"""Microbenchmarks for interpolate operator."""

class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last=False, mode='linear', antialias=False, dtype=torch.float):

        input_image = torch.randint(0, 256, size=input_size, dtype=torch.uint8, device='cpu')

        if channels_last:
            input_image = input_image.contiguous(memory_format=torch.channels_last)

        self.inputs = {
            "input_image": input_image,
            "output_size": output_size,
            "mode": mode,
            "antialias": antialias,
            "dtype":dtype,
        }

        self.set_module_name("interpolate")

    def forward(self, input_image, output_size, mode, antialias, dtype):
        if dtype == torch.float:
            input_image = input_image.float()

        out = torch.nn.functional.interpolate(input_image, size=output_size, mode=mode, align_corners=False, antialias=antialias)
        if dtype == torch.float:
            out = out.round().clamp(min=0, max=256).to(torch.uint8)

def make_config():
    sizes = (
        ((224, 224), (64, 64)),
        ((270, 268), (224, 224)),
        ((256, 256), (1024, 1024)),
    )

    attrs = []
    for (HW1, HW2) in sizes:
        attrs.append([(1, 3, *HW1), HW2])  # 3 channels
        # attrs.append([(1, 1, *HW1), HW2])  # 1 channel

        attrs.append([(1, 3, *HW2), HW1])  # 3 channels
        # attrs.append([(1, 1, *HW2), HW1])  # 1 channel

    config = op_bench.config_list(
        attr_names=["input_size", "output_size"],
        attrs=attrs,
        cross_product_configs={
            'channels_last': [True, False],
            'mode': ["bilinear", "bicubic"],
            'antialias': [True, False],
            # 'dtype': [torch.float, torch.uint8]
            # 'dtype': [torch.uint8]
            'dtype': [torch.float]
        },
        tags=["short"],
    )

    return config

config = make_config()
op_bench.generate_pt_test(config, InterpolateBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()

```

```py
import re
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("f1", nargs="?", default="main")
parser.add_argument("f2", nargs="?", default="new")
args = parser.parse_args()

with open(args.f1) as f:
    main = f.readlines()
with open(args.f2) as f:
    new = f.readlines()

out = []

for main_line, new_line in zip(main, new):
    # num_threads=1  # TODO: remove
    if main_line.startswith("num_threads="):
        num_threads = int(main_line.split("=")[-1])
    if main_line.startswith("# Input"):
        deets = f"{main_line.strip()}, {num_threads=}"
    if main_line.startswith("Forward"):
        main_time = float(main_line.split()[-1])
        new_time = float(new_line.split()[-1])
        ratio = main_time / new_time
        fmt = ".1f" if ratio < 3 else ".0f"
        improv = f"{ratio:{fmt}}X"
        time_fmt = ",.3f" if new_time < 100 else ",.1f"
        deets = deets.strip().replace("# Input: ", "")
        deets = deets.replace(": ", "=")
        deets = deets.replace("input_size=", "")
        deets = deets.replace(", output_size=", " -> ")
        deets = deets.replace("dtype=torch.", "")
        deets = deets.replace("mode=", "")
        deets = deets.replace("antialias=", "")
        deets = deets.replace("channels_last=", "")
        # deets = deets.replace("channels_last=True, ", "")
        split = deets.split(",")

        # size = ','.join(split[:-3])
        # mode, dtype, threads = split[-3:]
        # deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}"

        size = ','.join(split[:-5])
        channels_last, mode, antialias, dtype, threads= split[-5:]
        deets = f"{size:<33} {channels_last:<7} {antialias:<7} {mode:<10} {threads:<15}"

        l = f"{deets}  {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms"
        out.append(l)

def key(s):
    # s = ''.join(s.split()[1:]) # remove "N.nX" part
    num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),)

    input_shape, output_shape = re.findall(r"\(.*?\)", s)
    input_shape = input_shape[1:-1]  # remove parenthesis
    input_HW = tuple(int(x) for x in input_shape.split(",")[-2:])
    input_C = (-int(input_shape.split(",")[1]),)

    output_HW = tuple(int(x) for x in output_shape[1:-1].split(","))
    is_downsample = (output_HW[0] < input_HW[0],)
    if "linear" in s:
        mode = "linear"
    elif "nearest-exact" in s:
        mode = "nearest-exact"
    else:
        # assert "nearest" in s
        mode = "nearest"
    mode = (mode,)
    return is_downsample + input_HW + output_HW + num_threads + input_C + mode

for i, l in enumerate(sorted(out, key=key)):
    if i % 8 == 0:
        print()
    # if i % 10 == 0 and i % 40 != 0:
    #     print()
    # if i % 40 == 0:
    #     print("-" * 100)
    print(l)

```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90771
Approved by: https://github.com/peterbell10, https://github.com/ngimel
2023-02-10 01:43:54 +00:00
782e4f5c02 [quant] Add quantize and dequantize operators to decomposition table (#93312)
Summary:
This PR decomposes the operators in the torch.ops.quantized_decomposed namespace into more
primitive aten operators. This frees us from maintaining the semantics of the quantize/dequantize
operators, which can be expressed more precisely in terms of the underlying aten operators.

Note: this PR just adds them to the decomposition table; we haven't enabled this by default yet.
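
A rough sketch (not the exact registered decompositions) of how the per-tensor quantize/dequantize semantics can be written with plain aten ops:

```py
import torch

def quantize_per_tensor(x, scale, zero_point, quant_min, quant_max, dtype):
    q = torch.round(x * (1.0 / scale)) + zero_point
    return torch.clamp(q, quant_min, quant_max).to(dtype)

def dequantize_per_tensor(q, scale, zero_point):
    return (q.to(torch.float32) - zero_point) * scale

x = torch.randn(4)
q = quantize_per_tensor(x, 0.05, 128, 0, 255, torch.uint8)
x_hat = dequantize_per_tensor(q, 0.05, 128)
```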

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_q_dq_decomposition

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93312
Approved by: https://github.com/vkuzo, https://github.com/SherlockNoMad
2023-02-10 01:40:12 +00:00
df13247e67 small bugfixes to release notes script (#94536)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94536
Approved by: https://github.com/drisspg
2023-02-10 01:23:07 +00:00
93ee1bf168 [inductor] Fix a conv stride assertion (#94405)
Summary: The issue appears when _inductor.config.tune_layout is set. If
we pick a different aten convolution memory format, we need to transform
its input layout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94405
Approved by: https://github.com/jansel
2023-02-10 01:02:04 +00:00
f5ccbc1704 Ignore 7z locked usage log error on Windows non-ephemeral runners (#94483)
This is the second time I have spotted this error on the new Windows non-ephemeral runners, so let's get it fixed.

The error https://github.com/pytorch/pytorch/actions/runs/4130018165/jobs/7136942722 was during 7z-ing the usage log artifact on the runners:

```
WARNING: The process cannot access the file because it is being used by another process.
usage_log.txt
```

The locking process is probably the monitoring script. This looks very similar to the issue on MacOS pet runners in which the monitoring script is sometimes not killed.

I could try to kill the process to unlock the file.  But then not being able to upload the usage log here is arguably ok too.  So I think it would be easier to just ignore the locked file and move on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94483
Approved by: https://github.com/clee2000
2023-02-10 00:58:36 +00:00
016f0b2f62 [MPS] Calculate nonzero count inside nonzero op (#94442)
Calculate the nonzero count directly in the nonzero op.
Additionally, synchronize before entering the nonzero op to make sure all previous operations have finished (the output shape is allocated based on the count_nonzero result).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94442
Approved by: https://github.com/kulinseth
2023-02-10 00:53:52 +00:00
4c6a7faec5 [Profiler] Use RAII wrapper to manage refcounts during python tracer startup. (#91646)
Refcounting is hard. (Citation needed.) https://github.com/pytorch/pytorch/pull/81242 introduced a corner case where we would over incref when breaking out due to max (128) depth. https://github.com/pytorch/pytorch/pull/85847 ostensibly fixed a segfault, but in actuality was over incref-ing because PyEval_GetFrame returns a borrowed reference while `PyFrame_GetBack` returns a strong reference.

Instead of squinting really hard at the loops, it's much better to use the RAII wrapper and do the right thing by default.

I noticed the over incref issue because of a memory leak where Tensors captured by the closure of a function would be kept alive by zombie frames.

Differential Revision: [D42184394](https://our.internmc.facebook.com/intern/diff/D42184394/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91646
Approved by: https://github.com/albanD
2023-02-10 00:28:18 +00:00
336d9354d6 [MPS] Enable index add for TestConsistency (#94356)
Enable index_add TestConsistency TestCase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94356
Approved by: https://github.com/kulinseth
2023-02-10 00:21:11 +00:00
299ada9cff [MPS] Add the floor_divide fixes. (#94488)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94488
Approved by: https://github.com/razarmehr
2023-02-10 00:10:08 +00:00
93d7d546ff Fix saved tensor hooks to propagate errors back to Python as-is (#94456)
Mitigates the effect of https://github.com/pytorch/pytorch/issues/34172 for saved tensor hooks

BC Breaking message:
- Exceptions raised inside the pack and unpack hooks are no longer erroneously converted to RuntimeErrors. You should update your code to handle the original type of exception raised.
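
A minimal sketch of the new behavior: an exception raised in a pack hook now surfaces with its original type.

```py
import torch

def pack(tensor):
    raise ValueError("boom")   # previously this re-surfaced as a RuntimeError

def unpack(packed):
    return packed

x = torch.randn(3, requires_grad=True)
try:
    with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
        y = (x * x).sum()      # saving x for backward triggers the pack hook
except ValueError as e:
    print("caught the original exception type:", e)
```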

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94456
Approved by: https://github.com/albanD
2023-02-09 23:52:06 +00:00
2a5851735a Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)
Summary: It looks like setting torch.backends.cudnn.deterministic to
True is not enough for eliminating non-determinism when testing
benchmarks with --accuracy, so let's turn off cudnn completely.
With this change, mobilenet_v3_large does not show random failure on my
local environment. Also take this chance to clean up CI skip lists.
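
For reference, the two settings discussed above:

```py
import torch

torch.backends.cudnn.deterministic = True  # not sufficient on its own
torch.backends.cudnn.enabled = False       # what the accuracy runs switch to
```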

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94363
Approved by: https://github.com/ezyang
2023-02-09 23:43:13 +00:00
79ed6b246c Mark ROCm trunk job as unstable (#94550)
Failing to access AMD apt repo 09598b603f

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94550
Approved by: https://github.com/clee2000
2023-02-09 23:20:00 +00:00
2394e6baa9 [quant][fx] Change prepare_fx and convert_fx to preserve the GraphModule type of input (#94412)
Summary:
Previously, prepare_fx returned an ObservedGraphModule and convert_fx returned a QuantizedGraphModule
in order to preserve attributes that torch.fx.GraphModule did not preserve. After https://github.com/pytorch/pytorch/pull/92062,
`model.meta` is preserved, so we can now store those attributes in model.meta instead.

With this, we don't need to create a new type of GraphModule in these functions and can use GraphModule directly. This
is useful for quantization in the PyTorch 2.0 flow: if other transformations also use GraphModule, the quantization passes will be composable with them.

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFxModels
python test/test_quantization.py TestQuantizePT2E

Imported from OSS

Differential Revision: D42979722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94412
Approved by: https://github.com/vkuzo
2023-02-09 23:03:23 +00:00
09598b603f [dtensor] update readme for prototype release (#94517)
This PR updates the README for the prototype release, removes some code
that is not available yet, and uses the code paths that work.

Also renames to DTensor in most sentences.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94517
Approved by: https://github.com/fegin
2023-02-09 22:35:26 +00:00
66bfcd32fd [ROCm] Remove PYTORCH_MIOPEN_SUGGEST_NHWC flag (#90725)
Fixes #64427.  MIOpen supports ChannelsLast.  No longer need to opt-in with env var.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90725
Approved by: https://github.com/malfet
2023-02-09 22:26:24 +00:00
c1e2704656 ao migration: fix broken import, try 2 (#94458)
Summary:

https://github.com/pytorch/pytorch/pull/94170 broke some Meta-only tests because it broke the following syntax:

```
import torch.nn.intrinsic

_ = torch.nn.intrinsic.quantized.dynamic.*
```

This broke with the name change because the `ao` folder is currently doing lazy import loading, but the original folders are not.

For now, just unbreak the folders needed for the tests to pass. We will follow up with ensuring this doesn't break for other folders in a future PR.

Test plan:

```
python test/test_quantization.py -k AOMigrationNNIntrinsic.test_modules_no_import_nn_intrinsic_quantized_dynamic
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94458
Approved by: https://github.com/jerryzh168
2023-02-09 22:20:01 +00:00
bebe58bd71 [DCP] Set single_file_per_rank default to True (#94501)
The default behavior of FileSystemWriter should produce one file per rank instead of one file per tensor/blob.
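
A hedged usage sketch (the checkpoint path is illustrative; the keyword follows the description above):

```python
from torch.distributed.checkpoint import FileSystemWriter

# New default: one file per rank. Pass False to get the old one-file-per-tensor layout.
writer = FileSystemWriter("/tmp/checkpoint", single_file_per_rank=True)
```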
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94501
Approved by: https://github.com/fegin
2023-02-09 21:45:31 +00:00
54b7c7d5e9 Added requested_bytes to CUDA Caching Allocator Stats (#88575)
Summary:
The caching allocator can be configured to round memory allocations in order to reduce fragmentation. Sometimes however, the overhead from rounding can be higher than the fragmentation it helps reduce.

We have added a new stat to CUDA caching allocator stats to help track if rounding is adding too much overhead and help tune the roundup_power2_divisions flag:
    - "requested_bytes.{current,peak,allocated,freed}": memory requested by client code, compare this with allocated_bytes to check if allocation rounding adds too much overhead

Test Plan: Added test case in caffe2/test/test_cuda.py

Differential Revision: D40810674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88575
Approved by: https://github.com/zdevito
2023-02-09 21:37:25 +00:00
dddc0b41db [ROCm] centos update endpoint repo and fix sudo (#92034)
* Update ROCm centos Dockerfile
* Update install_user.sh for centos sudo issue

Fixes the ROCm CentOS Dockerfile, because the https://packages.endpoint.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm file is no longer accessible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92034
Approved by: https://github.com/malfet
2023-02-09 21:30:58 +00:00
dd315e5c06 Dynamo: Support ConstantVariable (comparison_op) SymNodeVariable (#94519)
Expands the generic compare logic to handle SymNodeVariables on the right side of the expression.
Also adds support for `>=`, which it appears was mistakenly left out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94519
Approved by: https://github.com/jansel
2023-02-09 21:17:17 +00:00
88e16849db [pt2] Fix multiple races in log folder (#93407)
Summary:
There are a few races/permission errors in file creation; fixing them here.
OSS:
1. caffe2/torch/_dynamo/utils.py, get_debug_dir: multiple processes may conflict on the directory even though it uses microsecond timestamps. Add the pid to it.
2. caffe2/torch/_dynamo/config.py: it may not be a correct assumption that we have write permission to the cwd.

Test Plan: sandcastle

Differential Revision: D42905908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93407
Approved by: https://github.com/soumith, https://github.com/mlazos
2023-02-09 21:10:14 +00:00
444829fa21 [nn] Remove deprecated torch.nn.utils._stateless (#94498)
Follows https://github.com/pytorch/pytorch/pull/92536#discussion_r1097578900. It has been 10 months since `torch.nn.utils._stateless` was marked as deprecated.

This PR also changes `tie_weights` in `_reparametrize_module` to a keyword-only argument, since it is a private API and was only imported by `torch.nn.utils._stateless` (now removed).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94498
Approved by: https://github.com/jbschlosser
2023-02-09 20:53:40 +00:00
f45c196653 Update backend config to be under _World (#94191)
All the c10d process group state is under `_World`, so this is BE work to include a missing map
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94191
Approved by: https://github.com/kumpera
2023-02-09 20:48:42 +00:00
98d3612e48 [Profiler] Enable SOFT_ASSERT to log Invariant Violation to Kineto (#92872)
Summary: Record the Soft assert to Kineto.

Test Plan: Internal CI Tests.

Differential Revision: D42219145

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92872
Approved by: https://github.com/robieta
2023-02-09 20:36:25 +00:00
92620aface [DCP]Update optimizer.py docstring (#94379)
Update load_sharded_optimizer_state_dict() docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94379
Approved by: https://github.com/fduwjj
2023-02-09 20:24:28 +00:00
760836f738 Add back in registration (#94452)
Summary: Need to re-register the underscored function in order to have the op present in predictor. This is because older models have been exported with the underscored version.

Test Plan: See if predictor tests pass?

Reviewed By: cpuhrsch

Differential Revision: D43138338

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94452
Approved by: https://github.com/cpuhrsch
2023-02-09 20:18:19 +00:00
a229b4526f [BE] Prefer dash over underscore in command-line options (#94505)
Prefer dashes over underscores in command-line options. Add `--command-arg-name` to the argument parser; the old underscore arguments (`--command_arg_name`) are kept for backward compatibility.

Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in their arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:

`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)

```python
class BooleanOptionalAction(Action):
    def __init__(...):
            if option_string.startswith('--'):
                option_string = '--no-' + option_string[2:]
                _option_strings.append(option_string)
```

It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the shift (or caps-lock) key, unlike `-`.
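
A minimal sketch of the pattern (the argument name is the placeholder used above):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--command-arg-name",   # preferred, dashed spelling
    "--command_arg_name",   # legacy underscore spelling kept for backward compatibility
    dest="command_arg_name",
    default=None,
)

args = parser.parse_args(["--command-arg-name", "value"])
print(args.command_arg_name)  # both spellings parse into the same destination
```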

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-09 20:16:49 +00:00
a63524684d [ONNX] Add col2im for opset 18 (#84594)
Opset 18 will be used to introduce support for ONNX's Col2Im-18 and resolve https://github.com/pytorch/pytorch/issues/84408

Depends: https://github.com/pytorch/pytorch/pull/83201 (CI will fail until ONNX submodule is updated)

As per Faith's recommendation, this PR should be merged only after ORT 1.13.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84594
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms, https://github.com/abock, https://github.com/BowenBao
2023-02-09 19:54:42 +00:00
ea98ba02e2 Prevent duplicate symbol for dsa_add_new_assertion_failure (#94064)
`dsa_add_new_assertion_failure` is currently causing duplicate definition issues. Possible solutions:
1. Put the device code in a .cu file - requires device linking, which would be very painful to set up.
2. Inline the code - could cause bloat, especially since a function might include many DSAs.
3. Anonymous namespace - balances the above two. Putting the code in a .cu file would ensure that there's a single copy of the function, but it's hard to set up. Inlining the code would cause bloat. An anonymous namespace is easy to set up and produces a single copy of the function per translation unit, which allows the function to be called many times without bloat.

Differential Revision: D42998295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94064
Approved by: https://github.com/ezyang
2023-02-09 19:47:36 +00:00
6007874bbb Revert "teach inductor to handle floor (#94341)"
This reverts commit e7df9aaec83648445f6cae3412b5b4038fbbe400.

Reverted https://github.com/pytorch/pytorch/pull/94341 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but the CudaTest failure looks related.  It fails on both PR and trunk e7df9aaec8
2023-02-09 19:31:08 +00:00
f35f12320a [MPS] Fixes for arange_mps for empty tensor. (#94485)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94485
Approved by: https://github.com/razarmehr
2023-02-09 19:30:17 +00:00
105f7205bd [MPS] Fix and unblock TestConsistency for median (#94489)
- fix num_output_dims calculation
- fix median_out_mps key
- cast tensor sent to sortWithTensor and argSortWithTensor
- note down same issue for unique
- unblock median from blocklist
- adding test_median_int16 test

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94489
Approved by: https://github.com/razarmehr
2023-02-09 19:29:07 +00:00
69e0bda999 [BE] Import Literal, Protocol, and Final from standard library typing as of Python 3.8+ (#94490)
Changes:

1. `typing_extensions -> typing-extensions` in dependencies. Use a dash rather than an underscore to fit the [PEP 503: Normalized Names](https://peps.python.org/pep-0503/#normalized-names) convention.

```python
import re

def normalize(name):
    return re.sub(r"[-_.]+", "-", name).lower()
```

2. Import `Literal`, `Protocol`, and `Final` from the standard library `typing` module as of Python 3.8+.
3. Replace `Union[Literal[XXX], Literal[YYY]]` with `Literal[XXX, YYY]`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94490
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-09 19:17:49 +00:00
527b646f4b Refactor to extract label_utils from export_pytorch_labels (#94179)
Part of fixing #88098

## Context

This is PR 1 of 3 to address issue 88098 (moving the label check failure logic from the `check_labels.py` workflow to the `trymerge.py` mergebot). Due to the messy cross-script imports and potential circular dependencies, some refactoring of the scripts is required before the functional PR can be cleanly implemented.

## What Changed
1. Extract label utils functions from the `export_pytorch_labels.py` script into a `label_utils.py` module.
2. Small improvements to naming, interface and test coverage

## Note to Reviewers
This series of PRs is to replace the original PR https://github.com/pytorch/pytorch/pull/92682 to make the changes more modular and easier to review.

* 1st PR: this one
* 2nd PR: https://github.com/Goldspear/pytorch/pull/2
* 3rd PR: https://github.com/Goldspear/pytorch/pull/3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94179
Approved by: https://github.com/ZainRizvi
2023-02-09 19:17:05 +00:00
4f691d2e2f [MPS] Fix correctness issue with fill_scalar_mps() (#94479)
- The `self` tensor was not contiguous, and in-place filling produced wrong results
- Added a test case for the issue

Fixes the zero_like() issue reported in #94190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94479
Approved by: https://github.com/DenisVieriu97, https://github.com/kulinseth
2023-02-09 19:07:13 +00:00
75545798c6 test_inductor test.sh fix (#92833)
The inductor/test_torchinductor suite is not running as part of CI. I have traced this down to a bug in the arguments supplied in test/run_test.py.

Currently test_inductor runs the test suites as:
`PYTORCH_TEST_WITH_INDUCTOR=0 python test/run_test.py --include inductor/test_torchinductor --include inductor/test_torchinductor_opinfo --verbose`

This will only kick off the test_torchinductor_opinfo suite.

Example from CI logs: https://github.com/pytorch/pytorch/actions/runs/3926246136/jobs/6711985831#step:10:45089
```
+ PYTORCH_TEST_WITH_INDUCTOR=0
+ python test/run_test.py --include inductor/test_torchinductor --include inductor/test_torchinductor_opinfo --verbose
Ignoring disabled issues:  []
/var/lib/jenkins/workspace/test/run_test.py:1193: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if torch.version.cuda is not None and LooseVersion(torch.version.cuda) >= "11.6":
Selected tests:
 inductor/test_torchinductor_opinfo
Prioritized test from test file changes.
reordering tests for PR:
prioritized: []
the rest: ['inductor/test_torchinductor_opinfo']
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92833
Approved by: https://github.com/seemethere
2023-02-09 18:51:25 +00:00
81853354c3 added aten.log_normal_ decomp (#91674)
Fixes #91275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91674
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano
2023-02-09 18:34:25 +00:00
b2ea1d06aa Collective dispatching from Process Group (#91257)
Fixes https://github.com/pytorch/pytorch/issues/90932
Fixes https://github.com/pytorch/pytorch/issues/90659

Remove redundant collection operation definitions by calling the ops directly from `ProcessGroup`

Context:
https://github.com/pytorch/pytorch/issues/86225

Differential Revision: [D42854676](https://our.internmc.facebook.com/intern/diff/D42854676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91257
Approved by: https://github.com/kwen2501
2023-02-09 18:31:28 +00:00
31c30134bb [MPS] Raise error for Conv3D as currently we don't have support. (#94492)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94492
Approved by: https://github.com/razarmehr
2023-02-09 18:28:11 +00:00
1dd6c8176c Doc Fix: Update _symbolic_trace.py (#94510)
Use `::` to activate the code block. Currently the code below is not rendered as code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94510
Approved by: https://github.com/H-Huang
2023-02-09 18:11:09 +00:00
490c8f67c5 Revert "WIP: don't call floor for symint unless necessary (#94365)"
This reverts commit 8a9ea44985725e57cb82f0d978fafae31577ae6d.

Reverted https://github.com/pytorch/pytorch/pull/94365 on behalf of https://github.com/ZainRizvi due to This looks like it caused some inductor test to start failing: 8a9ea44985
2023-02-09 17:42:23 +00:00
e7df9aaec8 teach inductor to handle floor (#94341)
Per title; this happens when there is upsampling with a non-integer scale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94341
Approved by: https://github.com/ezyang
2023-02-09 17:09:35 +00:00
685108b201 [docs] Fix incorrect wrapping of function (#94446)
The sample code in the documentation wraps the function decorator incorrectly. To fix this, the attributes of `func` are updated based on `torch_function`.

Fixes #94305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94446
Approved by: https://github.com/ezyang
2023-02-09 16:01:10 +00:00
47efbd5719 [pytorch] [hygiene] remove legacy buck rules (#94053)
Summary:
Removes legacy buck rules. Specifically, we do the following conversions:
- ["xxx:=yyy"] -> ["xxx[yyy]"]
- "//xxx/yyy" -> "//xxx/yyy:yyy"

Test Plan: CI should pass

Differential Revision: D42999413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94053
Approved by: https://github.com/osalpekar, https://github.com/malfet
2023-02-09 15:45:29 +00:00
4f3858c6d8 [functorch] linearize (#94173)
Fixes https://github.com/pytorch/functorch/issues/724

TODO:
* [x] Docs

NOTE: `const_fold` pass raises UserWarning -> https://github.com/pytorch/pytorch/issues/94374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94173
Approved by: https://github.com/Chillee
2023-02-09 15:45:08 +00:00
a5b052259b Add MPS support for aten::remainder.Tensor_out (#92139)
Fixes #86806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92139
Approved by: https://github.com/kulinseth, https://github.com/DenisVieriu97
2023-02-09 15:32:30 +00:00
4e1bd4abe7 Fix scalar type resolution for optional tensor (#94427)
When a TorchScript Value holds an optional tensor, `dtype()` or `scalarType()` is not available and raises (by design).

The symbolic `_op_with_optional_float_cast` must check whether the tensor is optional before calling the scalar type resolution API. This PR fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94427
Approved by: https://github.com/abock, https://github.com/shubhambhokare1
2023-02-09 15:22:02 +00:00
76ed1a81d1 Revert "COO intersection kernel: respect value intersection order (#92242)"
This reverts commit b07c839b707761b677bf2d729a4d9b13dd2beabe.

Reverted https://github.com/pytorch/pytorch/pull/92242 on behalf of https://github.com/jeanschmidt due to breaking vs17
2023-02-09 14:44:32 +00:00
f165be5a49 tuned best BS with inductor on cpu for E2E models (#94181)
Add 3 more batch size files for the Torchbench/Huggingface/TIMM suites, tuned on an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz.

Fixes #94180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94181
Approved by: https://github.com/ezyang
2023-02-09 13:32:57 +00:00
a81cf49d97 Remove dead functions (#94415)
CR from https://github.com/pytorch/pytorch/pull/94307

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94415
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
2023-02-09 12:37:56 +00:00
e4fe11eecb [MPS] Fix torch.topk for empty tensors and k=0 on mps (#91884)
Fixes #91878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91884
Approved by: https://github.com/kulinseth
2023-02-09 10:42:52 +00:00
19264b50bb [MPS] Add support for nansum on mps (#93845)
* Add `nansum_out_mps` and `nansum_mps` functions
* Moved `get_dtype_from_self` into ReduceOpsUtils.h

Fixes #86809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93845
Approved by: https://github.com/malfet
2023-02-09 10:30:55 +00:00
8a9ea44985 WIP: don't call floor for symint unless necessary (#94365)
Per @ezyang's advice, added a magic sym_int method. This works for the 1.0 * s0 optimization, but can't evaluate `a>0` for some args and still misses some optimizations that the model rewrite achieves, so swin still fails.
(The rewrite replaces `B = int(windows.shape[0] / (H * W / window_size / window_size))` with `B = (windows.shape[0] // int(H * W / window_size / window_size))`, and the model passes.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94365
Approved by: https://github.com/ezyang
2023-02-09 10:05:49 +00:00
8b37eff69f remove abi uncertainty and potential abi conflict (#94306)
Currently there is a potential conflict for `GLIBCXX_USE_CXX11_ABI` configuration if users don't explicitly set this variable.
In `caffe2/CMakeLists.txt`, if the variable is not set, an `abi checker` will be used to retrieve the ABI configuration from the compiler.
https://github.com/pytorch/pytorch/blob/master/caffe2/CMakeLists.txt#L1165-L1183
However, in `torch/csrc/Module.cpp`, if the variable is not set, it will be set to `0`. The conflict happens when the default ABI of the compiler is `1`.
https://github.com/pytorch/pytorch/blob/master/torch/csrc/Module.cpp#L1612

This PR eliminates this uncertainty and potential conflict.
The ABI will be checked and set in `CMakeLists.txt`, and the value is passed to `caffe2/CMakeLists.txt`. Meanwhile, in case `caffe2/CMakeLists.txt` is invoked directly from a `cmake` command, the original GLIBC check logic is kept in that file.
If users don't explicitly assign a value to `GLIBCXX_USE_CXX11_ABI`, the `abi checker` will be executed and set the value accordingly. If the `abi checker` fails to compile or execute, the value will be set to `0`. If users explicitly assign a value, the provided value will be used.

Moreover, if `GLIBCXX_USE_CXX11_ABI` is set to `0`, the `-DGLIBCXX_USE_CXX11_ABI=0` flag won't be appended to `CMAKE_CXX_FLAGS`. Thus, whether ABI=0 or ABI=1 is used depends entirely on the compiler's default configuration. This could cause an issue where, even though users explicitly set `GLIBCXX_USE_CXX11_ABI` to `0`, the compiler still builds the binaries with ABI=1.
https://github.com/pytorch/pytorch/blob/master/CMakeLists.txt#L44-L51
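
For reference, the resulting ABI setting can be inspected from Python; both of the following reflect the flag discussed above:

```python
import torch

# True  -> binaries were built with _GLIBCXX_USE_CXX11_ABI=1
# False -> built with the pre-C++11 ABI
print(torch.compiled_with_cxx11_abi())
print(torch._C._GLIBCXX_USE_CXX11_ABI)  # the raw flag set in torch/csrc/Module.cpp
```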
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94306
Approved by: https://github.com/malfet
2023-02-09 09:54:04 +00:00
02ca2253cc [MPS] Fixes for Binary ops with casting issues from FP to uint8 (#94382)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94382
Approved by: https://github.com/razarmehr
2023-02-09 09:44:02 +00:00
e0e4f1a890 Revert "[functorch] linearize (#94173)"
This reverts commit b6b9e1e6e043ae4b9f41fbbee4f2a9e9a7e7d3d7.

Reverted https://github.com/pytorch/pytorch/pull/94173 on behalf of https://github.com/kshitij12345 due to Broke lint runner
2023-02-09 09:22:39 +00:00
b6b9e1e6e0 [functorch] linearize (#94173)
Fixes https://github.com/pytorch/functorch/issues/724

TODO:
* [x] Docs

NOTE: `const_fold` pass raises UserWarning -> https://github.com/pytorch/pytorch/issues/94374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94173
Approved by: https://github.com/Chillee
2023-02-09 08:57:05 +00:00
81e318353f Align input memory format and grad memory format for GroupNorm backward (#92668)
Fixes the skipped part of the test on https://github.com/pytorch/pytorch/pull/92671. Align the input memory format and the grad memory format for GroupNorm backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92668
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-02-09 08:56:43 +00:00
81bbee7d7e [SDPA] Adds basic correctness checks (#94274)
# Summary
Add more checks around shape constraints, and update sdp_utils to properly catch differing head_dims between q/k and v for flash_attention, which is not supported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94274
Approved by: https://github.com/cpuhrsch
2023-02-09 08:05:26 +00:00
92f569fe11 [Inductor] added aten.geometric_ decomp (#91672)
Fixes #91671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91672
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano
2023-02-09 07:29:14 +00:00
c028fc4e25 Decouple PT2 dynamic shapes from the functorch setting (#94469)
The functorch setting still exists, but it is no longer necessary for PT2:
we infer use of the Python dispatcher by checking whether the ambient
FakeTensorMode has a ShapeEnv.  The setting now only controls direct
AOTAutograd use; for PT2, it's sufficient to use
torch._dynamo.config.dynamic_shapes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94469
Approved by: https://github.com/Chillee, https://github.com/voznesenskym, https://github.com/jansel
2023-02-09 06:41:41 +00:00
c82bb28759 Update autocast policy list on CPU (#92527)
Update autocast policy list on CPU. It depends on #92530.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92527
Approved by: https://github.com/leslie-fang-intel, https://github.com/malfet
2023-02-09 06:40:56 +00:00
2180a0dc0c [FSDP][optim_state_dict] Remove the dead code (#94448)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94448
Approved by: https://github.com/awgu
2023-02-09 06:32:40 +00:00
af5b09182a [PT-D] Update torch.distributed code owners (#94362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94362
Approved by: https://github.com/fduwjj
2023-02-09 05:33:01 +00:00
11f51e798f Upgrade nightly wheels to ROCm5.4.2 (#93090)
Test PR1225: https://github.com/pytorch/builder/pull/1225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93090
Approved by: https://github.com/atalman
2023-02-09 04:53:11 +00:00
cb715c26e2 [MPS] Replace the explicit commit in View ops with adaptive commit (#94218)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94218
Approved by: https://github.com/DenisVieriu97, https://github.com/kulinseth
2023-02-09 04:10:59 +00:00
6d722dba0f [ONNX] Update CI onnx and ORT version (#94439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94439
Approved by: https://github.com/BowenBao
2023-02-09 04:08:38 +00:00
03b9569d2c [vision hash update] update the pinned vision hash (#94455)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94455
Approved by: https://github.com/pytorchbot
2023-02-09 04:03:11 +00:00
bc26890bbe [inductor] Fix args in sink_cat_after_pointwise (#94416)
Summary:
Silly me, I did not realize that dim could be a regular arg as well as
a kwarg in this pass.

Test Plan: New unit test.

Differential Revision: D43098594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94416
Approved by: https://github.com/jansel
2023-02-09 03:40:08 +00:00
fe00722539 Revert "feat(fx): make_fx should be aware of functions wrapped with @fx.wrap (#93273)"
This reverts commit 6a4bf3b71bf28ee6d1feb9608d59c27e3636232c.

Reverted https://github.com/pytorch/pytorch/pull/93273 on behalf of https://github.com/ezyang due to nervous about this before branch cut. lets take our time post branch cut
2023-02-09 03:33:09 +00:00
41e3189222 [PT-D][Tensor parallelism] Add documentations for TP (#94421)
This is far from complete, and we will definitely polish it down the road.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94421
Approved by: https://github.com/wz337
2023-02-09 02:31:06 +00:00
5b8e485a34 [MPS] Add 2d grid sampler (#94273)
Add support for MPS grid sampler
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94273
Approved by: https://github.com/razarmehr
2023-02-09 02:25:46 +00:00
6c80d0a5a5 [MPS] Fix correctness issues with Pool2D ops (#94348)
- Fix wrong results in AvgPool2D when `count_include_pad=True`
- Fix issues with adaptive average and max pool2d
- Remove the redundant blocking copies from `AdaptiveMaxPool2d`
- Add `divisor` to cached string key to avoid conflicts
- Add test case when both `ceil_mode` and `count_include_pad` are True (previously failed).
- Clean up redundant code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94348
Approved by: https://github.com/kulinseth
2023-02-09 02:06:40 +00:00
ca63040d2b Revert "Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)"
This reverts commit 7bfc59993d25c444eccb6cd77e85e4dd0a348b7e.

Reverted https://github.com/pytorch/pytorch/pull/94363 on behalf of https://github.com/huydhn due to This change fails in trunk 7bfc59993d running out of memory.  Mark this as weird because it was green in PR
2023-02-09 01:24:35 +00:00
bb48d90b00 [Executorch][Quant][BE] Refactor Choose_Qparams (#94338)
Summary: Refactor so that it can be decomposed

Test Plan: ci

Differential Revision: D42681268

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94338
Approved by: https://github.com/jerryzh168
2023-02-09 01:20:17 +00:00
1e2d82b8e4 [BE] Merge isinstance calls together (#94419)
Simplifies and speeds up isinstance calls by checking for multiple types at the same time.
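
For illustration, the kind of rewrite applied:

```python
# Before: two separate checks
def is_number_before(x):
    return isinstance(x, int) or isinstance(x, float)

# After: one call with a tuple of types
def is_number_after(x):
    return isinstance(x, (int, float))

assert is_number_before(3) == is_number_after(3)
assert is_number_before("no") == is_number_after("no")
```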

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94419
Approved by: https://github.com/ezyang
2023-02-09 00:47:26 +00:00
f9cc12eebd Remove duplicate CI jobs between pull and trunk (#94426)
These configs are already in the pull settings and so run on trunk.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94426
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-02-09 00:19:20 +00:00
5ea6f59875 Update xla image tag (#94377)
Follow-up to https://github.com/pytorch/xla/pull/4584 to support CUDA 11.7 and sccache.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94377
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-02-09 00:17:37 +00:00
66ae3aa096 [Inductor] added aten.cauchy_ decomp (#92047)
Fixes #91675

TODO: compare perf of the decomposed tan vs. libdevice tan and aten tan for the triton and cpp backends

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92047
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano, https://github.com/ngimel
2023-02-09 00:02:56 +00:00
0ce95c3a17 Dynamo: Support min / max over iterables (#94350)
Expands support for built-in `min` and `max` calls beyond binary to iterables - simply reduce over the existing binary logic.
Adds support for:
* lists
* tuples
* list iterators
* vararg min / max - `min(2, 3, 4)`
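
A hedged sketch of code that should now trace without a graph break (the function body is purely illustrative):

```python
import torch

@torch.compile
def f(x):
    sizes = [x.shape[0], x.shape[1], 4]
    return x.sum() + min(sizes) + max(2, 3, 4)  # min over a list, vararg max

print(f(torch.ones(5, 6)))
```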

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94350
Approved by: https://github.com/voznesenskym, https://github.com/ezyang
2023-02-09 00:02:40 +00:00
53a5c8c7cb Avoid guarding on zero-ness with meta tensors. (#94399)
This removes one of the == 0 tests that occur when you construct a tensor with SymInts. Unfortunately there are more, so I can't test this.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94399
Approved by: https://github.com/albanD
2023-02-09 00:00:44 +00:00
dc70b00d0b Track and record hint on SymNode and use when possible (#94201)
Historically, we computed `size_hint` on the fly by substituting the `var_to_val` mapping into the sympy expression. With this change, we also maintain the hint directly on SymNode (in `expr._hint`) and use it in lieu of Sympy substitution when it is available (mostly guards on SymInt, etc.; in particular, in idiomatic Inductor code, we typically manipulate Sympy expressions directly and so do not have a way to conveniently maintain hints).

While it's possible this will give us modest performance improvements, this is not the point of this PR; the goal is to make it easier to carefully handle unbacked SymInts, where hints are expected not to be available. You can now easily test if a SymInt is backed or not by checking `symint.node.hint is None`.
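
For example, a tiny helper based on the check mentioned above (the helper name is hypothetical):

```python
def is_unbacked(symint) -> bool:
    # Backed SymInts carry a concrete hint recorded at trace time;
    # unbacked ones (e.g. from data-dependent ops) have no hint.
    return symint.node.hint is None
```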

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94201
Approved by: https://github.com/voznesenskym
2023-02-09 00:00:44 +00:00
b5ef37b9a4 Dynamo: Fix graph break when iterating over tensor (#94326)
Supports the following with dynamic shapes:
```python
for element in tensor:
    # do stuff with element
```

Approach follows what's done when `call_range()` is invoked with dynamic shape inputs: guard on tensor size and continue tracing with a real size value from `dyn_dim0_size.evaluate_expr()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94326
Approved by: https://github.com/ezyang
2023-02-08 23:57:06 +00:00
7bfc59993d Set torch.backends.cudnn.enabled to false when testing accuracy (#94363)
Summary: It looks like setting torch.backends.cudnn.deterministic to
True is not enough for eliminating non-determinism when testing
benchmarks with --accuracy, so let's turn off cudnn completely.
With this change, mobilenet_v3_large does not show random failure on my
local environment. Also take this chance to clean up CI skip lists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94363
Approved by: https://github.com/ezyang
2023-02-08 23:30:10 +00:00
04b06c9627 [ONNX] Use optional op to keep None in results for ONNX internal tests (#84789)
All this time, PyTorch and ONNX have had different strategies for None in outputs. In internal tests, we flatten the torch outputs to see if the rest of them match. However, this doesn't work anymore in scripting after the Optional node is introduced, since some of the Nones would be kept.

#83184 forces the script module to keep all Nones from PyTorch, but in ONNX, the model only keeps the ones generated with the Optional node and deletes the meaningless Nones.

This PR uses the Optional node to keep those meaningless Nones in the output as well, so when it comes to script module result comparison, PyTorch and ONNX should have the same number of Nones.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84789
Approved by: https://github.com/BowenBao
2023-02-08 23:04:47 +00:00
b27ac6dc56 [ONNX] Add full checker mode in torch.onnx.export (#83186)
Fix #82589
Why:
1. **full_check** works in the `onnx::checker::check_model` function as it turns on **strict_mode** in `onnx::shape_inference::InferShapes()`, which I think was the intention of this part of the code.
2. **strict_mode** catches failed shape type inference (an invalid ONNX model from the ONNX perspective), and ONNXRUNTIME can't run these invalid models, as ONNXRUNTIME actually relies on ONNX shape type inference to optimize the ONNX graph. Why don't we set it to True by default? Some existing users use other platforms, such as caffe2, to run ONNX models, which doesn't require a valid ONNX model.
3. This PR doesn't change the original behavior of `check_onnx_proto`, but adds a warning message for those models that can't pass strict shape type inference, saying the models would fail on onnxruntime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83186
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi, https://github.com/jcwchen, https://github.com/BowenBao
2023-02-08 22:47:25 +00:00
4e984cb614 [dynamo 3.11] changes to python code object (#93985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93985
Approved by: https://github.com/albanD, https://github.com/malfet, https://github.com/voznesenskym
2023-02-08 22:44:23 +00:00
021d267694 update aten op overload to not use from to avoid compile errors (#89797)
Fix for https://github.com/pytorch/pytorch/issues/93591 by changing `random_.from` to `random_.from_int`.

The previous signature would fail when printed in an fx graph, because `from` is a reserved python keyword. This change affects serialization but I have added an adapter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89797
Approved by: https://github.com/tugsbayasgalan
2023-02-08 22:04:59 +00:00
f2156ef42b Make triton debug util reusable (#94225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94225
Approved by: https://github.com/Chillee
2023-02-08 22:03:35 +00:00
22e1698cf7 [MPS] Add triangular solve op through MPSMatrixSolveTriangular (#94345)
Add triangular solve op support through MPS `MPSMatrixSolveTriangular` kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94345
Approved by: https://github.com/razarmehr
2023-02-08 21:48:12 +00:00
82401c6a69 [BE] Set PYTORCH_TEST_WITH_INDUCTOR only once (#94411)
Setting the same env-var twice should have no effect, unless one is trying mini rowhammer here

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94411
Approved by: https://github.com/jeanschmidt, https://github.com/huydhn, https://github.com/Skylion007
2023-02-08 21:00:40 +00:00
0bf78b57c0 fix: max_unpool3d buffer overflow (#94372)
Fixes #88032

Previously, `output_size` was accessed before the shape length check, which led to a buffer overflow issue.

The fix is simply to perform the check first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94372
Approved by: https://github.com/albanD
2023-02-08 19:48:25 +00:00
3a5a762443 Revert "[quant] Add quantize and dequantize operators to decomposition table (#93312)"
This reverts commit 3fd46a2f9c56c692b242727cb146cfd464210c6a.

Reverted https://github.com/pytorch/pytorch/pull/93312 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it breaks trunk due to a landrace 3fd46a2f9c.  Please rebase and re-land it
2023-02-08 18:29:10 +00:00
6ac0198c02 [CI] Add known ciflow labels to probot (#94368)
Add `collect_ciflow_labels.py` that automatically extracts all labels from workflow files and adds them to pytorch-probot.yml.
The same script can also be used to validate that all tags are referenced in the config.

Add this validation to quickchecks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94368
Approved by: https://github.com/jeanschmidt
2023-02-08 17:37:27 +00:00
c0fe5fb987 [BE] Doc Update: Python 3.7 is past End of Life (#94314)
Python 3.7 is no longer supported
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94314
Approved by: https://github.com/seemethere, https://github.com/malfet
2023-02-08 17:34:45 +00:00
b8de1cf007 [functorch][nn] Refactor NN stateless APIs by swapping module tensors (#92536)
- Fixes #92295
- Resolves #86708
- Resolves #92153
- Closes #92401
- Closes #92218

- Requires #91579

Refactor NN stateless APIs by swapping module tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92536
Approved by: https://github.com/jbschlosser
2023-02-08 17:31:38 +00:00
3fd46a2f9c [quant] Add quantize and dequantize operators to decomposition table (#93312)
Summary:
This PR tries to decompose the operators in the torch.ops.quantized_decomposed namespace into more
primitive aten operators. This would free us from maintaining the semantics of the quantize/dequantize
operators, which can be expressed more precisely in terms of the underlying aten operators.

Note: this PR just adds them to the decomposition table; we haven't enabled this by default yet.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_q_dq_decomposition

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93312
Approved by: https://github.com/vkuzo, https://github.com/SherlockNoMad
2023-02-08 17:26:01 +00:00
cyy
a405c6993f [submodule] update libfmt to tag 9.1.0 (#93219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93219
Approved by: https://github.com/malfet, https://github.com/Skylion007, https://github.com/albanD
2023-02-08 17:21:39 +00:00
8ba87fa525 [dynamo] fix general attr on tensor for user-provided attributes (#94332)
**Problem**: For a tensor `x`, you can assign `x.my_attr = 3.14` and then later access it. Dynamo does not support this right now; it errors out with an AttributeError (it was broken in #91840).

**Fix**: This fixes the problem by catching AttributeErrors in dynamo if we try to access an attr that does not exist on a standard torch.Tensor.

**Tests**: Added tests for accessing and setting attributes to make sure dynamo does not error out.
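
A minimal sketch of the behavior being tested (the attribute name and function are illustrative):

```python
import torch

@torch.compile
def scale(x):
    # Accessing a user-provided attribute should no longer raise AttributeError.
    return x * x.my_scale

t = torch.ones(3)
t.my_scale = 3.14
print(scale(t))
```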

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94332
Approved by: https://github.com/yanboliang
2023-02-08 17:11:18 +00:00
f65a206433 Revert "sparse compressed tensor validation without syncs for low-(batch)dim tensors. (#94048)"
This reverts commit 513b5da3573ffb542ac056dbc6142780a6fb43a5.

Reverted https://github.com/pytorch/pytorch/pull/94048 on behalf of https://github.com/jeanschmidt due to issues with older versions of vs code
2023-02-08 16:51:07 +00:00
e44cd942e3 [MPS] Fix the crash with hardswish_backward() (#94342)
Also fix indentation and formatting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94342
Approved by: https://github.com/kulinseth
2023-02-08 16:42:19 +00:00
eb1aca162e Re-enable cudagraphs for benchmark scripts (#94192)
Related to https://github.com/pytorch/pytorch/pull/93253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94192
Approved by: https://github.com/albanD, https://github.com/desertfire
2023-02-08 16:38:32 +00:00
fe0e28ab87 [decompositions] GRU decompositon with and without packed sequence (#91466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91466
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
5a7c1b7894 [decompositions] LSTM with packed input (#91465)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91465
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
bef61225c3 [decompositions] add decomposition for RNN with packed sequence (#91281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91281
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
e5f6e1f660 [decompositions] add LSTM decomp (#91124)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91124
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
20d01d2dc9 [expanded weights] add RNN support via decomp (#91807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91807
Approved by: https://github.com/albanD
2023-02-08 14:16:30 +00:00
c2a92687e0 [decompositions] add RNN decomp and testing (#91123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91123
Approved by: https://github.com/zou3519
2023-02-08 14:16:30 +00:00
768e547543 Fix SIGFPE in slow_conv3d_forward_out_cpu (#94325)
Set the number of groups to 0 if the weight's second dimension is zero.

`slow_conv_shape_check` will raise an exception if groups are zero anyway.

Fixes SIGFPE reported in https://github.com/pytorch/pytorch/issues/94125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94325
Approved by: https://github.com/albanD
2023-02-08 14:15:39 +00:00
73bf32cb57 Bump to stable ONNX 1.13.0 (#90332)
ONNX had mismatched checker usage between cpp and python, which was later fixed by https://github.com/onnx/onnx/pull/4386. And since `torch.onnx.export` uses the cpp checker for graph-level checks with an older version of ONNX, this improvement should be added. Also, this version bump enables #83186.

Updated 12/5/2022:
This PR includes ONNX 1.13.0 release (https://github.com/onnx/onnx/tree/rel-1.13.0)

For [CVE-2022-25882](https://nvd.nist.gov/vuln/detail/CVE-2022-25882)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90332
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-02-08 11:49:06 +00:00
6f543e0d0a add not_close_error_metas for internal comparison machinery (#90004)
While discussing a possible addition of `assert_not_close` to the API (see #90005 later in the stack), it became clear that we should have an intermediate function that returns a bool-ish value that one can assert on. This PR introduces this function as `are_equal`, a replacement for `assert_equal`. The interface is the same, but instead of raising when a comparison fails, we return the `ErrorMeta`s of all failures and leave it to the caller to handle them. Note that this only applies to errors raised during the comparison stage. Everything else, e.g. only setting `atol` *or* `rtol`, will raise just as before.

We decided to keep this private for now unless there is user demand. The largest issue that needs to be solved before this can become public is the return type: if we have something like `torch.testing.are_close`, we are targeting two use cases:

1. Using it to branch inside code like `if are_close(...):`
2. Using it to assert closeness inside a test like `assert are_close(...)`. This is the default way to assert something with `pytest`

To do that, the return type has to be bool-ish, i.e. be an instance of `bool` or implement `__bool__`. Plus, `bool(are_close())` needs to be `True` if the inputs are close and `False` otherwise. The current logic of `are_close` satisfies the former but violates the latter: in case everything is close, we return an empty list, but `bool([]) is False`.

Directly using an instance of `bool` would work for the requirements above, but then we would have no option to add diagnostics to the error, meaning `assert are_close()` would work but would be non-descriptive.

Using `Tuple[bool, str]` would work in general, but is quite dangerous and unexpected: since all non-empty tuples evaluate to `True`, this can easily hide bugs if the user is not super careful:

```pycon
>>> close = (False, "error message with diagnostics")
>>> assert close[0]
AssertionError: error message with diagnostics
>>> assert close
```

One possible solution here would be a thin custom object:

```py
class Close:
    def __init__(self, flag:bool, msg: str = "") -> None:
        self._flag = flag
        self._msg = msg

    def __bool__(self):
        return self._flag

    def __str__(self):
        return self._msg
```

Now we can do something like

```pycon
close = Close(False, "error message with diagnostics")  # coming from are_close
>>> if not close:
...     print("It works!")
It works!
>>> assert close
AssertionError
>>> assert close, close  # This looks weird, but does its job
AssertionError: error message with diagnostics
```

But this means we introduce another abstraction that the user has to deal with.

To reiterate, we are not going to make `are_close` public until there is user demand, since none of the options above is without flaws.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90004
Approved by: https://github.com/mruberry, https://github.com/malfet
2023-02-08 11:22:55 +00:00
566eb49ed2 minor internal cleanup in assert_close (#90003)
Per title. I'm going to highlight them with inline comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90003
Approved by: https://github.com/mruberry, https://github.com/malfet
2023-02-08 11:22:55 +00:00
bbe33532ae Rename DynamicShapeVariable to SymNodeVariable because that's what it is (#94152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94152
Approved by: https://github.com/ezyang
2023-02-08 10:41:10 +00:00
cd057390b5 [quant][fx][pt2e] cleanup the args for some helper functions (#94352)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizeFx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94352
Approved by: https://github.com/vkuzo
2023-02-08 08:39:21 +00:00
1767026d1e Abstract the optimization context information as a dedicated class to better organize the code (#92057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92057
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-08 08:25:22 +00:00
e0c24ec2a5 Print fqn in the warning message (#94313)
Print the fqn in the warning message; also make the "else" match the "if" in _apply_to_modules().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94313
Approved by: https://github.com/fegin
2023-02-08 06:45:53 +00:00
e16daa78a0 [PT-D][Checkpoint] Turn on all default planner flags (#92933)
Fixes #92823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92933
Approved by: https://github.com/kumpera
2023-02-08 06:30:45 +00:00
230c4fe93d [GHF] Fix pushDate handling (#94364)
Merge commits do not have a merge date, which is also clear from the [GraphQL schema](https://docs.github.com/en/graphql/reference/objects#commit).
Modify the return signature of `GitHubPR.last_pushed_at`, print a warning when it cannot be queried, and add a regression test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94364
Approved by: https://github.com/huydhn
2023-02-08 05:52:03 +00:00
5fe72b8716 [Dynamo] modify dynamo ipex backend (#94169)
1. Extend fake_tensor_unsupported to support dynamic shapes mode.
2. Use fake_tensor_unsupported  in dynamo ipex backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94169
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-08 05:10:42 +00:00
877482ebc4 [MPS] Fix crashes in several backward ops (#94343)
This should fix the hard crashes in several backward-pass ops for sigmoid, tanh, masked_fill, linear, prelu, etc.
The test cases that this patch fixes are part of a bigger change in TestConsistency and will be upstreamed as a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94343
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-02-08 04:47:28 +00:00
61ecaf1dd4 [vision hash update] update the pinned vision hash (#94358)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94358
Approved by: https://github.com/pytorchbot, https://github.com/malfet
2023-02-08 04:03:30 +00:00
5f25c0831c Cleanup hung Windows processes (#94357)
Follow https://stackoverflow.com/questions/40585754/powershell-wont-terminate-hung-process to see if the hung python process can be killed completely

```
C:\Jenkins\Miniconda3\python.exe -bb test_ops.py -v --use-pytest -vv -rfEX -x --reruns=2 --shard-id=0 --num-shards=2 "-k=not linalg_cholesky" --import-slow-tests --import-disabled-tests
```

The command `Get-Process -Name $process -ErrorAction Stop | Stop-Process -Force` doesn't stop this process as expected.

### Testing

1. Spin up a local python process on the Windows runner: `C:\Jenkins\Miniconda3\python.exe debug.py`
2. See that the process is running

```
Get-WmiObject -Class Win32_Process -Filter "Name LIKE 'python%' AND CommandLine LIKE '%debug%'"

__GENUS                    : 2
__CLASS                    : Win32_Process
__SUPERCLASS               : CIM_Process
__DYNASTY                  : CIM_ManagedSystemElement
__RELPATH                  : Win32_Process.Handle="8812"
__PROPERTY_COUNT           : 45
__DERIVATION               : {CIM_Process, CIM_LogicalElement, CIM_ManagedSystemElement}
__SERVER                   : EC2AMAZ-S19AQ2Q
__NAMESPACE                : root\cimv2
__PATH                     : \\EC2AMAZ-S19AQ2Q\root\cimv2:Win32_Process.Handle="8812"
Caption                    : python.exe
CommandLine                : "C:\Jenkins\Miniconda3\python.exe" debug.py
CreationClassName          : Win32_Process
CreationDate               : 20230208002358.569943+000
CSCreationClassName        : Win32_ComputerSystem
CSName                     : EC2AMAZ-S19AQ2Q
Description                : python.exe
ExecutablePath             : C:\Jenkins\Miniconda3\python.exe
ExecutionState             :
Handle                     : 8812
HandleCount                : 82
InstallDate                :
KernelModeTime             : 312500
MaximumWorkingSetSize      : 1380
MinimumWorkingSetSize      : 200
Name                       : python.exe
OSCreationClassName        : Win32_OperatingSystem
OSName                     : Microsoft Windows Server 2019 Datacenter|C:\Windows|\Device\Harddisk0\Partition1
OtherOperationCount        : 1135
OtherTransferCount         : 150908
PageFaults                 : 2442
PageFileUsage              : 5020
ParentProcessId            : 5396
PeakPageFileUsage          : 5120
PeakVirtualSize            : 4368465920
PeakWorkingSetSize         : 9424
Priority                   : 8
PrivatePageCount           : 5140480
ProcessId                  : 8812
QuotaNonPagedPoolUsage     : 8
QuotaPagedPoolUsage        : 63
QuotaPeakNonPagedPoolUsage : 8
QuotaPeakPagedPoolUsage    : 63
ReadOperationCount         : 88
ReadTransferCount          : 519894
SessionId                  : 0
Status                     :
TerminationDate            :
ThreadCount                : 1
UserModeTime               : 156250
VirtualSize                : 4362371072
WindowsVersion             : 10.0.17763
WorkingSetSize             : 9592832
WriteOperationCount        : 0
WriteTransferCount         : 0
PSComputerName             : EC2AMAZ-S19AQ2Q
ProcessName                : python.exe
Handles                    : 82
VM                         : 4362371072
WS                         : 9592832
Path                       : C:\Jenkins\Miniconda3\python.exe
```

3. Kill it
```
(Get-WmiObject -Class Win32_Process -Filter "Name LIKE 'python%' AND CommandLine LIKE '%debug%'").terminate()

__GENUS          : 2
__CLASS          : __PARAMETERS
__SUPERCLASS     :
__DYNASTY        : __PARAMETERS
__RELPATH        :
__PROPERTY_COUNT : 1
__DERIVATION     : {}
__SERVER         :
__NAMESPACE      :
__PATH           :
ReturnValue      : 0
PSComputerName   :
```

4. Confirm that the process is killed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94357
Approved by: https://github.com/clee2000, https://github.com/malfet
2023-02-08 03:45:41 +00:00
68b35017a9 Tiny unimplemented improvements (#94150)
fix names

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94150
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-02-08 02:57:29 +00:00
b191a5f75f Remove overly strict assert, add test (#94151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94151
Approved by: https://github.com/ezyang
2023-02-08 02:57:29 +00:00
88ef4739b2 Check the semantic of loading the mask value (#91755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91755
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-08 02:34:22 +00:00
83275d8cdf add torch.autograd._set_view_replay_enabled, use in aot autograd (#92588)
tldr; this should fix some minor perf regressions that were caused by adding more as_strided() calls in aot autograd.

This PR adds a new context manager, `torch.autograd._set_view_replay_enabled()`.

Context: AOT Autograd has special handling for "outputs that alias graph intermediates". E.g. given this function:

```
def f(x):
    y = torch.mul(x, 2)
    out = y.view(-1)
    return out
```

AOT Autograd will do the following:

```
def fn_to_compile(x):
    y = torch.mul(x, 2)
    out = y.view(-1)
    # return the graph intermediate
    return y, out

compiled_fn = compile(fn_to_compile)

def wrapper(x):
    y, out = compiled_fn(x)
    # regenerate the alias of the graph intermediate
    return out._view_func(y)
```

What's annoying is that `out._view_func()` will result in a `.as_strided` call, because `out` is an ordinary runtime tensor. This (likely?) caused a perf regression, because when running the backward, our `as_strided_backward()` is slower than our `view_backward()`.

In this PR, I added some TLS for instructing autograd to do view replay instead of as_strided, even when given a normal tensor. I'm definitely interested in thoughts from autograd folks (cc @albanD @soulitzer). A few points that I want to bring up:

(1) One reason that this API seems generally useful to me is because of the case where you `torch.compile()` a function, and you pass in two inputs that alias each other, and mutate one of the inputs. Autograd is forced to add a bunch of as_strided() calls into the graph when this happens, but this would give users an escape hatch for better compiled perf in this situation

(2) To be fair, AOT Autograd probably won't need this TLS in the long term. There's a better (more complicated) solution, where AOT Autograd manually precomputes the view chain off of graph intermediates during tracing, and re-applies them at runtime. This is kind of complicated though and feels lower priority to implement immediately.

(3) Given all of that I made the API private, but lmk what you all think.

This is a followup of https://github.com/pytorch/pytorch/pull/92255.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92588
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-08 01:48:32 +00:00
333e771394 Add benchmarks.py to run all benchmarks, add new file with all torchbench model names (#94146)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94146
Approved by: https://github.com/ezyang
2023-02-08 01:18:38 +00:00
cyy
5fa7120722 Simplify CMake CUDNN code (#91676)
1. Move CUDNN code to a separate module.
2. Merge CUDNN public and private targets into a single private target. There is no need to expose the CUDNN dependency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91676
Approved by: https://github.com/malfet
2023-02-08 01:06:10 +00:00
cyy
9291f9b9e2 Simplify cmake code (#91546)
We use various newer CMake features to simplify the build system:
1. Caffe2::threads is replaced by threads::threads.
2. Some unused MSVC flags are removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91546
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-08 01:05:19 +00:00
c981b7e572 [MPS] Add MPSAllocatorInterface to access methods of MPSAllocator (#94327)
This is a prerequisite for the upcoming PRs for the MPS Modules and Memory Leak Detection features.
Also added pragma once to headers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94327
Approved by: https://github.com/kulinseth
2023-02-08 00:59:36 +00:00
51b487bf51 [inductor] fix cpu implementation of argmax / argmin (#94165)
Fixes #94055

When the reduction numel equals 1, the inner function of argmax / argmin is `return 0`. This inner function loses the data type of `0`, which may result in conflicting types for subsequent calculations. This PR keeps the data type in the inner function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94165
Approved by: https://github.com/jgong5, https://github.com/Neilblaze, https://github.com/jansel
2023-02-08 00:54:10 +00:00
94394e568e change the dynamo benchmark timeout as a parameter (#94284)
Change the dynamo benchmark timeout from a hard-coded value to a parameter with a default of 1200ms, because the hard-coded 1200ms timeout caused some single-thread-mode models to crash on the CPU platform. With the parameter, users can specify the timeout freely.

Fixes #94281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94284
Approved by: https://github.com/malfet
2023-02-08 00:45:08 +00:00
f48b4d8842 Handle sympy in split (#94285)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94285
Approved by: https://github.com/SherlockNoMad, https://github.com/ezyang, https://github.com/ngimel, https://github.com/jansel
2023-02-08 00:32:19 +00:00
3ce1ebb6fb Apply some safe comprehension optimizations (#94323)
Optimize unnecessary collection cast calls, unnecessary calls to list, tuple, and dict, and simplify calls to the sorted builtin. This should strictly improve speed and improve readability.
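For example, the kind of rewrites this covers looks roughly like the following (hypothetical snippets, not taken from the diff):

```python
data = {"b": 2, "a": 1}

keys_old = sorted(list(data.keys()))            # redundant list() call
keys_new = sorted(data)                          # same result, one less copy

vals_old = tuple([v for v in data.values()])     # redundant list comprehension
vals_new = tuple(data.values())

assert keys_old == keys_new and vals_old == vals_new
```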

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94323
Approved by: https://github.com/albanD
2023-02-07 23:53:46 +00:00
bef2483ed8 [NestedTensor] Call contiguous in linear backward (#94317)
Fixes #94303

If the incoming grad for linear_backward was discontiguous, we would throw a torch check. This updates the implementation to instead call contiguous and changes the check to an internal assert.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94317
Approved by: https://github.com/mikaylagawarecki
2023-02-07 23:43:46 +00:00
ab4fe01e72 [FSDP][optim_state_dict] Returns the initial states of the empty parameters for KeyedOptimizer/NamedOptimizer (#94130)
KeyedOptimizer and NamedOptimizer expect the states to exist in the state_dict when `load_state_dict` is called, even if the corresponding parameters are empty (size == 0). This PR adds support to make KeyedOptimizer work with `use_orig_params=True`.

Differential Revision: [D43019458](https://our.internmc.facebook.com/intern/diff/D43019458/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94130
Approved by: https://github.com/rohan-varma
2023-02-07 23:36:56 +00:00
ec25db7741 torch.inference_mode: add type hints (#94223)
Copied the type hints from the other context managers.

Not sure how to add type hints for `clone` since it returns the same class. The `Self` type isn't introduced until Python 3.11 and mypy just recently added support for it. Could also use `"inference_mode"` with quotes to avoid using it before it's declared, or `from __future__ import annotations` to allow its use without quotes. Or we could just skip it.
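A small sketch of the options mentioned above for a method returning its own class (the class body here is hypothetical, not the actual `inference_mode` implementation):

```python
from __future__ import annotations  # option: postponed evaluation of annotations

class inference_mode:
    def __init__(self, mode: bool = True) -> None:
        self.mode = mode

    # A quoted forward reference also works without the __future__ import.
    def clone(self) -> "inference_mode":
        return self.__class__(self.mode)

# On Python 3.11+ (or with typing_extensions), `def clone(self) -> Self:` is
# the cleanest option.
```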
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94223
Approved by: https://github.com/albanD
2023-02-07 23:16:55 +00:00
75e04f6dad Test enabling full testing on 3.11 for linux (#94056)
Testing what happens if we run everything right now.
Will remove the broken stuff to get a mergeable version next.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94056
Approved by: https://github.com/malfet
2023-02-07 23:02:13 +00:00
34bbd7af87 Use the right run_test for inductor opinfo tests (#94312)
One of the side effects of this is that it is not properly skipped on 3.11.
As a side note, it was very surprising to find testing-specific code in `torch._dynamo` and not `torch.testing`...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94312
Approved by: https://github.com/ezyang
2023-02-07 23:02:13 +00:00
d16c2c36ad Add another missing decomp (#94113)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94113
Approved by: https://github.com/jansel
2023-02-07 21:32:56 +00:00
6b8eb0eb04 [vulkan] Add core graph components (#94222)
Summary:
This diff introduced the core components needed for the Vulkan Graph runtime.

* ComputeGraph data structure
* Value data structure
* Copy node
* Add node with option for prepacked weights

Test Plan:
Run the `delegate_experiment` binary.

```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_use_gpu_diagnostics=1 :delegate_experimentAppleMac\#macosx-arm64
```

Differential Revision: D42614155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94222
Approved by: https://github.com/salilsdesai
2023-02-07 21:15:17 +00:00
8fce9a09cd [BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308)
Apply parts of pyupgrade to torch (starting with the safest changes).
This PR only does two things: removes the need to inherit from object and removes unused future imports.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-07 21:10:56 +00:00
567e6152da Revert "[inductor] fix crash issue when input is a view tensor (#90150)" (#94329)
Had to provide a merge conflict resolution due to conflicts with https://github.com/pytorch/pytorch/pull/94118

This was causing issues with internal tests that look similar to:
```
in clone_preserve_strides
    x.size(), x.stride(), x.storage_offset()
AttributeError: 'KeyedJaggedTensor' object has no attribute 'size'
```

See https://fburl.com/testinfra/nc0du2sp for more information

This reverts commit #90150

@jansel can you help @blzheng with re-landing this as a co-development diff?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94329
Approved by: https://github.com/jansel
2023-02-07 20:45:58 +00:00
7b3217e6a2 Add deprecation warning to reduce flag of scatter for Tensor src and redirect to scatter_reduce (#94282)
Address #94082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94282
Approved by: https://github.com/albanD
2023-02-07 20:22:22 +00:00
748bac8757 [BE]: Apply pyupgrade yield from and unit test alias upgrades (#94309)
Applies some more harmless pyupgrades. This one gets rid of deprecated aliases in unit tests and upgrades more `yield` for-loops into `yield from` generators, which are more performant and propagate more information and exceptions from the original generator. This is the modern recommended way of forwarding generators.
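As a small illustration of the `yield from` rewrite described above:

```python
# Before: a manual loop that only re-yields items.
def iter_items_old(gen):
    for item in gen:
        yield item

# After: `yield from` forwards items, send() values, and exceptions
# from the inner generator.
def iter_items_new(gen):
    yield from gen
```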
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94309
Approved by: https://github.com/albanD
2023-02-07 20:08:58 +00:00
895d4781b8 [easy] Add NestedTensorMeta to parseDispatchKey (#94279)
Ran into this when trying to use `torch.library.Library("aten", "IMPL", "NestedTensorMeta")`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94279
Approved by: https://github.com/bdhirsh
2023-02-07 19:46:29 +00:00
8c835a9e52 Factor out SYMPY_INTERP (#94307)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94307
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-02-07 19:23:11 +00:00
e1f17b3530 Add CSR->BSC and CSC->BSR conversions (#93301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93301
Approved by: https://github.com/cpuhrsch
2023-02-07 19:22:05 +00:00
d690a596dc Fast path binary ops in fake tensor (#94047)
Fast path execution of a few binary ops in fake tensor, to speed up trace time. When testing `python benchmarks/dynamo/timm_models.py --accuracy --timing --backend aot_eager --dynamic-shapes --float32 --only hrnet_w18`, I get the following trace speedup.

Before:

```
cuda eval  hrnet_w18                           PASS
TIMING: entire_frame_compile:53.97591 backend_compile:33.60832
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:4995 | FakeTensorMode.__torch_dispatch__:89985 | ProxyTorchDispatchMode.__torch_dispatch__:3010
```

After:

```
cuda eval  hrnet_w18                           PASS
TIMING: entire_frame_compile:40.18931 backend_compile:25.28828
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:4995 | FakeTensorMode.__torch_dispatch__:69478 | attempt fast:4399 | fast is_contiguous:4399 | ProxyTorchDispatchMode.__torch_dispatch__:3010
```

My experiment notebook can be found at https://docs.google.com/document/d/1_dTIQUwjIVnEWmiFAavJQYVF8uzXqD9Dk6b9gGQLF_U/edit#

This is not the "most" optimized version of the code; compared with Horace/Voz roofline experiment:

```
diff --git a/torch/_subclasses/fake_tensor.py b/torch/_subclasses/fake_tensor.py
index e3bf545f3b8..395942c6ffe 100644
--- a/torch/_subclasses/fake_tensor.py
+++ b/torch/_subclasses/fake_tensor.py
@@ -774,6 +774,10 @@ class FakeTensorMode(TorchDispatchMode):
     def __torch_dispatch__(self, func, types, args=(), kwargs=None):
         kwargs = kwargs if kwargs else {}

+        with no_dispatch():
+            if func in {aten.mul.Tensor, aten.add.Tensor, aten.sub.Tensor, aten.relu.default}:
+                return FakeTensor(self, torch.empty(args[0].shape, device='meta'), device='cuda')
+
         if func == torch.ops.prim.device.default:
             assert len(args) == 1 and isinstance(args[0], FakeTensor)
             if args[0].fake_mode.in_kernel_invocation:
```

I am still leaving about 5s of trace time improvement on the table (3s of which is attributable to not yet handling relu.)

The implementation here is based off of https://github.com/pytorch/pytorch/pull/93118/ but I modeled the short circuit logic off of TensorIterator's implementation, for ease of code review and correctness verification. However, there are some important divergences:

* Traditional fast setup in TensorIterator only short circuits if the shapes of all input elements are equal. On hrnet_w18, only 5% of fastpath'ed binary operators actually satisfy this. So instead, I compute the broadcasted shape, but then I only allow the fast path if (1) at least one input tensor has a shape that is exactly the output size, and (2) all the tensors are contiguous (or if all the tensors are channels last).
* I had to manually adjust the logic to handle wrapped numbers (which ordinarily are handled by wrapping into tensors). I think I got this right.

Some evidence that this heuristic is correct is here in: https://gist.github.com/ezyang/b22fa7b72b7349137211d8dc7041f758 I exhaustively test all dim=3 tensors with sizes [1, 2] and show that we get the same significant strides between PrimTorch and the new algorithm. In fact, there ARE differences between this algorithm and PrimTorch, but in fact this algorithm agrees with TensorIterator where PrimTorch is wrong (sample case: size=(1, 1, 2), stride=(1, 1, 1), stride=(1, 1, 1))
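As a rough sketch of the short-circuit condition described above (illustrative only; the actual PR also handles wrapped numbers and operates on fake tensors inside the mode):

```python
import torch

def binary_fast_path_ok(a: torch.Tensor, b: torch.Tensor) -> bool:
    out_shape = torch.broadcast_shapes(a.shape, b.shape)
    # (1) at least one input already has exactly the broadcasted output shape
    has_full_shape = a.shape == out_shape or b.shape == out_shape
    # (2) all inputs are contiguous, or all inputs are channels-last
    all_contig = a.is_contiguous() and b.is_contiguous()
    all_channels_last = (
        a.dim() == 4 and b.dim() == 4
        and a.is_contiguous(memory_format=torch.channels_last)
        and b.is_contiguous(memory_format=torch.channels_last)
    )
    return has_full_shape and (all_contig or all_channels_last)
```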

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94047
Approved by: https://github.com/eellison
2023-02-07 18:34:24 +00:00
0603f4ff14 temp fix for segment reduce undocumented FC window (#94242)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94242
Approved by: https://github.com/malfet
2023-02-07 18:27:01 +00:00
a88c15a849 Build Windows binaries with Visual Studio 2022 Build Tools (#90855)
This PR enables VS 2022 binaries for build and test jobs. Another PR pytorch/builder#1240 is doing majority of the work.

Closes #87695.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90855
Approved by: https://github.com/jeanschmidt, https://github.com/seemethere
2023-02-07 18:15:29 +00:00
e0950fccfa [SDPA] Add expanded autograd testing for fused kernels and disable head_dim128 sm86 mem-efficient (#94009)
# Summary
- Adds a large parameter sweep for testing the various configs a user can call sdpa with and compares the deviation of the fused kernels vs the eager math fallback to test for correctness.
- Sm86 + head_dim==128 is throwing an IMA (illegal memory access) for memory-efficient attention. We add a filter for use_mem_efficient_attention(). This has since been fixed in the upstream Xformers version but will likely not make it before the branch cut.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94009
Approved by: https://github.com/cpuhrsch
2023-02-07 18:04:48 +00:00
7bba87ed06 add rsub decomposition with alpha (#94144)
Fixes #93376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94144
Approved by: https://github.com/desertfire
2023-02-07 17:21:13 +00:00
e9533767af trymerge to ignore certain failures (#91134)
Any failure listed in Dr. CI as "flaky" or "broken trunk" (i.e., anything not under "new failures") gets marked as "ok to fail".

If there are only a small number (currently set to 3) of "ok to fail" jobs, the merge can still continue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91134
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere
2023-02-07 17:19:57 +00:00
b07c839b70 COO intersection kernel: respect value intersection order (#92242)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92242
Approved by: https://github.com/cpuhrsch, https://github.com/amjames
2023-02-07 17:05:28 +00:00
0b2dc3b3ac [Py-3.11] Skip dynamo related tests (#94187)
The quantization test fails to import Dynamo as expected.
The traceback tool looks a lot trickier; opened https://github.com/pytorch/pytorch/issues/94189 to investigate further.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94187
Approved by: https://github.com/malfet
2023-02-07 16:40:55 +00:00
5d48392abb [MPS] Skip gather/blit calls in case of strided output (#94260)
Skip gather/blit calls in case of strided output - this prevents:

- allocating additional memory for the output
- additional transpose for both the input and output
Fixes:
```
x = torch.rand((256,10), device='mps')
x = x.permute(1,0)
x.exp()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94260
Approved by: https://github.com/razarmehr
2023-02-07 16:25:03 +00:00
86ae14deaa [MPS] Fix MPSGraph casting issue to MPSDataTypeBool in masked_fill op (#94263)
Fixes TestConsistency masked_fill for bool data type.

Casting a tensor > 1 to MPSDataTypeBool will result in 0 instead of 1. This change manually casts the scalar to a value of 0 or 1 when casting a non-boolean tensor to a boolean tensor:
```
(inputDataType == MPSDataTypeBool) ? !!value.to<double>() : value.to<double>()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94263
Approved by: https://github.com/razarmehr
2023-02-07 16:20:55 +00:00
e3ac109618 [MPS] Fallback on gather code to solve view tensors when a slice is followed by a reshape (#94278)
There are cases when the arrayViewTensor API cannot be used to solve the view operations, such as when a view dimension is bigger than the base dimension of the tensor, e.g:
```
base shape: [1, 768, 512, 2] // we cannot slice the base shape in any way to result in first dimension `2`
view shape: [2, 384, 512, 1]
```
On such cases, we need to fallback on the gather code (that detects this is a slice followed by a reshape) to solve this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94278
Approved by: https://github.com/razarmehr
2023-02-07 16:20:08 +00:00
4cd086b14c [MPS] Raise error for int64 inputs of dot operator. (#94270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94270
Approved by: https://github.com/razarmehr
2023-02-07 16:12:17 +00:00
b654d1494b [MPS] Fix the argument error for tensor_split() test (#94234)
The second tensor argument `tensor_indices_or_sections` of tensor_split() must be on CPU when testing it in TestConsistency. Otherwise it will error out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94234
Approved by: https://github.com/kulinseth
2023-02-07 15:56:49 +00:00
a3ca66c69e [MPS] Remove the unused code for view lists in OperationUtils.h (#94265)
Clean up redundant code that was added before and not needed anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94265
Approved by: https://github.com/kulinseth
2023-02-07 15:56:05 +00:00
a0a3728069 [MPS] Don't reset the Graph state (#94283)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94283
Approved by: https://github.com/razarmehr
2023-02-07 15:52:44 +00:00
36062dd2b4 [MPS] Fix the crash in View ops when slicing wrong lengths (#94259)
The offset + length of the destination tensor should not be larger than the source's length when slicing.

Fixes #94190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94259
Approved by: https://github.com/malfet
2023-02-07 15:51:26 +00:00
bf4fe5dddd General in-place binary op support in dynamo (#94203)
Continues the approach taken in #93271, expanding support to in-place binary ops (e.g. `__iadd__`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94203
Approved by: https://github.com/ezyang
2023-02-07 15:12:32 +00:00
f954498edf Dynamo: Fix to unpack ConstantVariable in call_range() (#94202)
Fixes the `pyhpc_turbulent_kinetic_energy` model in torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94202
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
2023-02-07 15:12:00 +00:00
c4544bc169 Fix thread-allocation in _vec_log_softmax_lastdim (#85398)
## Problem history

There seems to always have been a bug in `_vec_log_softmax_lastdim`.
In particular, there were two issues with it:

#### Bug 1
 Before AVX512 support was added, `CHUNK_SIZE` had been heuristically chosen in `_vec_log_softmax_lastdim`:
 `CHUNK_SIZE = (128 / sizeof(scalar_t)) * Vec::size();`

It was  `256` for float32, bfloat16, and float16.
When AVX512 support was added, `CHUNK_SIZE` became `512`.

The rationale behind determining `CHUNK_SIZE` has not been described, and seems flawed, since the number of OpenMP threads used currently depends upon it.

#### Bug 2
`grain_size` had been defined as `internal::GRAIN_SIZE / (16 * dim_size * CHUNK_SIZE)`.
So `grain_size` was usually 0 (it evaluates to roughly `8 / dim_size`), and it was therefore always replaced by `CHUNK_SIZE`, viz. 256.
Since `256` was always the `grain_size` for `at::parallel_for`, few threads were used in certain cases.

#### Problem caused by bugs
With `outer_size` of say, 700, only 3 threads would have been used with AVX2, irrespective of the value of `dim_size`!
When AVX512 support was added, since `CHUNK_SIZE` became `512`, only 2 threads were used if `outer_dim` was 700.
In the Transformers training example, `log_softmax` was computed on the last dim of a tensor of shape `(700, 23258)`.
AVX512 thus appeared to be quite slower, cloaking the actual issue that even AVX2 performance for the kernel was quite poor due to inefficient work distribution amongst OpenMP threads.
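To make the numbers concrete, an illustrative back-of-the-envelope calculation (assuming `at::internal::GRAIN_SIZE == 32768`, float32, and AVX2 where `Vec::size() == 8`):

```python
CHUNK_SIZE = (128 // 4) * 8                # 256 rows per chunk for float32 / AVX2
GRAIN_SIZE = 32768                          # assumed at::internal::GRAIN_SIZE
outer_size, dim_size = 700, 23258

grain_size = GRAIN_SIZE // (16 * dim_size * CHUNK_SIZE)   # 0, so CHUNK_SIZE is used
work_items = (outer_size + CHUNK_SIZE - 1) // CHUNK_SIZE  # ceil(700 / 256) == 3
print(grain_size, work_items)               # 0 3 -> at most 3 threads get work
```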

## Solution
Distribute work more efficiently, which would result in higher performance for both AVX2 & AVX512 than now,
and fixes the regression observed with AVX512 (AVX512 kernel would now be faster than its AVX2 counterpart).

## Benchmarks

##### Machine-config:
Intel(R) Xeon(R) Platinum 8371HC CPU (Cooper Lake)
One socket of 26 physical cores was used.
Intel OpenMP & tcmalloc were preloaded.

Example of a command to run benchmark:
`ATEN_CPU_CAPABILITY=avx512 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 MKL_NUM_THREADS=26 OMP_NUM_THREADS=26 numactl --membind=0 --cpunodebind=0 python3.8 -m pt.softmax_test --test_name LogSoftmax_N1024_seq_len23258_dim1_cpu`

Benchmark | Old implementation time (us) | New implementation time (us) | Speedup ratio (old/new)
-- | -- | -- | --
LogSoftmax_N1024_seq_len23258_dim1_cpu AVX2 | 11069.281 | 2651.186 | 4.17x
LogSoftmax_N1024_seq_len23258_dim1_cpu  AVX512 | 18292.928 | 2586.550| 7.07x
LogSoftmax_N700_seq_len23258_dim1_cpu  AVX2 | 9611.902 | 1762.833 | 5.452x
LogSoftmax_N700_seq_len23258_dim1_cpu  AVX512 | 12168.371  | 1717.824 | 7.08x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85398
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/peterbell10, https://github.com/lezcano
2023-02-07 15:09:05 +00:00
a2ac25f63e update test fixture (#89796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89796
Approved by: https://github.com/davidberard98
2023-02-07 14:58:57 +00:00
513b5da357 sparse compressed tensor validation without syncs for low-(batch)dim tensors. (#94048)
As per title. Sync is still unavoidable for super high-dim tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94048
Approved by: https://github.com/alexsamardzic, https://github.com/cpuhrsch
2023-02-07 12:43:12 +00:00
42b6bcdb13 [BE] Add empty tensor check to _compute_linear_combination (#94245)
Fixes https://github.com/pytorch/pytorch/issues/94124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94245
Approved by: https://github.com/lezcano
2023-02-07 11:31:11 +00:00
a28a062938 [Inductor] Fix CPU vectorized implementation of mask calculation that breaks torch.where (#93922)
Fix https://github.com/pytorch/pytorch/issues/93374

The cause of the issue is that the original vectorized float mask calculation doesn't consider the broadcast case. This PR adds the support.
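An illustrative repro shape for the broadcast pattern (hypothetical sizes; the exact reproducer is in #93374):

```python
import torch

def fn(cond, a, b):
    return torch.where(cond, a, b)

cond = torch.rand(8, 1) > 0.5            # mask broadcasts along the last dimension
a, b = torch.rand(8, 16), torch.rand(8, 16)

compiled = torch.compile(fn)              # CPU/inductor path with the vectorized mask
assert torch.equal(compiled(cond, a, b), fn(cond, a, b))
```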

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93922
Approved by: https://github.com/XiaobingSuper, https://github.com/desertfire, https://github.com/jansel
2023-02-07 11:30:21 +00:00
0e94fbc0c8 [inductor] bug fix: use create_symbolic_sizes_strides_storage_offset (#94031)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94031
Approved by: https://github.com/ezyang
2023-02-07 09:52:25 +00:00
900e09c872 [Dynamo] Support torch.Tensor.fn as TorchVariable, not UserDefinedObjectVariable, preventing graph break (#93243)
As found in #92709, thanks to @ngimel and @jansel, currently `torch.Tensor.fn` points to `UserDefinedObjectVariable` rather than `TorchVariable`. The root cause is due to https://github.com/pytorch/pytorch/pull/92709#pullrequestreview-1273357406. To prevent this, build `TorchVariable`  of `torch.Tensor.fn` pointing to `torch.ops.aten.fn`.

This issue propagates to `torch.Tensor.fn` causing graph break with `nopython=True`.
```python
import torch
import torch._dynamo as dynamo

#op = torch.ops.aten.abs_ # no graph break
op = torch.Tensor.abs_ # graph break
args = torch.empty(10)

def foo(args):
    return op(args)

opt_foo = dynamo.optimize("inductor", nopython=True)(foo)
y_ = opt_foo(args)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93243
Approved by: https://github.com/jansel
2023-02-07 09:26:50 +00:00
d6dec1a5cf Refactor sharding data pipe into a separate file (#94095)
Move `ShardingFilterIterDataPipe` into a dedicated file.

Also, propose having a dedicated parent class (`_ShardingIterDataPipe`) for sharding datapipes, as this seems more like a "system/engine-level" datapipe that gives strong hints to RS on how to execute and needs first-class-citizen treatment in RS (compared with other "user-level" datapipes that are mostly composable `Callable[[Iterable], Iterable]`). This way we don't need to rely on whether `is_shardable` and `apply_sharding` are present on the DataPipe in `graph_settings.py`. But this is open to other discussions.

Open question: Should [ShardingRoundRobinDispatcherIterDataPipe](01fc762003/torchdata/datapipes/iter/util/sharding.py (L16-L17)) also be considered a `_ShardingIterDataPipe`? (e.g. this sharding is executed by replicating the metadata, while `ShardingRoundRobinDispatcherIterDataPipe` hints that it is too expensive to replicate and so requires round-robin data exchange/dispatch).

Differential Revision: D43014692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94095
Approved by: https://github.com/ejguan, https://github.com/NivekT
2023-02-07 09:12:02 +00:00
59c1b5025f [quant][fx][pt2e] Refactor prepare so it's aligned better with the new API plan in pt2e (#94011)
Summary:
There are three things that happen in the current prepare code:
(1) users express their intention of how they want the model to be quantized with QConfigMapping, and we translate that to
node.meta["target_dtype_info"]
(2) we validate the setting against BackendConfig
(3) we insert observers based on the validated node.meta["target_dtype_info"]

Previously (2) and (3) were mixed together. This PR tries to move (2) closer to (1), with one edge case left; this refactor
moves us closer to our target design for quantization in the PyTorch 2.0 export path.

This is a follow-up PR for https://github.com/pytorch/pytorch/pull/92641

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFxModels

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94011
Approved by: https://github.com/vkuzo
2023-02-07 08:23:56 +00:00
ffb3561caa [Docs] Add pointer to FlashAttention paper (#94253)
As discussed with @drisspg, we're adding pointers to the docs for MHA and Transformers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94253
Approved by: https://github.com/drisspg, https://github.com/malfet
2023-02-07 08:05:10 +00:00
f92348e13d Clean up mentions of removed torch/csrc/generic/*.cpp (#94107)
Summary: The dir was removed in https://github.com/pytorch/pytorch/pull/82373.

Test Plan: Sandcastle + GitHub CI.

Differential Revision: D43016100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94107
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/ZainRizvi
2023-02-07 07:17:16 +00:00
bc8a378333 [MPS] Unregister put_() op due to lack of implementation (#94231)
Currently, `put_()` is not implemented on the MPS backend, so this patch unregisters it and inserts it into the blocklist of TestConsistency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94231
Approved by: https://github.com/kulinseth
2023-02-07 06:54:15 +00:00
bc6d54f6d8 [FSDP][optim_state_dict] Let optim_state_dict ignore the non-FSDP managed parameters that do not reside on the rank (#94129)
When FSDP is used with other parallelism (e.g., TorchRec), some parameters that are not managed by FSDP may not reside on all the ranks (TorchRec uses model parallelism). When `use_orig_params=True`, FSDP will synchronize the FQNs among ranks. As a result, a rank may get FQNs that it does not actually own. If an FQN belongs to a TorchRec-managed parameter, FSDP has to ignore the parameter state; otherwise FSDP does not know how to store the state.

This PR adds the logic to ignore the parameters that are not managed by FSDP and are not on the rank.

Differential Revision: [D42982778](https://our.internmc.facebook.com/intern/diff/D42982778/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94129
Approved by: https://github.com/rohan-varma
2023-02-07 06:29:28 +00:00
f04106f1c2 [FSDP][state_dict] Fix incorrect valid_data_size for local_state_dict when some ranks have zero data. (#94109)
When using `torch.chunk` to split the `flat_param`, some ranks may have zero data, and `local_state_dict` does not handle this case correctly -- `local_state_dict` won't resize the local tensor to an empty one. This PR fixes the issue.
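For context, a small illustration of how `torch.chunk` can leave trailing ranks with nothing (illustrative numbers):

```python
import torch

flat_param = torch.arange(10)
print([c.numel() for c in torch.chunk(flat_param, 4)])  # [3, 3, 3, 1]
print([c.numel() for c in torch.chunk(flat_param, 6)])  # [2, 2, 2, 2, 2] -- only 5 chunks,
# so with 6 ranks the last rank ends up holding an empty (numel == 0) shard.
```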

Differential Revision: [D43004643](https://our.internmc.facebook.com/intern/diff/D43004643/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94109
Approved by: https://github.com/zhaojuanmao
2023-02-07 06:20:40 +00:00
605b661805 FakeTensor should constant propagate through ops that allow numbers as scalars (#94145)
Fixes #92655

Thanks @eellison for the code change suggestion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94145
Approved by: https://github.com/eellison
2023-02-07 06:20:35 +00:00
579ae64d81 [mobile] List all missing ops at once (#94205)
List all missing ops rather than terminating early.

Test on device
Logcat lists all operators:
```
12-06 00:23:36.523  8299  8299 F DEBUG   : Abort message: 'terminating with uncaught exception of type c10::Error: Following ops cannot be found: [aten::max_pool2d, aten::conv2d]. Please check if the operator library is included in the build. If built with selected ops, check if these ops are in the list. If you are a Meta employee, please see fburl.com/missing_ops for a fix. Or post it in https://discuss.pytorch.org/c/mobile/ ()
12-06 00:23:36.523  8299  8299 F DEBUG   : Exception raised from initialize_operators at xplat/caffe2/torch/csrc/jit/mobile/function.cpp:89 (most recent call first):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94205
Approved by: https://github.com/JacobSzwejbka
2023-02-07 05:45:57 +00:00
4b0e2e2cc6 Use official NVML Python bindings (#93925)
Use the official NVML Python binding package [`nvidia-ml-py`](https://pypi.org/project/nvidia-ml-py), which is maintained by the NVIDIA NVML team.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93925
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/ptrblck
2023-02-07 05:27:36 +00:00
1063394898 Revert "Add fabi-version=11 to ensure compatibility between gcc7 and gcc9 binaries for _GLIBCXX_USE_CXX11_ABI=1 (#93835)"
This reverts commit b562be793a7f9fa8923b09367c320b1c378f6d25.

Reverted https://github.com/pytorch/pytorch/pull/93835 on behalf of https://github.com/huydhn due to This breaks XLA build b562be793a
2023-02-07 04:49:06 +00:00
f1c435d7b4 [vision hash update] update the pinned vision hash (#94241)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94241
Approved by: https://github.com/pytorchbot
2023-02-07 04:40:02 +00:00
b562be793a Add fabi-version=11 to ensure compatibility between gcc7 and gcc9 binaries for _GLIBCXX_USE_CXX11_ABI=1 (#93835)
Fixes https://github.com/pytorch/pytorch/pull/92550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93835
Approved by: https://github.com/malfet
2023-02-07 03:05:39 +00:00
ca74105377 [MPS] Add scalar params to the softplus key. (#94256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94256
Approved by: https://github.com/razarmehr, https://github.com/malfet
2023-02-07 03:04:53 +00:00
9358726a06 [MPS] Handle empty input in layer norm (#94212)
Handle empty input in layer norm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94212
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-02-07 02:55:48 +00:00
d493bc8a76 [MPS] Return input in addcmul/div if value is zero (#94214)
Also remove the unnecessary resize (structured op)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94214
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-02-07 02:38:09 +00:00
fa2b99f402 [MPS] Fix the crash in nan_to_num() with Float16 data type (#94220)
This PR will prevent a crash in `test_output_match_nan_to_num_cpu_float16`, that would otherwise happen with the upcoming updates to MPS Framework in Ventura (in API `logicalANDWithPrimaryTensor()`). The fix is backwards compatible with Monterey too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94220
Approved by: https://github.com/malfet
2023-02-07 02:36:05 +00:00
f15ab8a7f2 AO migration: replace torch internal callsites (#94170)
Summary:

Do the following renames:
`torch.quantization` -> `torch.ao.quantization`
`torch.nn.quantized` -> `torch.ao.nn.quantized`
`torch.nn.quantizable` -> `torch.ao.nn.quantizable`
`torch.nn.qat` -> `torch.ao.nn.qat`
`torch.nn.intrinsic` -> `torch.ao.nn.intrinsic`

And then, do
`torch.ao.nn.quantized._reference` -> `torch.ao.nn.quantized.reference` to clean up the aftermath of https://github.com/pytorch/pytorch/pull/84974

Then, manually update `test/test_module_init.py` to fix hanging whitespace due to the replace.

Run this script to do the replacements: https://gist.github.com/vkuzo/7f7afebf8c31b9ba48306223e68a1c82

This is for https://github.com/pytorch/pytorch/issues/81667

Test plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94170
Approved by: https://github.com/jerryzh168
2023-02-07 02:32:23 +00:00
a9f57db607 AO migration: migrate .rst files to new locations (#94211)
Summary:

Migrates the PyTorch documentation to point to the new locations
of AO code.  Context: https://github.com/pytorch/pytorch/issues/81667

Process:
1. run https://gist.github.com/vkuzo/c38d4ba201604579d7d316ec4a4692e7 for automated replacement
2. manually fix the doc build errors (by removing the module declarations which are now duplicate)

Test plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94211
Approved by: https://github.com/jerryzh168
2023-02-07 02:32:23 +00:00
368e364c19 [MPS] Fix gradient issues with NLL and Smooth_L1 loss ops (#94226)
- Fix correctness issues with nll_loss_backward(), smooth_l1_loss_backward() and cross_entropy_backward() by taking grad_output into account when computing those loss ops
- Add numel()==0 check to prevent crashes
- Clean up and formatting
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94226
Approved by: https://github.com/kulinseth
2023-02-07 01:54:18 +00:00
cyy
bf9be50bb8 Some more fixes (#94049)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94049
Approved by: https://github.com/Skylion007
2023-02-07 01:51:06 +00:00
53e4fe076a Revert "enable bf16 emb (#94163)"
This reverts commit f3bf46e801dec2637751224fd6e27fbf97453bc6.

Reverted https://github.com/pytorch/pytorch/pull/94163 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But I suspect that it causes flaky SIGSEGV failure for linux-bionic-py3.8-clang9 / test (crossref) job in trunk.  For example, 05397b1250
2023-02-07 00:32:22 +00:00
6ba041fcae Look up group["capturable"], not defaults["capturable"] in Adam(W) (#94149)
We could set different values in each `param_group` when calling the dunder init of `torch.optim` optimizers, as in e.g. https://github.com/pytorch/pytorch/issues/89987.

So check whether or not `capturable` is `True` among all the `param_group`s.
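A minimal sketch of the situation (illustrative values): `capturable` can be overridden per param group, so the optimizer must consult `group["capturable"]` rather than the constructor default.

```python
import torch

params_a = [torch.nn.Parameter(torch.randn(2))]
params_b = [torch.nn.Parameter(torch.randn(2))]
opt = torch.optim.Adam(
    [
        {"params": params_a, "capturable": True},  # per-group override
        {"params": params_b},                       # inherits the default (False)
    ],
    lr=1e-3,
)
# The check described above: look at every group, not just the defaults.
any_capturable = any(group["capturable"] for group in opt.param_groups)
```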
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94149
Approved by: https://github.com/albanD
2023-02-07 00:24:35 +00:00
0dfc3e1340 Cleanup all leftover processes in MacOS pet runner (#94127)
Despite my initial attempts to clean up the MacOS runner as best as I could (https://github.com/pytorch/test-infra/pull/2100, https://github.com/pytorch/test-infra/pull/2102), the runner in question `i-09df3754ea622ad6b` (yes, the same one) still had its free space gradually dropping from 10GB (after cleaning conda and pip packages a few days ago) to only 5.2GB today: 4207d3c330

I had a gotcha moment after logging into the runner: the direct root cause was right before my eyes. I had forgotten to look at the processes running there:

```
  501  7008     1   0 13Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3912838018 --no-capture-output python3 -m tools.stats.monitor
  501 30351 30348   0 18Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3953492510 --no-capture-output python3 -m tools.stats.monitor
  501 36134 36131   0 19Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3956679232 --no-capture-output python3 -m tools.stats.monitor
  501 36579 36576   0 Mon11PM ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4048875121 --no-capture-output python3 -m tools.stats.monitor
  501 37096 37093   0 20Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3971130804 --no-capture-output python3 -m tools.stats.monitor
  501 62770 62767   0 27Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4025485821 --no-capture-output python3 -m tools.stats.monitor
  501 82293 82290   0 20Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_3969944513 --no-capture-output python3 -m tools.stats.monitor
  501 95762 95759   0 26Jan23 ttys001    0:00.11 /Users/ec2-user/runner/_work/_temp/miniconda/bin/python /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_4012836881 --no-capture-output python3 -m tools.stats.monitor

```

There were many leftover `tools.stats.monitor` processes there. After pkill-ing them all, an extra 45GB of free space was immediately freed up. The same situation could be seen on other MacOS pet runners too, i.e. `i-026bd028e886eed73`.

At the moment, it's unclear to me what edge case could cause this, as the step to stop the monitoring script should always be executed; maybe it received an invalid PID somehow. However, the safety-net, catch-all solution is to clean up all leftover processes on the MacOS pet runner before running the workflow (similar to what is done on Windows in https://github.com/pytorch/pytorch/pull/93914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94127
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
2023-02-07 00:15:31 +00:00
a595d06c12 [inductor] Avoid re-computing mean in lowering for aten.var_mean (#94139)
The current lowering results in the mean being computed twice. In the following
snippet, both `tmp1` and `tmp8` are the sum of `in_ptr0`:
```python
def triton_(in_out_ptr0, in_out_ptr1, in_ptr0, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr):
    # ...
    _tmp1 = tl.zeros([XBLOCK, RBLOCK], tl.float32) + 0
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r0 = rindex
        tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_last')
        _tmp1 = tl.where(rmask, _tmp1 + tmp0, _tmp1)
    tmp1 = tl.sum(_tmp1, 1)[:, None]
    _tmp7 = tl.zeros([XBLOCK, RBLOCK], tl.float32) + 0
    _tmp8 = tl.zeros([XBLOCK, RBLOCK], tl.float32) + 0
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r0 = rindex
        tmp2 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_last')
        tmp3 = 100.0
        tmp4 = tmp1 / tmp3
        tmp5 = tmp2 - tmp4
        tmp6 = tmp5 * tmp5
        _tmp7 = tl.where(rmask, _tmp7 + tmp6, _tmp7)
        _tmp8 = tl.where(rmask, _tmp8 + tmp2, _tmp8)
    tmp7 = tl.sum(_tmp7, 1)[:, None]
    tmp8 = tl.sum(_tmp8, 1)[:, None]
    # ...
```

After this change, the mean is computed only once:
```python
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r0 = rindex
        tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_last')
        _tmp1 = tl.where(rmask, _tmp1 + tmp0, _tmp1)
    tmp1 = tl.sum(_tmp1, 1)[:, None]
    tmp2 = 100.0
    tmp3 = tmp1 / tmp2
    tl.store(in_out_ptr0 + (0 + tl.zeros([XBLOCK, 1], tl.int32)), tmp3, None)
    _tmp7 = tl.zeros([XBLOCK, RBLOCK], tl.float32) + 0
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r0 = rindex
        tmp4 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_last')
        tmp5 = tmp4 - tmp3
        tmp6 = tmp5 * tmp5
        _tmp7 = tl.where(rmask, _tmp7 + tmp6, _tmp7)
    tmp7 = tl.sum(_tmp7, 1)[:, None]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94139
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-02-06 22:34:16 +00:00
719f78d311 [inductor] Count bytes can't read from buffers that are never written (#94142)
If a buffer is never materialized, it follows that it will never be read.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94142
Approved by: https://github.com/jansel
2023-02-06 22:34:16 +00:00
43f6ed4abd Extend torch-triton conda to 3.11 (#93117)
Also drop 3.7 from both builds and add proper names to the steps.
Add `pytorch-nightly` for `conda` builds to test the installation against `pytorch` from the nightly channel, as well as to get the [`filelock`](https://anaconda.org/pytorch-nightly/filelock) dependency for 3.11.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93117
Approved by: https://github.com/atalman
2023-02-06 22:14:57 +00:00
cyy
3c6bc58f63 use C10_API in libc10.so (#94171)
MSVC emits several C4273 warnings when compiling c10. I think the offending files should use C10_API instead of TORCH_API. If the tests pass, the changes should be safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94171
Approved by: https://github.com/Skylion007
2023-02-06 20:16:22 +00:00
a07d1291cf Re-enable compilation tests (#92333)
As CUDA-11.5 is no longer supported, just remove the check

Fixes https://github.com/pytorch/pytorch/issues/69460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92333
Approved by: https://github.com/atalman
2023-02-06 20:06:12 +00:00
180adf8c18 Fix bug in generic_list_compare (#94156)
https://github.com/pytorch/pytorch/pull/94054 introduced a bug in list
comparisons other than `==`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94156
Approved by: https://github.com/voznesenskym
2023-02-06 19:50:04 +00:00
fdebc06242 Point to scatter_reduce for reduce argument in scatter_ docs (#94081)
Fix in response to https://github.com/pytorch/pytorch/issues/22378#issuecomment-1411636451
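A minimal sketch of the redirection (illustrative shapes): the `reduce` flag on `scatter_` has an equivalent `scatter_reduce_` call.

```python
import torch

src = torch.ones(1, 5)
index = torch.tensor([[0, 1, 2, 0, 0]])
base = torch.zeros(3, 5)

a = base.clone().scatter_(0, index, src, reduce="add")         # deprecated path
b = base.clone().scatter_reduce_(0, index, src, reduce="sum")  # recommended path
assert torch.equal(a, b)
```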

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94081
Approved by: https://github.com/cpuhrsch
2023-02-06 19:26:21 +00:00
05397b1250 Make linter quick-checks setup steps retryable (#94199)
We've been seeing linter failures when the `apt-get install doxygen` command fails to install due to network errors, and the workflow doesn't get retried since it's in a non-retryable step

This PR moves it to a retryable step

It also marks a deterministic step as nonretryable, since retrying that one will never change the output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94199
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-02-06 18:44:41 +00:00
496c0a207b Make segment_reduce properly private. (#93166)
I am attempting not to change the aten function to reduce the amount of BC issues on the torchscript side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93166
Approved by: https://github.com/ngimel
2023-02-06 18:32:23 +00:00
9b3277c095 Make sure to properly pull the right submodule in BC test (#94182)
To unblock https://github.com/pytorch/pytorch/pull/93219
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94182
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/Skylion007
2023-02-06 18:03:35 +00:00
0444b8f560 Revert "Support neg calls to dyn shapes (#94068)"
This reverts commit 9350bcf6ae9d646389a0a4345c48275d4f9e4d1a.

Reverted https://github.com/pytorch/pytorch/pull/94068 on behalf of https://github.com/malfet due to This broke hugging_face shard, see https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=inductor_huggin
2023-02-06 17:50:10 +00:00
9b2e7d3b4f [Inductor] Performance smoke test - hf bert performance increased (#94088)
Therefore bumping the threshold up from 1.185 to 1.200 to better detect regressions.
Each line below lists the log URL, date, model, and measured speedup:
https://ossci-raw-job-status.s3.amazonaws.com/log/11101705328	2023-02-03T23:05:19.5738026Z hf_Bert                            1.2122
https://ossci-raw-job-status.s3.amazonaws.com/log/11101331469	2023-02-03T22:54:18.0252738Z hf_Bert                            1.2129
https://ossci-raw-job-status.s3.amazonaws.com/log/11101288841	2023-02-03T22:52:17.6331332Z hf_Bert                            1.2189
https://ossci-raw-job-status.s3.amazonaws.com/log/11101190372	2023-02-03T22:50:28.6010460Z hf_Bert                            1.2117
https://ossci-raw-job-status.s3.amazonaws.com/log/11101101525	2023-02-03T22:27:18.5573576Z hf_Bert                            1.2088
https://ossci-raw-job-status.s3.amazonaws.com/log/11101034545	2023-02-03T22:24:33.8710157Z hf_Bert                            1.2229
https://ossci-raw-job-status.s3.amazonaws.com/log/11101004878	2023-02-03T22:22:38.0506379Z hf_Bert                            1.2074
https://ossci-raw-job-status.s3.amazonaws.com/log/11100834787	2023-02-03T22:12:34.9376779Z hf_Bert                            1.2142
https://ossci-raw-job-status.s3.amazonaws.com/log/11100413479	2023-02-03T21:47:55.7536822Z hf_Bert                            1.2112
https://ossci-raw-job-status.s3.amazonaws.com/log/11100372087	2023-02-03T21:46:19.6411599Z hf_Bert                            1.2175
https://ossci-raw-job-status.s3.amazonaws.com/log/11100291417	2023-02-03T21:41:01.3427726Z hf_Bert                            1.2068
https://ossci-raw-job-status.s3.amazonaws.com/log/11100137256	2023-02-03T21:32:14.4491714Z hf_Bert                            1.2089
https://ossci-raw-job-status.s3.amazonaws.com/log/11098980986	2023-02-03T20:30:13.4082966Z hf_Bert                            1.2109
https://ossci-raw-job-status.s3.amazonaws.com/log/11098634747	2023-02-03T20:12:57.4921305Z hf_Bert                            1.2169
https://ossci-raw-job-status.s3.amazonaws.com/log/11096295932	2023-02-03T18:58:55.1214750Z hf_Bert                            1.2196
https://ossci-raw-job-status.s3.amazonaws.com/log/11095904757	2023-02-03T18:49:48.4541355Z hf_Bert                            1.22
https://ossci-raw-job-status.s3.amazonaws.com/log/11095292402	2023-02-03T18:10:54.6924201Z hf_Bert                            1.2122
https://ossci-raw-job-status.s3.amazonaws.com/log/11095026691	2023-02-03T18:11:26.7384107Z hf_Bert                            1.2228
https://ossci-raw-job-status.s3.amazonaws.com/log/11094943489	2023-02-03T17:53:00.0989341Z hf_Bert                            1.2165
https://ossci-raw-job-status.s3.amazonaws.com/log/11093227145	2023-02-03T16:04:18.7935799Z hf_Bert                            1.2208
https://ossci-raw-job-status.s3.amazonaws.com/log/11092910912	2023-02-03T15:51:28.1977577Z hf_Bert                            1.2188
https://ossci-raw-job-status.s3.amazonaws.com/log/11091775528	2023-02-03T15:27:21.7984395Z hf_Bert                            1.2231
https://ossci-raw-job-status.s3.amazonaws.com/log/11091768252	2023-02-03T15:12:33.0339859Z hf_Bert                            1.2167
https://ossci-raw-job-status.s3.amazonaws.com/log/11091051563	2023-02-03T14:44:42.7011287Z hf_Bert                            1.2214
https://ossci-raw-job-status.s3.amazonaws.com/log/11088539227	2023-02-03T12:41:29.9098435Z hf_Bert                            1.2192
https://ossci-raw-job-status.s3.amazonaws.com/log/11088428613	2023-02-03T12:35:38.4674850Z hf_Bert                            1.2108
https://ossci-raw-job-status.s3.amazonaws.com/log/11088405279	2023-02-03T12:34:54.0870617Z hf_Bert                            1.2197
https://ossci-raw-job-status.s3.amazonaws.com/log/11087037337	2023-02-03T12:06:58.2426787Z hf_Bert                            1.2174
https://ossci-raw-job-status.s3.amazonaws.com/log/11085381881	2023-02-03T10:19:20.8764019Z hf_Bert                            1.2189
https://ossci-raw-job-status.s3.amazonaws.com/log/11085190037	2023-02-03T10:14:41.5234245Z hf_Bert                            1.2046
https://ossci-raw-job-status.s3.amazonaws.com/log/11085016390	2023-02-03T09:50:59.7484273Z hf_Bert                            1.2155
https://ossci-raw-job-status.s3.amazonaws.com/log/11084948754	2023-02-03T09:47:15.7358069Z hf_Bert                            1.2083
https://ossci-raw-job-status.s3.amazonaws.com/log/11084675155	2023-02-03T09:42:35.6628268Z hf_Bert                            1.2126
https://ossci-raw-job-status.s3.amazonaws.com/log/11081270865	2023-02-03T06:05:22.1828269Z hf_Bert                            1.2083
https://ossci-raw-job-status.s3.amazonaws.com/log/11081252914	2023-02-03T05:43:59.0680872Z hf_Bert                            1.2097
https://ossci-raw-job-status.s3.amazonaws.com/log/11081252670	2023-02-03T05:44:17.0945428Z hf_Bert                            1.2143
https://ossci-raw-job-status.s3.amazonaws.com/log/11081244430	2023-02-03T05:43:43.6811750Z hf_Bert                            1.2204
https://ossci-raw-job-status.s3.amazonaws.com/log/11081191493	2023-02-03T05:38:43.7833293Z hf_Bert                            1.2079
https://ossci-raw-job-status.s3.amazonaws.com/log/11081191168	2023-02-03T05:38:21.1397044Z hf_Bert                            1.2067
https://ossci-raw-job-status.s3.amazonaws.com/log/11081189846	2023-02-03T05:38:53.5914557Z hf_Bert                            1.2073
https://ossci-raw-job-status.s3.amazonaws.com/log/11080883297	2023-02-03T05:13:25.0077772Z hf_Bert                            1.2105
https://ossci-raw-job-status.s3.amazonaws.com/log/11080456108	2023-02-03T04:34:34.0934838Z hf_Bert                            1.204
https://ossci-raw-job-status.s3.amazonaws.com/log/11079957300	2023-02-03T03:53:18.9091026Z hf_Bert                            1.207
https://ossci-raw-job-status.s3.amazonaws.com/log/11078579407	2023-02-03T02:03:11.2254812Z hf_Bert                            1.2049
https://ossci-raw-job-status.s3.amazonaws.com/log/11078204621	2023-02-03T01:58:39.0887941Z hf_Bert                            1.2214
https://ossci-raw-job-status.s3.amazonaws.com/log/11078126527	2023-02-03T01:38:20.2183225Z hf_Bert                            1.2061
https://ossci-raw-job-status.s3.amazonaws.com/log/11077409013	2023-02-03T00:48:51.8981496Z hf_Bert                            1.2086
https://ossci-raw-job-status.s3.amazonaws.com/log/11077176061	2023-02-03T00:27:27.2594172Z hf_Bert                            1.2077
https://ossci-raw-job-status.s3.amazonaws.com/log/11077075809	2023-02-03T00:21:54.4916449Z hf_Bert                            1.2103
https://ossci-raw-job-status.s3.amazonaws.com/log/11076629886	2023-02-02T23:50:38.3512367Z hf_Bert                            1.2191
https://ossci-raw-job-status.s3.amazonaws.com/log/11076577074	2023-02-02T23:46:06.5987589Z hf_Bert                            1.2061
https://ossci-raw-job-status.s3.amazonaws.com/log/11076403972	2023-02-02T23:35:49.7931367Z hf_Bert                            1.2088
https://ossci-raw-job-status.s3.amazonaws.com/log/11076234469	2023-02-02T23:25:55.7300688Z hf_Bert                            1.2099
https://ossci-raw-job-status.s3.amazonaws.com/log/11075752070	2023-02-02T22:57:25.4280216Z hf_Bert                            1.2048
https://ossci-raw-job-status.s3.amazonaws.com/log/11074434992	2023-02-02T22:10:58.4127805Z hf_Bert                            1.2084
https://ossci-raw-job-status.s3.amazonaws.com/log/11074370082	2023-02-02T22:10:06.8153498Z hf_Bert                            1.2075
https://ossci-raw-job-status.s3.amazonaws.com/log/11073914614	2023-02-02T21:25:53.3262334Z hf_Bert                            1.2058
https://ossci-raw-job-status.s3.amazonaws.com/log/11073616418	2023-02-02T21:12:03.0024412Z hf_Bert                            1.2053
https://ossci-raw-job-status.s3.amazonaws.com/log/11072632121	2023-02-02T20:25:37.5689220Z hf_Bert                            1.2082
https://ossci-raw-job-status.s3.amazonaws.com/log/11072091471	2023-02-02T20:00:08.5175281Z hf_Bert                            1.2079
https://ossci-raw-job-status.s3.amazonaws.com/log/11069395867	2023-02-02T18:29:04.6481423Z hf_Bert                            1.2071
https://ossci-raw-job-status.s3.amazonaws.com/log/11069169921	2023-02-02T18:18:36.5701242Z hf_Bert                            1.2036
https://ossci-raw-job-status.s3.amazonaws.com/log/11069070631	2023-02-02T18:15:32.2345859Z hf_Bert                            1.2055
https://ossci-raw-job-status.s3.amazonaws.com/log/11067153829	2023-02-02T16:38:27.4201129Z hf_Bert                            1.2133
https://ossci-raw-job-status.s3.amazonaws.com/log/11066885021	2023-02-02T16:28:44.4489971Z hf_Bert                            1.2043

The above is the result of running a Rockset query that returns links to the logs, then running wget on the logs and grepping for "Z hf_Bert".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94088
Approved by: https://github.com/desertfire
2023-02-06 17:48:09 +00:00
d2b82feb41 Don't compare ids of temporary python objects (#94097)
Since `.data` creates a new Tensor and thus a new python object, this check compares the ids of temporary objects and thus always succeeds given the current behavior of python's allocator:
```
>>> import torch
>>> print(id(torch.rand(2)) == id(torch.rand(3)))
True
```

I change it here to make sure they look at the same memory.
If you want to check that they are the same python object, I can change it to `is`. Let me know!
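A minimal sketch of the distinction (illustrative): compare the underlying memory, not the ids of temporary python wrappers returned by `.data`.

```python
import torch

t = torch.rand(3)
view = t.data                              # a fresh python wrapper over the same storage
assert view.data_ptr() == t.data_ptr()     # robust: same underlying memory
# id(t.data) == id(<something else>) can "pass" by accident, because the
# temporary wrapper is freed and its id immediately reused.
```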
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94097
Approved by: https://github.com/malfet
2023-02-06 16:30:20 +00:00
25a6e0fd79 Fix serialization (#94096)
We now always have a `__getstate__`/`__setstate__` pair AND the `__dict__` attribute is lazily initialized. So we need to support that in our serialization code.
A quick audit of the rest suggests the new `__getstate__` isn't too problematic. But maybe the test suite will bring more things to light.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94096
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-06 16:30:20 +00:00
db011e11ea Skip sebotnet33ts_256 on CI (#94067)
Summary: This fails randomly on CI, and it has been happening more frequently lately.
Skip it for now; an issue has been filed at https://github.com/pytorch/pytorch/issues/94066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94067
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-06 14:58:54 +00:00
16387bee4a [DCP] Fix test_file_system_checkpoint.py and test_file_system_checkpoint_cpu.py (#94069)
This fixes a typo in an assert that would always evaluate to True and adds a missing import.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94069
Approved by: https://github.com/kumpera
2023-02-06 13:56:07 +00:00
819990f595 [decomp] Decompose std/std_mean into aten.var/var_mean (#94072)
These are currently decomposed into prims.var which is less useful for inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94072
Approved by: https://github.com/lezcano
2023-02-06 10:22:07 +00:00
26cba842ad Optimize ConvTransposed2D with mkldnn float32 and bfloat16 on CPU (#92530)
This PR optimizes `ConvTranspose2d` with oneDNN and adds channels-last support for it. The fallback path `slow_conv_transpose2d` also has channels-last support, so the memory-format propagation behavior stays the same with or without oneDNN.

Replacement of https://github.com/pytorch/pytorch/pull/77060, https://github.com/pytorch/pytorch/pull/70897 and https://github.com/pytorch/pytorch/pull/74023 which enables oneDNN for `ConvTranspose2d` and `ConvTranspose3d`

The following results were collected on a Skylake Xeon 8180 (dual socket, 28 cores per socket).
### single core channels last

configs | forward before/ms | forward after/ms | ratio | backward   before/ms | backward after/ms | ratio
-- | -- | -- | -- | -- | -- | --
input size: (32, 32, 100, 100), weight size: (32, 32, 3, 3) | 181.36 | 91.16 | 1.99 | 531.38 | 124.08 | 4.28
input size:   (32, 16, 200, 200), weight size: (16, 16, 3, 3) | 324.35 | 153.50 | 2.11 | 973.16 | 185.97 | 5.23
input size:   (32, 128, 100, 100), weight size: (128, 128, 3, 3) | 1086.82 | 671.52 | 1.62 | 3008.94 | 1453.33 | 2.07

### single core channels first

configs | forward before/ms | forward after/ms | ratio | backward   before/ms | backward after/ms | ratio
-- | -- | -- | -- | -- | -- | --
input size: (32, 32, 100, 100), weight size: (32, 32, 3, 3) | 138.10 | 5.94 | 23.23 | 37.97 | 11.25 | 3.38
input size:   (32, 16, 200, 200), weight size: (16, 16, 3, 3) | 236.43 | 8.75 | 27.03 | 87.77 | 18.58 | 4.72
input size:   (32, 128, 100, 100), weight size: (128, 128, 3, 3) | 484.39 | 37.69 | 12.85 | 185.40 | 90.57 | 2.05

### single socket channels last

configs | forward before/ms | forward after/ms | ratio | backward   before/ms | backward after/ms | ratio
-- | -- | -- | -- | -- | -- | --
input size: (32, 32, 100, 100), weight size: (32, 32, 3, 3) | 138.10 | 5.94 | 23.23 | 37.97 | 11.25 | 3.38
input size:   (32, 16, 200, 200), weight size: (16, 16, 3, 3) | 236.43 | 8.75 | 27.03 | 87.77 | 18.58 | 4.72
input size:   (32, 128, 100, 100), weight size: (128, 128, 3, 3) | 484.39 | 37.69 | 12.85 | 185.40 | 90.57 | 2.0

### single socket channels first

configs | forward before/ms | forward after/ms | ratio | backward   before/ms | backward after/ms | ratio
-- | -- | -- | -- | -- | -- | --
input size: (32, 32, 100,   100), weight size: (32, 32, 3, 3) | 132.56 | 7.19 | 18.43 | 31.43 | 11.20 | 2.81
input size:   (32, 16, 200, 200), weight size: (16, 16, 3, 3) | 227.94 | 13.33 | 17.11 | 63.00 | 23.41 | 2.69
input size:   (32, 128, 100, 100), weight size: (128, 128, 3, 3) | 473.68 | 52.79 | 8.97 | 150.40 | 87.33 | 1.72

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92530
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-02-06 10:11:25 +00:00
f3bf46e801 enable bf16 emb (#94163)
Merge https://github.com/pytorch/pytorch/pull/89199 and https://github.com/pytorch/pytorch/pull/91949 into one PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94163
Approved by: https://github.com/jianyuh, https://github.com/malfet, https://github.com/jgong5
2023-02-06 07:11:40 +00:00
ea4cda5268 fix inductor clamp decomp to correctly type promote and avoid wrapping scalars (#94157)

Fixes #93784, #93225
Ideally, the clamp decomp should live in refs or _decomp, but that would reverse our current decomposition flow of `clamp_min` -> `clamp` -> lowering, so to keep changes to a minimum, I'm leaving it in inductor for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94157
Approved by: https://github.com/ezyang
2023-02-06 05:36:19 +00:00
9350bcf6ae Support neg calls to dyn shapes (#94068)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94068
Approved by: https://github.com/jansel
2023-02-05 21:38:16 +00:00
7b6e948812 Add missing move to torch_dispatch_mode.h (#94154)
Removes an unnecessary copy from torch_dispatch_mode.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94154
Approved by: https://github.com/ezyang
2023-02-05 20:43:30 +00:00
10a1efb49f [MPS] Fix cumsum for negative indexes (#94119)
Use `wrap_dim` to get the dim in range, or raise an IndexError otherwise.

Add a test for this behavior.

Addresses feedback raised in https://github.com/pytorch/pytorch/pull/88319#issuecomment-1403541180
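A rough sketch of the dim-wrapping behavior being fixed (the `wrap_dim` helper below is illustrative, not the MPS implementation):
```python
import torch

def wrap_dim(dim, ndim):
    # Map a possibly-negative dim into [0, ndim), or raise IndexError.
    if not -ndim <= dim < ndim:
        raise IndexError(f"Dimension out of range (expected [{-ndim}, {ndim - 1}], got {dim})")
    return dim + ndim if dim < 0 else dim

x = torch.arange(6.0).reshape(2, 3)
assert torch.equal(torch.cumsum(x, dim=-1), torch.cumsum(x, dim=wrap_dim(-1, x.dim())))
```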

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94119
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2023-02-05 18:21:29 +00:00
60a3b7425d Small refactor of shape guards to allow for 1:1 code_parts (#93894)
By moving guard string assembly into dynamo's default behavior and letting code_parts do the work, we can have much better shape guard failures.

Before this fix, the guard failure in the test would look like:

```
'x.size()[1] == x.size()[0] and x.stride()[0] == x.[264 chars]!= 1' != 'x.size()[0] < 3'
- x.size()[1] == x.size()[0] and x.stride()[0] == x.size()[0] and x.stride()[1] == 1 and x.storage_offset() == 0 and y.size()[0] == x.size()[0] and y.size()[1] == x.size()[0] and y.stride()[0] == x.size()[0] and y.stride()[1] == 1 and y.storage_offset() == 0 and x.size()[0] < 3 and x.size()[0] != 0 and x.size()[0] != 1
+ x.size()[0] < 3
```
now it is
```
"x.size()[0] < 3"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93894
Approved by: https://github.com/ezyang
2023-02-05 09:24:12 +00:00
8a88852d5f [MPS] Fix index_select for empty input (#94117)
Also add test for this case to `test_index_select`
Fixes https://github.com/pytorch/pytorch/issues/93877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94117
Approved by: https://github.com/orionr
2023-02-05 05:45:57 +00:00
8ecda19607 fix upsampling decompositions to have integer output sizes (#94123)
This allows unet to be compiled with symbolic shapes (but it still fails accuracy, lol).
Output sizes are always integers, so there's no need to pretend they are ever float. Recomputing scale factors still uses nominally-float sizes converted to int; we might as well do the conversion from the start.
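As a rough illustration of the point (an assumption-level sketch, not the decomposition itself), the output length for a given scale factor is already integral:
```python
import math

def upsample_output_size(input_size: int, scale_factor: float) -> int:
    # The output length is floor(input_size * scale_factor), an integer.
    return int(math.floor(input_size * scale_factor))

print(upsample_output_size(17, 2.0))  # 34
print(upsample_output_size(10, 1.5))  # 15
```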

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94123
Approved by: https://github.com/ezyang
2023-02-05 04:56:07 +00:00
2362b5fca3 [Dynamo] Put torch.cuda.stream into Dynamo FX graph (#93808)
Fixes #92804

This PR only handles ```torch.cuda.stream```. If this is a right direction, I'll add support for several relevant functions, e.g, ```torch.cuda.current_stream().wait_stream(s)```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93808
Approved by: https://github.com/jansel
2023-02-05 04:52:43 +00:00
25c0737adc dont graph break on list[SymInt] comparisons (#94054)
Reland of https://github.com/pytorch/pytorch/pull/92617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94054
Approved by: https://github.com/jansel
2023-02-05 04:47:12 +00:00
1d53123f44 Report graph breaks separately from graph count (#94143)
graph break != graph count - 1.  Suppose you have a nested
inline function call f1 to f2 to f3.  A graph break in f3
results in six graphs: f1 before, f2 before, f3 before, f3 after,
f2 after, f1 after.
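A tiny repro of the scenario described above (a hedged sketch; `print` is used here only because it is a call that is known to force a graph break):
```python
import torch

def f3(x):
    x = x + 1
    print("break")   # forces a graph break inside the innermost inlined frame
    return x + 2

def f2(x):
    return f3(x * 2) * 2

def f1(x):
    return f2(x - 1) - 1

torch.compile(f1)(torch.randn(4))
# One break in f3 splits every inlined frame into "before" and "after" pieces,
# so the graph count is larger than (number of breaks + 1).
```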

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94143
Approved by: https://github.com/voznesenskym
2023-02-05 04:03:12 +00:00
a2db70b3c7 Add graphs/ops to parse_logs.py (#94138)
Also remove broken stats parsing logic.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94138
Approved by: https://github.com/voznesenskym
2023-02-05 04:03:12 +00:00
9895c19a7a To vectorize long datatype as mask index (#91076)
In this PR, we record the current fx node being executed to cache additional information and simplify the vectorization checker. In addition, we support `masked` in this PR by simplifying it to `mask_load`, which enables `max_pool2d`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91076
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-05 03:36:22 +00:00
834e8f0464 Hack SymInt.__iadd__ to be working. (#94136)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94136
Approved by: https://github.com/Skylion007
2023-02-04 21:17:36 +00:00
c1da35af5e Update dynamic benchmark skips (#94114)
Data from https://github.com/pytorch/pytorch/pull/94134

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94114
Approved by: https://github.com/SherlockNoMad
2023-02-04 20:36:51 +00:00
3693039bb7 perf: fix missing noexcepts on minpybind in functorch (#94135)
Noticed this performance bug in functorch. We got a pretty big perf improvement in pybind11 by explicitly adding noexcept annotations; see https://quuxplusone.github.io/blog/2022/08/26/vector-pessimization/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94135
Approved by: https://github.com/ezyang
2023-02-04 20:07:15 +00:00
f54fd6fb28 [c10d] Update get_backend() in exception_handler (#94063)
Currently, get_backend() and get_world_size() always return the default process group's values if no group argument is passed. This fixes the issue.
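A small sketch of the intent (assumes a process group has already been initialized elsewhere; the subgroup here is purely illustrative):
```python
import torch.distributed as dist

# Assumes dist.init_process_group(...) was called beforehand.
subgroup = dist.new_group(ranks=[0, 1])

# Pass the group explicitly; without it, these report the default group's values.
print(dist.get_backend(group=subgroup))
print(dist.get_world_size(group=subgroup))
```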

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94063
Approved by: https://github.com/H-Huang
2023-02-04 19:39:36 +00:00
8c26ed5f5e Add lowerings for all symbolic shape operators (#94121)
In particular, this fixes the missing negative problem.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94121
Approved by: https://github.com/ngimel
2023-02-04 12:57:22 +00:00
cyy
afd7b581aa Simplify OpenMP detection in CMake (#91576)
We greatly simplify the handling of OpenMP in CMake by using the caffe2::openmp target throughout. We follow the old behavior by defaulting to the MKL OMP library and detecting OMP flags otherwise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91576
Approved by: https://github.com/malfet
2023-02-04 11:50:06 +00:00
d4a93eadee tools: Add lint for CONSTEXPR (#94089)
Adds a lint for CONSTEXPR so that we prefer the macro in CUDA files, which supports VS2017 compilation on Windows internally (Meta).

Follow up to https://github.com/pytorch/pytorch/pull/94091

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94089
Approved by: https://github.com/malfet
2023-02-04 11:25:30 +00:00
996cc1c0d0 Fix Win+CUDA builds using VS2017 (#94091)
Summary:
Followup after https://github.com/pytorch/pytorch/pull/93267
Generated by running:
```
for i in *.cu; do sed -i -e "s/constexpr char/CONSTEXPR_EXCEPT_WIN_CUDA char/" $i; done
```

Otherwise, attempts to compile using VS-15.9 results in:
```
D:\pytorch\aten\src\aten\native\cuda\laguerre_polynomial_l.cu(17): fatal error C1001: An internal error has occurred in the compiler.
(compiler file 'msc1.cpp', line 1518)
 To work around this problem, try simplifying or changing the program near the locations listed above.
Please choose the Technical Support command on the Visual C++
 Help menu, or open the Technical Support help file for more information
Internal Compiler Error in D:\VC\Tools\MSVC\14.16.27023\bin\Hostx64\x64\cl.exe.  You will be prompted to send an error report to Microsoft later.
INTERNAL COMPILER ERROR in 'D:\VC\Tools\MSVC\14.16.27023\bin\Hostx64\x64\cl.exe'
    Please choose the Technical Support command on the Visual C++
    Help menu, or open the Technical Support help file for more information

```

Test Plan: CI

Differential Revision: D43011140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94091
Approved by: https://github.com/seemethere
2023-02-04 08:22:49 +00:00
2064fa9f10 Clean-up removed TH from BUCK (#94022)
Differential Revision: D42981979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94022
Approved by: https://github.com/huydhn, https://github.com/izaitsevfb, https://github.com/malfet
2023-02-04 08:16:43 +00:00
7fb2ac2bd5 Revert "trymerge to ignore certain failures (#91134)"
This reverts commit 8b7bd5dffccf342cacae510d6c5a6ca2665770b7.

Reverted https://github.com/pytorch/pytorch/pull/91134 on behalf of https://github.com/seemethere due to Breaks internal `github-export-checks` see failure: https://fburl.com/sandcastle/ggqj29pz
2023-02-04 08:08:32 +00:00
170a3e0257 Enable Python dispatcher on inference-only aot_dispatch_base (#94118)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94118
Approved by: https://github.com/voznesenskym
2023-02-04 06:10:21 +00:00
4207d3c330 FusedAdam(W) should take OptState into account before unscaling grads (#94060)
The optimizers have to consult `OptState` before unscaling gradients because `GradScaler.unscale_` may have been called explicitly, e.g. for `clip_grad_norm_`, as mentioned in e52786f3d1/torch/cuda/amp/grad_scaler.py (L235-L266) and https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-unscaled-gradients

Related #90752
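For reference, a rough sketch of the usage pattern from the linked AMP notes (assumes a CUDA device; the model and hyperparameters are placeholders):
```python
import torch

model = torch.nn.Linear(8, 8).cuda()
optimizer = torch.optim.AdamW(model.parameters(), fused=True)
scaler = torch.cuda.amp.GradScaler()

optimizer.zero_grad(set_to_none=True)
with torch.autocast("cuda"):
    loss = model(torch.randn(4, 8, device="cuda")).sum()
scaler.scale(loss).backward()

scaler.unscale_(optimizer)                               # explicit unscale ...
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # ... so clipping sees real grads
scaler.step(optimizer)  # the fused optimizer must see that grads were already unscaled
scaler.update()
```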

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94060
Approved by: https://github.com/albanD
2023-02-04 05:20:13 +00:00
adde6fd25e [dynamo 3.11] update instruction sizes (#93984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93984
Approved by: https://github.com/jansel, https://github.com/albanD, https://github.com/malfet, https://github.com/mlazos
2023-02-04 04:09:24 +00:00
11de399447 [inductor] fix cpu implement of torch.neg (#94035)
Fixes #93380

Fix to maintain the data type after doing neg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94035
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-04 03:13:11 +00:00
cyy
1a32db15e7 Some performance fixes (#94034)
Applies some performance fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94034
Approved by: https://github.com/Skylion007
2023-02-04 02:17:48 +00:00
cyy
fa65ae8f56 cleanup unused include (#93359)
Using `include-what-you-use` tool to find out and remove some unused includes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93359
Approved by: https://github.com/malfet
2023-02-04 02:15:50 +00:00
cyy
27efdc5eed fix writable-strings warnings (#93246)
clang reports "ISO C++11 does not allow conversion from string
literal to 'char *'"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93246
Approved by: https://github.com/malfet
2023-02-04 02:11:15 +00:00
59a81b695a Fix flaky linter clang-tidy relative path (#94093)
There are some occurrences when clang-tidy linter fails flakily with the following error, which is very weird:

```
>>> Lint for FILE:
  Error (CLANGTIDY) command-failed
    Failed due to FileNotFoundError:
    [Errno 2] No such file or directory: '.lintbin/clang-tidy'
```

For examples,

* 0a93e6db5a
* 203b2cad3e

The binary is definitely there, as the log shows that it has been downloaded successfully from S3. Looking a bit closer, I notice that the linter uses `os.chdir` to jump around between the workspace and the build folder, and it also refers to the binary with the relative path `.lintbin/clang-tidy`, which doesn't exist in the latter. AFAIK, the current working directory is per process (https://stackoverflow.com/questions/16388400/what-is-a-thread-specific-os-chdir-and-mkdir-in-python), so I suspect there is a race here where one thread chdirs into build while another thread tries to lint another file. Hence the fix: use the absolute path to clang-tidy.
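The idea of the fix, as a minimal sketch (paths are illustrative):
```python
import os

# Resolve the relative binary path to an absolute path once, up front.
# os.chdir() affects the whole process, so a relative ".lintbin/clang-tidy"
# can stop resolving when another thread changes directory; an absolute path cannot.
binary = os.path.abspath(".lintbin/clang-tidy")
os.chdir("/tmp")
print(binary)  # still points at the original location regardless of the cwd
```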

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94093
Approved by: https://github.com/malfet
2023-02-04 02:05:38 +00:00
e071d72f3c Tag dynamo backends as debug/experimental (#93878)
Hides debug/experimental backends by default.

Before:
```
torch._dynamo.list_backends()
['aot_eager', 'aot_eager_decomp_partition', 'aot_torchxla_trace_once', 'aot_torchxla_trivial', 'aot_ts', 'aot_ts_nvfuser', 'cudagraphs', 'dynamo_accuracy_minifier_backend', 'dynamo_minifier_backend', 'eager', 'inductor', 'ipex', 'nvprims_aten', 'nvprims_nvfuser', 'onnxrt', 'tensorrt', 'torchxla_trace_once', 'torchxla_trivial', 'ts', 'tvm']
```

After:
```
torch._dynamo.list_backends()
['aot_ts_nvfuser', 'cudagraphs', 'inductor', 'ipex', 'nvprims_nvfuser', 'onnxrt', 'tensorrt', 'tvm']
```

Fixes https://github.com/pytorch/pytorch/issues/93733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93878
Approved by: https://github.com/voznesenskym
2023-02-04 00:50:51 +00:00
5c7f4534e9 [small] multithreaded-pg guard attr (#93883)
currently the test
```
pytest test/distributed/test_multi_threaded_pg.py -vs
```

has errors

```
Traceback (most recent call last):
  File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/private/home/howardhuang/.conda/envs/pytorch/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/private/home/howardhuang/pytorch-projects/pytorch/torch/testing/_internal/common_distributed.py", line 1029, in _run
    self._tls.precision = TestCase._precision
AttributeError: 'TestCollectivesWithBaseClass' object has no attribute '_tls'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93883
Approved by: https://github.com/awgu, https://github.com/wanchaol
2023-02-03 23:01:02 +00:00
6d597c532e [ROCm] Add diskspace check for rocm CI nodes (#93032)
Fixes #92822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93032
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-02-03 22:38:57 +00:00
ef156f9136 Enable retry support for MPS tests (#94070)
Here is an example d7c71a95b6 where an MPS test was flaky but not retried, so it failed. We probably want to support retries on MPS tests like the rest of the CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94070
Approved by: https://github.com/clee2000
2023-02-03 22:21:31 +00:00
3c79ea2607 Removes stray print (#94079)
Per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94079
Approved by: https://github.com/voznesenskym
2023-02-03 21:56:45 +00:00
dfac113cfc Remove torch/_dynamo/optimizations (#93871)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93871
Approved by: https://github.com/voznesenskym
2023-02-03 21:54:28 +00:00
5f4fec7459 Fix/refactor dynamo tvm backend (#93870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93870
Approved by: https://github.com/shingjan, https://github.com/desertfire
2023-02-03 21:48:31 +00:00
0a93e6db5a Fix/refactor dynamo ipex backend (#93863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93863
Approved by: https://github.com/desertfire
2023-02-03 21:42:27 +00:00
5197496799 Add a private API banner (#93996)
Add a banner that will appear on all pages where the last segment of the URL starts with an underscore "_".
Example pages:
* https://pytorch.org/docs/master/_dynamo.html
* https://pytorch.org/docs/master/_modules/torch/_jit_internal.html
Sample screenshots:
<img width="885" alt="Screenshot 2023-02-03 at 1 13 47 PM" src="https://user-images.githubusercontent.com/5317992/216711948-6ba35d38-da8f-4145-9580-bafc921a1df5.png">
<img width="871" alt="Screenshot 2023-02-03 at 1 12 51 PM" src="https://user-images.githubusercontent.com/5317992/216711951-877a760e-3449-4593-b81c-14bf3b9943da.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93996
Approved by: https://github.com/malfet, https://github.com/albanD
2023-02-03 21:40:15 +00:00
1c30268ff1 Update rockset version (#94005)
upgrading rockset to 1.0.3

The diff looks like it gets rid of the dependency on six, but I think python-dateutil still uses it and is better about downloading it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94005
Approved by: https://github.com/huydhn
2023-02-03 21:38:35 +00:00
5be57d51f9 Fix testing now that random.sample() arg must be a sequence (#94052)
This is only enforced in 3.11 but the change is not bad for other versions either (and this is test code so perf is not a concern).
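A small example of the behavior change (the set below is just an illustration):
```python
import random

population = {3, 1, 4, 1, 5, 9}

# On Python 3.11+, random.sample(population, 2) raises TypeError for a set.
# Converting to a sequence first works on every version:
picked = random.sample(sorted(population), 2)
print(picked)
```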
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94052
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-02-03 21:28:02 +00:00
8051f8a6ee Fix Storage destruction GC tracking (#94051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94051
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-02-03 21:28:02 +00:00
203b2cad3e Remove fx2trt/torch2trt backends (#93822)
These backends have been broken for some time.  I tried to get them
running again, but as far as I can tell they are not maintained.
Installing torch_tensorrt downgrades PyTorch to 1.12.  If I manually
bypass that downgrade, I get import errors from inside fx2trt.  Fixes that
re-add these are welcome, but it might make sense to move these wrappers
to the torch_tensorrt repo once PyTorch 2.0 support is added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93822
Approved by: https://github.com/frank-wei
2023-02-03 21:04:21 +00:00
5d709af59a Rename aot_cudagraphs to cudagraphs (#93821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93821
Approved by: https://github.com/ezyang
2023-02-03 21:01:27 +00:00
8b7bd5dffc trymerge to ignore certain failures (#91134)
Any failure that Dr. CI lists as "flaky" or "broken trunk" (i.e., anything other than a "new failure") gets marked as "ok to fail".

If there are only a small number of "ok to fail" jobs (currently at most 3), the merge can still continue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91134
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-02-03 20:56:39 +00:00
a5ff40032d Fix/refactor dynamo onnxrt backend (#93818)
Fixes https://github.com/pytorch/pytorch/issues/90352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93818
Approved by: https://github.com/voznesenskym
2023-02-03 20:48:02 +00:00
d9870d70c1 Exempt _foreach_norm from autograd_not_implemented_fallback check (#93995)
Fixes #93940
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93995
Approved by: https://github.com/ngimel, https://github.com/albanD
2023-02-03 19:45:46 +00:00
dc7bf1a7ea General reversible binary op support (e.g. __add__ / __radd__) in dynamo (#93271)
Generic support for reversible binary op pairs (e.g. `__add__` / `__radd__`) in dynamo.
Adds logic to flip args and try the reverse op when the forward op is unsupported.
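A hedged sketch of the general idea (names are illustrative and not Dynamo's actual internals): if the forward dunder is missing or returns `NotImplemented`, try the reflected one with the arguments swapped.
```python
def apply_binary_op(op_name, a, b):
    forward = getattr(type(a), f"__{op_name}__", None)
    if forward is not None:
        result = forward(a, b)
        if result is not NotImplemented:
            return result
    reverse = getattr(type(b), f"__r{op_name}__", None)
    if reverse is not None:
        result = reverse(b, a)
        if result is not NotImplemented:
            return result
    raise TypeError(f"unsupported operand types for {op_name}: {type(a)} and {type(b)}")

print(apply_binary_op("add", 2, 3.5))  # int.__add__ declines, float.__radd__ gives 5.5
```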

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93271
Approved by: https://github.com/voznesenskym, https://github.com/jansel, https://github.com/ezyang
2023-02-03 19:28:35 +00:00
e52786f3d1 Silence profiler error (#94013)
This is not 3.11-specific, but it is a lot more likely in 3.11.
You can find other reports of it failing in 3.8 at https://github.com/pytorch/pytorch/issues/64345 as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94013
Approved by: https://github.com/malfet
2023-02-03 17:33:47 +00:00
a0fc90b07f Add TorchData for regular cleanup of anaconda pytorch-nightly channel (#94014)
Fixes https://github.com/pytorch/test-infra/issues/1413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94014
Approved by: https://github.com/ejguan, https://github.com/malfet
2023-02-03 17:13:58 +00:00
3b7140d938 Add the new submission form (#94000)
Adding the new form for submitting topics on quarterly maintainers meetings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94000
Approved by: https://github.com/orionr
2023-02-03 16:46:30 +00:00
6650aac8ce move more operators to BatchRulesDecompositions (#93164)
Moving operators over to `BatchRulesDecompositions.cpp` to remove xfails. I noticed that composite-compliant does not mean inductor or vmap compliant, so I added more `isTensorSubclassLike` checks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93164
Approved by: https://github.com/lezcano, https://github.com/kshitij12345
2023-02-03 16:36:05 +00:00
6e1e212c39 [platform010] remove more ovr_config//runtime:platform009 usage (#93008)
Summary: WTTS

Test Plan: ci

Reviewed By: akrieger

Differential Revision: D42729966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93008
Approved by: https://github.com/kit1980
2023-02-03 16:32:04 +00:00
6c555b29a8 MHA optimizations (#93234)
Slight perf optimizations for regular MHA by reducing the number of kernels called

Before:
![image](https://user-images.githubusercontent.com/30204471/215349212-172c6364-9e3c-4fd1-92b6-8ddd9931613e.png)

After:
![image](https://user-images.githubusercontent.com/30204471/215349247-021dd9e6-f6ca-40a2-8de8-0805af001f69.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93234
Approved by: https://github.com/drisspg
2023-02-03 15:18:35 +00:00
162e3ca58e [fx] fix type promotion in binary_magic_impl (#91376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91376
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-03 15:06:40 +00:00
34bcbfbd6a [fx] throw exceptions on invalid input in FloorDiv (#93143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93143
Approved by: https://github.com/ezyang
2023-02-03 15:06:40 +00:00
ba614f3a32 [fx] test FloorDiv against Python impl (#93142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93142
Approved by: https://github.com/ezyang
2023-02-03 15:06:38 +00:00
e7c63b962b [fx] add SymPy assumptions to FloorDiv (#93185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93185
Approved by: https://github.com/ezyang
2023-02-03 15:06:36 +00:00
2481fc0df4 Add count to FakeTensorMode.__torch_dispatch__ (#93936)
Most calls to fake tensor never hit `FakeTensor.__torch_dispatch__`

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93936
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2023-02-03 14:21:11 +00:00
12f22655b1 Short circuit device property access on FakeTensor (#93946)
Before:

```
(/home/ezyang/local/a/pytorch-env) [ezyang@devgpu020.ftw1 ~/local/a/pytorch (ab0e3db0)]$ python benchmarks/dynamo/timm_models.py --accuracy --timing --backend aot_eager --dynamic-shapes --float32 --only hrnet_w18
cuda eval  hrnet_w18                           PASS
TIMING: entire_frame_compile:54.19504 backend_compile:33.86702
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:72549 | FakeTensorMode.__torch_dispatch__:115542 | ProxyTorchDispatchMode.__torch_dispatch__:3103
```

After

```
(/home/ezyang/local/a/pytorch-env) [ezyang@devgpu020.ftw1 ~/local/a/pytorch (ab0e3db0)]$ python benchmarks/dynamo/timm_models.py --accuracy --timing --backend aot_eager --dynamic-shapes --float32 --only hrnet_w18
cuda eval  hrnet_w18                           PASS
TIMING: entire_frame_compile:53.97591 backend_compile:33.60832
STATS: call_* op count: 1369 | FakeTensor.__torch_dispatch__:4995 | FakeTensorMode.__torch_dispatch__:89985 | ProxyTorchDispatchMode.__torch_dispatch__:3010
```

It doesn't really help end-to-end wall time all that much, but it does cut the number of calls to FakeTensor.__torch_dispatch__ by an order of magnitude, which hopefully has other positive effects.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93946
Approved by: https://github.com/eellison, https://github.com/albanD
2023-02-03 14:20:30 +00:00
77acb556e6 [primTorch] Rewrite nan_to_num ref in terms of aten functions (#93952)
This de-duplicates `_refs.nan_to_num` with the inductor decomposition
and simplifies it to not reimplement `isnan`, `isposinf` and `isneginf`.
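Roughly, the decomposition can be thought of along these lines (a sketch for floating-point inputs only, not the actual `_refs` code):
```python
import torch

def nan_to_num_sketch(x, nan=0.0, posinf=None, neginf=None):
    posinf = torch.finfo(x.dtype).max if posinf is None else posinf
    neginf = torch.finfo(x.dtype).min if neginf is None else neginf
    x = torch.where(torch.isnan(x), torch.full_like(x, nan), x)
    x = torch.where(torch.isposinf(x), torch.full_like(x, posinf), x)
    x = torch.where(torch.isneginf(x), torch.full_like(x, neginf), x)
    return x

t = torch.tensor([float("nan"), float("inf"), -float("inf"), 1.0])
assert torch.equal(nan_to_num_sketch(t), torch.nan_to_num(t))
```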

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93952
Approved by: https://github.com/lezcano
2023-02-03 13:51:37 +00:00
72385bbd03 [primTorch] Rewrite is{,pos,neg}inf refs in terms of aten functions (#93951)
`isposinf` and `isneginf` currently fall back in inductor. Here, I
enable the existing decompositions to work with inductor.

`isinf` can also be written with aten functions, however I don't add
it to inductor's decompositions because `isinf` is lowered to
`tl.libdevice.isinf` in triton.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93951
Approved by: https://github.com/lezcano
2023-02-03 13:51:37 +00:00
6c4dc98b9d [CI][BE] Move docker folder to .ci (#93104)
Follow up after https://github.com/pytorch/pytorch/pull/92569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93104
Approved by: https://github.com/huydhn, https://github.com/seemethere, https://github.com/ZainRizvi
2023-02-03 12:25:33 +00:00
6e1cfcdf4b cauchy_ few fixes (1) check gamma > 0 (2) better dtype error log (#93314)
Related #92047

(1) `torch.Tensor.cauchy_` is missing a check for `gamma > 0` (`torch.distributions.cauchy.Cauchy` correctly checks `gamma > 0`).
(2) Add a better dtype error message, similar to `exponential_`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93314
Approved by: https://github.com/jgong5, https://github.com/fritzo, https://github.com/lezcano
2023-02-03 11:56:28 +00:00
d7c71a95b6 [Dynamo] modify IPEX backend (#92067)
1. Combine the two backends ‘ipex_fp32’ and ‘ipex_bf16’ into one backend ‘ipex’.
2. Modify IPEX backend to work in fake mode and symbolic mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92067
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-03 11:50:28 +00:00
aaa27a6b6d Vectorized more stable complex division (#93277)
Fixes #92043 and completes #92539 by implementing the vectorized, more numerically stable complex division.
I implement this using the internal `abs_` function to avoid branching. I also re-implement the internal `abs_` to make it more stable.
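For reference, a scalar sketch of the scaled (more numerically stable) division the PR vectorizes (plain Python, illustrative only):
```python
def stable_complex_div(a, b, c, d):
    # (a + bi) / (c + di), scaling by max(|c|, |d|) to avoid overflow/underflow.
    s = max(abs(c), abs(d))
    a, b, c, d = a / s, b / s, c / s, d / s
    denom = c * c + d * d
    return (a * c + b * d) / denom, (b * c - a * d) / denom

print(stable_complex_div(1.0, 2.0, 3.0, 4.0))  # (0.44, 0.08) == (1+2j)/(3+4j)
```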

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93277
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-02-03 11:48:20 +00:00
b41e2779f2 cumsum, cumprod, logcumsumexp: adjust grain size (#94025)
A common issue when parallelizing with `TensorIterator`: if the problem size is described as [M, N, K] and only [M, N] is reflected in the TensorIterator (with K being folded), then `grain_size` should also be divided by K.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94025
Approved by: https://github.com/XiaobingSuper
2023-02-03 10:42:27 +00:00
ca8450849b compute dynamic tensor shapes for indexing on the host (#93872)
Hoists computation of some shapes used in triton kernel indexing to the host, so resulting triton code is
```
x1 = (xindex // pks0) % 64
```
instead of
```
x1 = (xindex // (1 + (((((-1) + ks0) // 4))*((((-1) + ks0) // 4))) + (2*((((-1) + ks0) // 4))))) % 64
```
with `pks0` arg computed on the host
```
ps0 = (1 + ((((-1) + s2) // 4)))*(1 + ((((-1) + s2) // 4)))
```
It doesn't work yet for indexing expressions that are directly in the `load` statement, e.g.
```
tmp0 = tl.load(in_ptr0 + (r1 + x0 + (x0*(((((-1) + ks0) // 32))*((((-1) + ks0) // 32)))) + (2*x0*((((-1) + ks0) // 32)))), rmask & xmask, eviction_policy='evict_last').to(tl.float32)
```
Unfortunately, `unet` which is one of the examples failing with floor does the latter:
```
tmp1 = ((-1)*(1/(((-1) + (floor(2.0*(ks0//16))))))) + ((1/(((-1) + (floor(2.0*(ks0//16))))))*(ks0 // 16))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93872
Approved by: https://github.com/jansel
2023-02-03 09:58:39 +00:00
e4f11e01bd [Fake Tensor] Allow fake meta by default, delete unused ctor args (#93993)
Two small changes that I'm bundling together because one of them needs to touch fbcode and I'm not sure how to do stacked diffs + internal changes + land before release cut.

Remove allow_meta from the ctor, and allow it by default: we should be able to trace through meta with fake tensors, so in some sense it's a bit weird to expose an option for the user to disallow this. However, it's still useful debug-wise to error from time to time, so I've added a config option that restores the previous behavior.

Remove `throw_on_data_dependent_ops=True`: this was intended as a temporary behavior as we were smoothing things turning on the erroring. There are no uses anywhere of `throw_on_data_dependent_ops=False` I could find.

These are technically backward-incompatible, but fake tensor is new since the last release / in a private namespace, and I don't want to release it with baggage that would be hard to remove later.

Fix for https://github.com/pytorch/pytorch/issues/92877.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93993
Approved by: https://github.com/bdhirsh, https://github.com/ezyang
2023-02-03 09:23:38 +00:00
be364c0cda [Inductor] Fix OpenMP discovery on MacOS (#93895)
It's not available as a system dependency, so assume that it is installed
using Anaconda.

Also, clang on macOS does not recognize the `-fopenmp` flag, but according
to https://mac.r-project.org/openmp/ and local experiments, `-Xclang -fopenmp` always works.

Test plan:
The following should run and print True:
```python
import torch

def foo(x: torch.Tensor) -> torch.Tensor:
   return torch.sin(x) + torch.cos(x)

if __name__=="__main__":
    x = torch.rand(3, 3)
    x_eager = foo(x)
    x_pt2 = torch.compile(foo)(x)
    print(torch.allclose(x_eager, x_pt2))
```

Skip a number of tests that fail on x86 macOS (for example, rsqrt for bool type and `test_pixel_shuffle_channels_last_cpu` on machines that do not support AVX2).
Tweak a few tests to use double precision when running on CPU, as type promotion for accumulator types is broken.

TODO: Fix PyTorch for M1 compilation with OpenMP, bundle `omp.h` into the package and use it instead.
Fixes https://github.com/pytorch/pytorch/issues/90362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93895
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-02-03 09:13:13 +00:00
e98a942399 [PTD] Land 'to_std' utility parser fix #93209 (#94023)
Land https://github.com/pytorch/pytorch/pull/93209 faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94023
Approved by: https://github.com/wz337
2023-02-03 09:04:34 +00:00
63115b70f0 Fixed issue with --diff-branch arg in dynamo benchmarks (#93989)
As @peterbell10 pointed out, it was giving incorrect results for `compression_ratio`
and `compression_latency` when you used `--diff-branch`.

This fixes it by running a separate subprocess for each branch, so that one branch's run is not affected by the other's.

Also added a couple more significant figures to the numbers in the summary table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93989
Approved by: https://github.com/jansel
2023-02-03 08:36:57 +00:00
3df0e26e20 [SDPA] Remove private version and only utilize public version (#94004)
# Summary
Due to internal failures we needed to keep the private call in torch.nn.mha. This PR undoes this change, so that we call the public function and remove the private function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94004
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-02-03 08:12:09 +00:00
d996acfbc2 [XNNPACK] disable ARM_BF16 and ARM_FP16_VECTOR (#94020)
Summary: This is not used and will cause build failure

Test Plan: CI

Differential Revision: D42982023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94020
Approved by: https://github.com/Skylion007, https://github.com/tiandiao123, https://github.com/digantdesai
2023-02-03 05:01:00 +00:00
dd7d47c4ac abstract vectorized reduction utils on CPU (#92284)
This PR abstracts some reduction utils on CPU, which can be shared by multiple reduction operators, such as `scatter_reduce`, `segment_reduce`, `spmm_reduce`.

No functional change or performance change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92284
Approved by: https://github.com/ezyang
2023-02-03 04:59:24 +00:00
79243516f6 collect CPU info with collect_env.py for new issues reporting (#93899)
Add a CPU information collection feature to collect_env.py for new issue reporting. This helps us triage CPU issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93899
Approved by: https://github.com/malfet
2023-02-03 04:58:53 +00:00
a71395dd88 [inductor] fix crash issue when input is a view tensor (#90150)
Fix the crash failure mentioned in https://github.com/pytorch/pytorch/issues/93460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90150
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-03 04:54:14 +00:00
732a865c1b [vision hash update] update the pinned vision hash (#94016)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94016
Approved by: https://github.com/pytorchbot
2023-02-03 04:21:12 +00:00
d05ec0efeb [dtensor] add split_with_sizes op (#93957)
Add the split_with_sizes op, sharing the implementation with the split op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93957
Approved by: https://github.com/XilunWu
2023-02-03 04:16:30 +00:00
cyy
bfe5e1258b avoid unnecessary static_cast (#93898)
avoid unnecessary static_cast
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93898
Approved by: https://github.com/Skylion007
2023-02-03 03:44:43 +00:00
cyy
dbbcefcd78 remove std::iterator (#93924)
std::iterator is deprecated in C++17, and it is easy to remove it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93924
Approved by: https://github.com/Skylion007
2023-02-03 03:43:48 +00:00
f7bd5d0ccb Revert "[Reland] Add sym_size/stride/numel/storage_offset to native_function.yaml (#91… (#92402)"
This reverts commit 965f4ea3bac8186b99119e73b9ff00e390a5d28b.

Reverted https://github.com/pytorch/pytorch/pull/92402 on behalf of https://github.com/zhxchen17 due to Caused a regression for an export model.
2023-02-03 03:12:43 +00:00
60e8c766b5 Refactor dynamo training backends (#93409)
This splits training.py into many files and moves them from `dynamo.optimizations.training` to `dynamo.backends.*`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93409
Approved by: https://github.com/ezyang
2023-02-03 03:07:15 +00:00
f84f89b1c3 ns: add compare_weights API with a single model (#92058)
Summary:

Adds a compare weights NS API using a single model.

Note: this is not intended for wide usage, so testing is limited
to specific functions our customers care about.  The main reason for adding this
is because existing customers of NS are using the old `compare_weights` API,
and we'd like to move everyone to a single-model API style.

Once all the customers are moved over, we can delete all the old NS code.

Test plan:

```
python test/test_quantization.py -k NShadows.test_extract_weights_linear
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92058
Approved by: https://github.com/jerryzh168
2023-02-03 01:17:19 +00:00
660bea10ba add add_loggers implementation using PNP (#91639)
Summary:

This PR reimplements the old `add_loggers(name_a, model_a, name_b, model_b)`
API in a single-model API style, similar to PNP. This saves memory by not
having to load two models.

Test plan:

```
python test/test_quantization.py -k NShadows.test_add_loggers
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91639
Approved by: https://github.com/jerryzh168
2023-02-03 01:17:19 +00:00
a719bb0e37 Readme: Fix for outdated build-from-source documentation (#91861)
## `pip install -r requirements.txt` in build-from-source documentation

This line
81b5eff3c3/README.md (L182-L188)
Is outdated. Let's default to `requirements.txt`

### My problem
Without having touched this codebase for years, I'm trying to build the repo for local development and run unit tests. I go to `build from source => Contributing.md`. I immediately run into various problems.

* [Contributing.md](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#developing-pytorch) suggests one way of setting up environment different from [README.md#from-source](https://github.com/pytorch/pytorch/blob/master/README.md#from-source) that does not work for me.
* [README.md#from-source](https://github.com/pytorch/pytorch/blob/master/README.md#from-source) suggests a different set of dependencies than [`requirements.txt`](https://github.com/pytorch/pytorch/blob/master/requirements.txt), many of which are unnecessary, and there's still missing ones to run unit tests.
* Dependencies in `requirements.txt` are needed to run unit tests

So there are competing, inlined, outdated, and equally confident recommendations on how to set up. https://github.com/pytorch/pytorch/pull/91850 tries to remove one recommendation; this PR tries to make the default one simpler.

### Goals
* Improve society somewhat 😁
* Remove a dead end roundtrip in the developer onboarding funnel
* Update a duplicated & outdated line of documentation
* Two broken things => one broken thing
* Improve doc maintainability and nudge us to a productive discussion of what `requirements.txt` is there for.

### Non-goals
* Give a definite recommendation how to set up your machine for local development. I read the instructions in readme at this moment as an outline on how to do it.
* Say that `requirements.txt` is a definite guide to dependencies, I know it's not (but probably should be)

### Background
* Dependency handling/reproducibility in this repo is tricky! See the gist of [this](fdbbd20f32/.github/requirements/README.md). There are many different sets of dependencies with different setups for different environments.
* There's been great attempts of _"one requirements.txt to rule them all"_ which got halted https://github.com/pytorch/pytorch/pull/60697/ see https://github.com/pytorch/pytorch/issues/61375
* The unofficial `requirements.txt` file seem to be .circleci/docker/requirements-ci.txt https://github.com/pytorch/pytorch/issues/72556
* Unofficial _"how to build from source"_ docs seem to be here https://github.com/pytorch/pytorch/tree/master/.circleci#how-to-build-a-binary-locally

### Considered alternatives
* a) Point only to python dependencies in `requirements.txt` **(Chosen option)**
```
conda install cmake ninja
pip install -r requirements.txt
```
This guarantees that `python setup.py` runs (on my machine) and gets me one step closer to being able to run `python test/run_test.py`
* b) Only add whats needed to `python setup.py install`. Point to `Contributing.md` for explanations on how to run tests (which doesn't exactly mention how yet).
```
conda create -n pytorch-source python cmake ninja pyyaml typing_extensions
conda activate pytorch-source
python setup.py develop
```
* c) Add dependencies needed to run (most) unit tests
I assume _"Install from source"_ describes how to "install so I can do development.". This is why we recommend `python setup.py develop`. Doing development implies running unit tests.
```
conda create -n pytorch-source python cmake ninja pytest click
conda activate pytorch-source
pip install -r requirements.txt xdoctest
python setup.py develop
python test/run_test.py --keep-going
```
This still eclectically goes outside the simple principle _"Use dependencies in requirements.txt"_ without solving the whole problem. Instructions to get tests to run is not the goal of this PR.

* d) Point to ex [`.circleci/docker/requirements-ci.txt`](https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt) or any of the system-specific sets of pinned requirements like [`requirements-{conda-env-macOS-ARM64}.txt`](https://github.com/pytorch/pytorch/blob/master/.github/requirements/conda-env-macOS-ARM64)
I don't want to jump into this rabbit hole.

<details>
  <summary>My system according to setup.py when verifying it runs</summary>

```
Target system: Darwin-21.6.0
Target processor: arm64
Host system: Darwin-21.6.0
Host processor: arm64
Detected C compiler: AppleClang @ /Library/Developer/CommandLineTools/usr/bin/cc
CMake: 3.22.1
Make program: /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-source/bin/ninja
Python version      : 3.10.8
Python executable   : /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-source/bin/python
Pythonlibs version  : 3.10.8
Python library      : /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-source/lib/libpython3.10.a
Python includes     : /opt/homebrew/Caskroom/miniconda/base/envs/pytorch-source/include/python3.10
Python site-packages: lib/python3.10/site-packages
```

</details>

See details in comments below.
[skip ci]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91861
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2023-02-03 00:52:23 +00:00
0f5b6caa16 [FSDP][optim_state_dict] Ignore the state check on rank that does not own the corresponding parameter (#93318)
When a rank does not own a parameter (parameter.numel() == 0), its optim state is not valid and should not be checked against the current saved one.

Differential Revision: [D42865237](https://our.internmc.facebook.com/intern/diff/D42865237/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93318
Approved by: https://github.com/rohan-varma
2023-02-03 00:50:04 +00:00
0844213f7d Improve Windows CI logic to cleanup leftover processes (#93914)
This is really hard to debug; the faulty runner had already disappeared by the time I tried to log in. However, I figured out a way to get all the processes that could potentially hold the workspace by running:

```
choco install sysinternals -y
handle64.exe C:\actions-runner\_work\pytorch\pytorch\test\test-reports\
```

This gives me a better list of processes to kill.

```
PS C:\Windows\system32> handle64.exe C:\actions-runner\_work\pytorch\pytorch\test\test-reports\

Nthandle v5.0 - Handle viewer
Copyright (C) 1997-2022 Mark Russinovich
Sysinternals - www.sysinternals.com

python.exe         pid: 1672   type: File           574: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
python.exe         pid: 4604   type: File           6C8: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
python.exe         pid: 4604   type: File           6CC: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
ninja.exe          pid: 4764   type: File           468: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
ninja.exe          pid: 4764   type: File           5F4: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
cl.exe             pid: 5336   type: File           468: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
cl.exe             pid: 5336   type: File           5F4: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
nvcc.exe           pid: 1680   type: File           468: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
nvcc.exe           pid: 1680   type: File           5F4: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
cmd.exe            pid: 976    type: File           468: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
cmd.exe            pid: 976    type: File           5F4: C:\actions-runner\_work\pytorch\pytorch\test\test-reports\test_cpp_extensions_jit_r04_oc2b.log
```

Crossing my fingers to have this working
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93914
Approved by: https://github.com/clee2000
2023-02-03 00:47:50 +00:00
5817695bfa [pt2] Fix arange to match ATen behavior (#93353)
Fixes #92676

`arange` infers the output dtype from the argument types, but in order to reduce
falling back to ATen, inductor preferred to cast whole-number float arguments to
int, which gave the wrong output dtype. Instead, this decomposes floating-point
arange into the prim equivalent for integers.

This also changes the signature of `prims.arange` to

```python
prims.iota(length, *, start, step, **factory_kwargs)
```

which only supports integer arguments. This is done because calculating the
output size from `start, end, step` is surprisingly complex and liable to off-by-one
errors, so it should not be duplicated in each backend.
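A sketch of the length computation the decomposition centralizes (a hypothetical helper, with the usual off-by-one pitfall handled by the ceiling):
```python
import math

def arange_length(start, end, step):
    if step == 0:
        raise ValueError("step must be nonzero")
    return max(math.ceil((end - start) / step), 0)

print(arange_length(0.0, 1.0, 0.3))  # 4 -> values 0.0, 0.3, 0.6, 0.9
```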

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93353
Approved by: https://github.com/ngimel, https://github.com/lezcano
2023-02-03 00:44:32 +00:00
264c89658b Move in backward opt setup to helper (#92059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92059
Approved by: https://github.com/awgu
2023-02-02 23:57:14 +00:00
e32d99ae19 [FSDP][optim_state_dict] Make FSDP.optim_state_dict compatible with DMP (#93285)
`torchrec.DistributedModelParallel` overwrites `named_parameters` and is not compatible with `FullyShardedDataParallel`'s optim_state_dict. This PR adds some workaround in `FullyShardedDataParallel` to make both work together.

Differential Revision: [D42764611](https://our.internmc.facebook.com/intern/diff/D42764611/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93285
Approved by: https://github.com/rohan-varma
2023-02-02 23:42:54 +00:00
989722cd19 Use global PIC flag for XNNPACK (#93896)
Summary:
- XNNPACK object libraries need an explicit PIC flag when building the static, PIC libXNNPACK.a
- Without this, the link process runs into relocation errors
- Using this global switch to avoid updating the XNNPACK CMake

Test Plan: CI

Differential Revision: D42944764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93896
Approved by: https://github.com/Skylion007, https://github.com/Neilblaze, https://github.com/salilsdesai
2023-02-02 23:38:21 +00:00
7db4d813c3 [dynamo 3.11] fix opmap key error (#93983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93983
Approved by: https://github.com/jansel, https://github.com/malfet, https://github.com/albanD
2023-02-02 23:05:44 +00:00
37a28255cb [dynamo, benchmarks] Fix dashboard update location (#94006)
Get dashboard uploading again

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94006
Approved by: https://github.com/yanboliang
2023-02-02 23:01:57 +00:00
c2fb1f8ee4 Add is_integer assumption to ModularIndexing (#93903)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93903
Approved by: https://github.com/ezyang
2023-02-02 22:51:34 +00:00
b7a5c79399 [inductor] Fix type inference in CPU masked operations (#93842)
Fixes #93351

The existing code guesses that `tmp3` is probably a `float`, and so truncates
any `double` values

```cpp
float tmp3 = 0.0;
if(tmp2)
{
    auto tmp4 = in_ptr0[i0];
    tmp3 = tmp4;
}
```

The proposed change is to generate a lambda expression that represents the body
of the masked operation, and infer the type from the return value:
```cpp
auto tmp3 = [&]
{
    auto tmp4 = in_ptr0[i0];
    return tmp4;
}
;
auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93842
Approved by: https://github.com/jgong5, https://github.com/Valentine233, https://github.com/jansel
2023-02-02 22:42:19 +00:00
fde220ca44 [BE] Get rid of six in caffe2 code (#93956)
Mostly `s/string_types/str/` `s/binary_types/bytes/` and `s/text_types/str/`
Also `y.extend([str(x) for x in foo])`->`y.extend(map(str, foo))`
As Python-2 is long dead

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93956
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-02-02 22:13:37 +00:00
37fcc53096 Remove import cycle from torch._refs.nn.functional (#93948)
This makes it possible to import torch._refs from
torch._subclasses.fake_tensor

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93948
Approved by: https://github.com/albanD
2023-02-02 21:06:37 +00:00
4e4293f15f Add meta registration for bucketize (#93893)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93893
Approved by: https://github.com/zhxchen17
2023-02-02 21:03:08 +00:00
2b0d7e63f0 Move dynamo.optimizations.distributed to backends (#93408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93408
Approved by: https://github.com/wconstab
2023-02-02 20:42:17 +00:00
2910695942 Remove cuda 11.6 from nightly (#93979)
Remove CUDA 11.6 from CI, replace with 11.7
Following the Release readme here: https://github.com/pytorch/pytorch/blob/master/RELEASE.md#release-compatibility-matrix
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93979
Approved by: https://github.com/Skylion007, https://github.com/clee2000, https://github.com/malfet
2023-02-02 20:27:19 +00:00
ee2729890c Refactor dynamo register_backend/BACKENDS (#93389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93389
Approved by: https://github.com/voznesenskym
2023-02-02 19:41:48 +00:00
6e285c479d Remove cuda 11.6 from CI replace with 11.7 (#93406)
Remove CUDA 11.6 from CI, replace with 11.7
Following the Release readme here: https://github.com/pytorch/pytorch/blob/master/RELEASE.md#release-compatibility-matrix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93406
Approved by: https://github.com/malfet, https://github.com/desertfire
2023-02-02 19:16:05 +00:00
f9d2600ce2 [Dynamo] Rename GuardBuilder.guarded_code -> check_fn_manager (#93934)
I was reading Dynamo code to learn and thought to clarify this naming to remove the `TODO`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93934
Approved by: https://github.com/ezyang
2023-02-02 17:20:25 +00:00
f5e9c8ce54 Revert "Remove CUDA 11.6 from nightly builds (#93404)"
This reverts commit c76ac8eef24299901e0b8fe163d2438528cbaf3e.

Reverted https://github.com/pytorch/pytorch/pull/93404 on behalf of https://github.com/clee2000 due to breaking lint
2023-02-02 17:10:01 +00:00
5d259425fc Revert "[inductor] fix crash issue when input is a view tensor (#90150)"
This reverts commit b11ec270bad96bf6078564ec4b2dc5dc69ea5bfa.

Reverted https://github.com/pytorch/pytorch/pull/90150 on behalf of https://github.com/clee2000 due to failing test_inplace_unsqueeze3 (__main__.CPUReproTests) https://github.com/pytorch/pytorch/actions/runs/4074618739/jobs/7020199369 b11ec270ba, marking as landrace cuz all jobs are green on pr
2023-02-02 17:06:34 +00:00
769eca6f97 Basic Validation for FSDP state_dict transformations of modules with persistent buffers (#93396)
Fixes #93391

Thank you to the PyTorch Distributed team for your invaluable contributions to the PyTorch ecosystem, your work is immensely impressive and inspiring!

As mentioned in  #93391, in preparing the downstream package I maintain ([finetuning-scheduler](https://github.com/speediedan/finetuning-scheduler)) to support PyTorch 2.0's version of FSDP, I noticed modules that include multiple persistent buffers were not having their state properly transformed during saving of `state_dict`s.

The issue was that the post-state_dict hook codepath shared by the `FULL_STATE_DICT` and `SHARDED_STATE_DICT` `_state_dict_type`s ([`_common_unshard_post_state_dict_hook`](332d55d3df/torch/distributed/fsdp/_state_dict_utils.py (L158))) was inadvertently referencing a local variable (`buffer`) that was used in a [prior transformation](332d55d3df/torch/distributed/fsdp/_state_dict_utils.py (L231)), instead of the `buffers` variable that should have been referenced in the iteration context:

332d55d3df/torch/distributed/fsdp/_state_dict_utils.py (L251-L253)

In this case, modules with a single persistent buffer or without mixed precision enabled would be unaffected. With multiple buffers and mixed precision enabled however, the issue may appear stochastically in proportion to the ratio of persistent buffers that have compatible dimensions (since the value of the last buffer visited in the ``buffer_names`` ``Set`` is copied to all buffers and the ``Set`` iteration order will of course vary)

```bash
File ".../pytorch/torch/nn/modules/module.py", line 2028, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for FullyShardedDataParallel:
    size mismatch for _fsdp_wrapped_module.1._fsdp_wrapped_module.running_mean: copying a param with shape torch.Size([]) from checkpoint, the shape in current model is torch.Size([10]).
```
To both address this issue and enhance coverage to avoid similar issues, this PR fixes the aforementioned typo and adds an additional set of basic tests that validate `state_dict` saving and loading for modules with persistent buffers in various contexts.

I found that adding another model along with additional buffer-specific logic to adapt [`test_basic_save_and_load_state_dict`](76b683b008/test/distributed/fsdp/test_fsdp_state_dict.py (L439)) for the purposes of this coverage seemed to increase complexity of that test to an undesirable degree.

Instead of adding additional complexity to that existing test, I've added a new test ``test_buffers_save_and_load_state_dict`` that does basic validation of ``state_dict`` saving and loading with mixed precision, ``state_dict_type``, and CPU offloading parameterization. If you would prefer that I extend the existing basic ``state_dict`` test with the persistent-buffers model instead, I'm happy to do so; I just thought this was cleaner. Also, I thought doubling the number of tests with a ``use_orig_params`` parameterization, or by testing additional non-default buffer mixed-precision data types, was computationally imprudent, but let me know if you'd like me to add those tests as well.

The only other notable test change is that I've refactored ``TestFSDPStateDict._compare_models`` to accommodate both ``buffers`` and ``parameters`` comparisons without code duplication.

Thanks again to the PyTorch Distributed team for your exceptional contributions. I've got some more to do adapting my package for 2.0's FSDP but it's been a delight so far thanks to your superlative work!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93396
Approved by: https://github.com/rohan-varma, https://github.com/awgu, https://github.com/fegin
2023-02-02 15:51:58 +00:00
98e1b3e93a Merge Inductor perf smoke test with other inductor CI tests (#93395)
Summary: Now the smoke test can also be triggered with the
ciflow/inductor label.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93395
Approved by: https://github.com/weiwangmeta, https://github.com/malfet
2023-02-02 15:42:59 +00:00
9ff7ddb241 [inductor] Don't import torchvision (#93027)
Fixes #93019

Since PyTorch regularly breaks binary compatibility, `torchvision` must be
compiled with the exact same version of PyTorch. If not, then importing it may
cause mysterious failures at runtime due to binary incompatibility.

This fixes the issue by delaying the `make_fallback` call for
`torchvision.roi_align` until the operator appears in a graph being lowered, by
which point the user must have imported torchvision themselves.
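
A rough sketch of the deferred-registration idea (the helper and names are hypothetical, not the actual inductor lowering code):

```python
import sys

# Hypothetical registry of ops whose fallback registration is deferred.
_DEFERRED_FALLBACKS = {"torchvision.ops.roi_align"}

def can_fall_back(op_qualname: str) -> bool:
    # Called only once the op actually appears in a graph being lowered.  By
    # then the user must already have imported the defining package, so we
    # never trigger a fresh (potentially ABI-mismatched) import ourselves.
    package = op_qualname.split(".")[0]
    return op_qualname in _DEFERRED_FALLBACKS and package in sys.modules
```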

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93027
Approved by: https://github.com/jansel
2023-02-02 15:42:32 +00:00
481a334b7a [FSDP][3/N] Refactor summon_full_params unit tests (#92298)
**Overview**
- This PR refactors the `summon_full_params()` unit tests to prepare for `unshard_params()` by consolidating redundant tests and improving others.
- This PR enables `CPUOffload(offload_params=True)` + `NO_SHARD` + `writeback=True`.
- This PR provides an improved error message when calling `summon_full_params()` from an invalid context (i.e. from forward, backward, or in `summon_full_params()`).

**Details**
<details>
<summary>Existing Unit Tests</summary>

`test_summon_full_param_writeback()` with `world_size=1`
`test_summon_full_param_writeback()` with `world_size=2`
- Tests that `writeback=True` persists write and that `writeback=False` does not persist write when modifying a root FSDP instance's `flat_param` (`modify_outer=True`) or a non-root FSDP instance's `flat_param` (`modify_outer=False`); additionally configures with `mixed_precision` and `use_orig_params`
- `CPUOffload(offload_params=True)` + `world_size=1` is not tested because it is not supported.
- The write inside `summon_full_params()` is on the `flat_param` itself, which is not the expected usage.

`test_summon_full_param_shard_value()`
- Tests that reconstructing the `flat_param` (by re-flattening and chunking parameters) inside `summon_full_params()` gives the same as the originally constructed `flat_param` when using a single FSDP instance
- This test seems to exercise the FSDP sharding algorithm, not the specification of `summon_full_params()`. The only relevant part being implicitly tested is that `model.parameters()` order is preserved.
- This test assumes the current FSDP sharding algorithm.

`test_summon_full_param_recursive()`
- Tests that `recurse=True` recursively applies to all FSDP instances and that `recurse=False` does not
- This test assumes the current FSDP sharding algorithm.

`test_cannot_summon_full_params_from_forward()`
`test_cannot_summon_full_params_from_backward()`
- Tests that calling `summon_full_params()` from inside the forward or backward raises an error
- The error message leaks `FlatParamHandle` to the user. I provided a better error in this PR.

`test_summon_full_params_respects_reshard_after_forward()`
- Tests that calling `summon_full_params()` after forward preserves whether the padded unsharded `flat_param` data is freed or not (like `reshard_after_forward`)
- This test depends on FSDP internals (`flat_param._full_param_padded.storage().size()`).

`test_summon_single_param()`
- Tests that writing to padding with `writeback=True` does not persist those writes (doing so by using a singleton `(1, 1)` parameter that gets flattened and padded to `(2,)`)
- This test name is misleading.

`test_summon_full_params_equivalence()`
- Tests `writeback`, `rank0_only`, and `offload_to_cpu` with `writeback=not rank0_only`, using `CPUOffload(offload_params=True)` and including a `torch.cuda._sleep(int(1e6))` _after_ the write in `summon_full_params()`
- The PR introducing this test said that the `torch.cuda._sleep(int(1e6))` exercised the stream synchronization in `summon_full_params()`--namely that the current stream waits for the all-gather stream after all-gathering the parameters. I did not follow conceptually how that works since the `torch.cuda._sleep()` call happens after both the all-gather and write and is in the default stream, which seems to be after the relevant ops. If we clarify this, I can re-incorporate this into the unit tests. Doing so is not a high priority since `summon_full_params()` unshards in the default stream now and does not require stream synchronization.
- This unit test has overlap with `test_summon_full_param_writeback()` and can be coalesced.

`test_summon_from_non_fsdp()`
- Tests calling `summon_full_params()` with default args on a non-FSDP root module exposes the original parameters correctly
- This test actually covers much of the specification since checking for original parameter equivalence includes shape, value, device, etc. checking.

`test_reshard_outside_forward_backward_iteration()`
- Tests that calling `summon_full_params()` after forward preserves whether the padded unsharded `flat_param` data is freed or not (like `reshard_after_forward`) and that calling `summon_full_params()` after backward preserves that the padded unsharded `flat_param` data are freed; additionally configures `mixed_precision`
- This test strictly dominates `test_summon_full_params_respects_reshard_after_forward()` in strictness since it includes the check after backward as well.

`test_params_are_unflattenned()`
 - Tests that original parameters are exposed with the unflattened shape factoring in `rank0_only` (e.g. including that nonzero ranks reshard early when `rank0_only=True`) and that with `offload_to_cpu=True`, the `flat_param`s are moved back to GPU after exiting the context; additionally configures `mixed_precision`

`test_params_count_and_value()`
- Tests that original parameters are all exposed and with the correct values factoring in `rank0_only` (e.g. including that nonzero ranks do not expose the original parameters when `rank0_only=True`) and that with `offload_to_cpu=True`, the `flat_param`s are moved back to GPU after exiting the context; additionally configures `mixed_precision`

`test_raises_rank0_with_writeback()`
- Tests that `rank0_only` + `writeback=True` raises an error

`test_named_parameters_buffers()`
- Tests that `named_parameters()` and `named_buffers()` return clean names (without FSDP prefixes) inside `summon_full_params()`

`test_with_grads_core()`
- Tests `with_grads=True` by comparing against DDP

`test_with_grads_none_grads()`
- Tests `with_grads=True` when ranks' `FlatParameter`s have `None` gradient

</details>

<details>
<summary>New Unit Tests</summary>

`test_unshard_params_writeback_no_shard()` (with `world_size=1`)
`test_unshard_params_writeback()` (with `world_size=2`)
- Tests the `writeback` argument (using the default value for all others)

`test_unshard_params_param_data_no_shard()` (with `world_size=1`)
`test_unshard_params_param_data()` (with `world_size=2`)
- Tests that parameters are exposed correctly for `recurse=True` and all other argument configs for a non-FSDP root module

`test_unshard_singleton_param_writeback()`
- Tests `writeback=True` for a singleton parameter, which includes testing that writing to padding does not persist

`test_unshard_params_respects_reshard()`
- Tests that unsharding parameters respects the expected reshard behavior between forward and backward as well as after backward

`test_unshard_params_recurse()`
- Tests the `recurse` argument (using default for all others)

`test_offload_to_cpu_no_shard_raises()`
- Tests that `offload_to_cpu=True` with `NO_SHARD` raises an error

</details>

<details>
<summary>Summary of Unit Test Changes</summary>

- `test_summon_full_param_writeback` -> `test_unshard_params_writeback()`
- `test_summon_full_params_equivalence()`, `test_params_are_unflattenned()`, `test_params_count_and_value()` -> `test_unshard_params_param_data()`
- `test_summon_full_params_respects_reshard_after_forward()`, `test_reshard_outside_forward_backward_iteration()` -> `test_unshard_params_respects_reshard()`
- `test_summon_full_param_recursive()` -> `test_unshard_params_recurse()`
- `test_named_parameters_and_buffers()` unchanged
- `test_with_grads_core()` unchanged
- `test_with_grads_none_grads()` unchanged
- `test_cannot_summon_full_params_from_forward()`, `test_cannot_summon_full_params_from_backward()` -> `test_unshard_params_from_forward_raises()`, `test_unshard_params_from_backward_raises()`
- `test_raises_rank0_with_writeback()` -> `test_rank0_only_with_writeback_raises()`
- `test_offload_to_cpu_no_shard_raises()` new
- `test_summon_full_param_shard_value()` removed

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92298
Approved by: https://github.com/rohan-varma
2023-02-02 15:10:14 +00:00
10990734ce [FSDP][2/N] _summon_full_params -> _unshard_params (#92297)
**Overview**
This PR stack will add support for unsharding FSDP's sharded parameters for `fully_shard`. This PR takes the first step by doing some internal refactoring.
- The existing API for wrapper FSDP is the static method `summon_full_params()`, which calls into the helper `_summon_full_params()`.
- This PR refactors:
    - `summon_full_params()` core logic to `_unshard_params()`
    - `_summon_full_params()` to `_unshard_params_recurse()`, which has a `recurse: bool` argument
    - Previous `_unshard_params()` to `_unshard_fsdp_state_params()`, which applies to a single FSDP state

**Details**
- This PR introduces `_get_fsdp_states_with_modules()` and `_get_root_fsdp_states_with_modules()`, which additionally return the modules along with the FSDP states. The modules are needed for handling `FlatParameter` registration.
    - We may be able to remove this if we clean up the `use_orig_params=True` vs. `False` code paths because for `True`, the `FlatParameter` is not registered, meaning that it does not need to be de-registered.
    - Since `fully_shard` requires `use_orig_params=True`, we may not need `_get_fsdp_states_with_modules()` and `_get_root_fsdp_root_modules()`; however, I prefer to make the separation of FSDP state and module explicit for now for clarity.

**Follow-Ups**
- `writeback=True` and `rank0_only=True` raises an error. The previous explanation was:
> is not supported, as model parameter shapes will be different across ranks, and writing to them can lead to inconsistencies across ranks when the context is exited.

I am not exactly sure what the different model parameter shapes refer to. However, I believe that we can support `writeback=True` and `rank0_only=True` by broadcasting the `FlatParameter` from rank 0 in the `finally`, writing back, and freeing. This should not increase the peak memory since rank 0 already holds the unsharded `FlatParameter` in GPU memory before writing back and nonzero ranks do not have any other unsharded `FlatParameter`s in GPU memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92297
Approved by: https://github.com/rohan-varma
2023-02-02 15:10:14 +00:00
c76ac8eef2 Remove CUDA 11.6 from nightly builds (#93404)
Remove CUDA 11.6 from nightly builds.
Following the Release readme here: https://github.com/pytorch/pytorch/blob/master/RELEASE.md#release-compatibility-matrix
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93404
Approved by: https://github.com/malfet
2023-02-02 14:26:52 +00:00
a14e3190e3 Mark buffers that reuse other buffers (#93329)
Provides a way at codegen time to emit code conditioned on
having a fresh allocation vs reusing an input.

- For collective ops, if reusing an input, a copy can be skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93329
Approved by: https://github.com/jansel
2023-02-02 14:22:26 +00:00
d69876b2f1 Refactor to allow reuse of SchedulerNode.allocate (#93328)
Paves the way for ExternKernelSchedulerNode to also be able to
use the buffer inplace logic, needed for Collective ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93328
Approved by: https://github.com/jansel
2023-02-02 14:22:26 +00:00
84187399fc retire sparse_mask_helper (#91714)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91714
Approved by: https://github.com/albanD, https://github.com/amjames, https://github.com/cpuhrsch
2023-02-02 13:53:02 +00:00
a2fded3001 update fbgemm third party (#93907)
To include https://github.com/pytorch/FBGEMM/pull/1572
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93907
Approved by: https://github.com/jianyuh
2023-02-02 13:37:19 +00:00
b11ec270ba [inductor] fix crash issue when input is a view tensor (#90150)
Fix the crash failure mentioned in https://github.com/pytorch/pytorch/issues/93460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90150
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-02 12:49:26 +00:00
a672fd1dba [Inductor] add config for weight prepacking (#93811)
Fixes #93495

Mkldnn weight prepacking may lead to large memory footprint for some models such as UniXcoder. In this case, disabling mkldnn weight prepacking is needed to avoid memory overload.

This PR adds a config for switching mkldnn weight prepacking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93811
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-02 12:18:40 +00:00
59ccc786df Check for none for NNModuleVariable.__module__ (#93326)
Test Plan: CI

Differential Revision: D42869182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93326
Approved by: https://github.com/suo
2023-02-02 09:41:41 +00:00
f4db47b176 inductor: don't assert error when do cpu fx fusion for training mode (#93837)
This PR does the following:

1. Skip CPU fx fusion for training mode.
2. Skip packed Linear when input dim < 2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93837
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-02 08:12:07 +00:00
3d020b6903 inductor: separate bias from PackeLinear for better performance (#93348)
For PackedLinear with bias, we always copy the bias to the output before doing the computation:
d7a3f2128f/aten/src/ATen/native/mkldnn/Linear.cpp (L389-L397).

This PR separates the bias from it so that the bias add can be fused with the post-op.
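
A minimal sketch of the idea in plain PyTorch (not the mkldnn packed kernel itself): compute the matmul without bias, then add the bias as a separate pointwise op so it can fuse with whatever post-op follows.

```python
import torch
import torch.nn.functional as F

def linear_bias_fused_with_postop(x, weight, bias):
    # matmul without bias ...
    out = F.linear(x, weight)
    # ... then bias-add and the post-op (ReLU here) as one fusable pointwise stage
    return F.relu(out + bias)

x = torch.randn(8, 16)
w = torch.randn(32, 16)
b = torch.randn(32)
print(linear_bias_fused_with_postop(x, w, b).shape)  # torch.Size([8, 32])
```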

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93348
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-02 08:07:38 +00:00
4b0f1cc1ee [FSDP][optim_state_dict][10/N] Make optim_state_dict and optim_state_dict_to_load public (#92118)
Make optim_state_dict and optim_state_dict_to_load public APIs and consolidate them with state_dict by using the same state_dict_type to decide how to perform the optimizer state_dict save and load.

Differential Revision: [D42488022](https://our.internmc.facebook.com/intern/diff/D42488022/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92118
Approved by: https://github.com/rohan-varma
2023-02-02 08:04:20 +00:00
84ee50a28a inductor: add conv+hardsigmoid fusion for cpu path(reland) (#93341)
re-land https://github.com/pytorch/pytorch/pull/91433.

The internal ideep upgrade issue is resolved at https://github.com/pytorch/pytorch/pull/92239.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93341
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-02-02 07:59:56 +00:00
6f3018d50b [DTensor] implement dist_split as a sharding prop rule (#93306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93306
Approved by: https://github.com/wanchaol
2023-02-02 07:56:44 +00:00
966030f7c7 [DTensor][fix] MultiThreadedTestCase misses _tls object and it won't reflect in CI (#93832)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93832
Approved by: https://github.com/wanchaol
2023-02-02 07:56:44 +00:00
b82f93d561 [DTensor] fix DTensorSpec dim_map description (#93160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93160
Approved by: https://github.com/wanchaol
2023-02-02 07:56:44 +00:00
db87396474 inductor: align the decomposition output stride with none-decomposition path for torch.lerp (#93336)
As the title says, we need to align the decomposition output stride with the non-decomposition path for torch.lerp, and also enable its lowering path for inductor.

After this PR for the following case:

```

def fn(i0, i1):
    # i0: (10, 3, 10)
    # i1: (3, 10, 10)
    x1 = i0.transpose(-2, -3)
    #y = torch.lerp(x1, x1, 70000)
    z = torch.lerp(i1, x1, 70000)
    return z

x0 = torch.rand(10, 3, 10)
x1 = torch.rand(3, 10, 10)
ret_eager = fn(x0, x1)
print('==== Eager mode OK! ====')
compiled = torch.compile(fn, fullgraph=True)
ret_compiled = compiled(x0, x1)
print('==== compile mode OK! ====')
ret_compiled = compiled(x0, x1)
print(torch.equal(ret_eager, ret_compiled))
print(ret_eager.stride()==ret_compiled.stride())
```

the inductor output code will be like(CPU):

```

from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       float* __restrict__ out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=0; i0<3; i0+=1)
        {
            #pragma GCC ivdep
            for(long i1=0; i1<10; i1+=1)
            {
                for(long i2=0; i2<0; i2+=1)
                {
                    auto tmp7 = at::vec::Vectorized<float>::loadu(in_ptr0 + (10*i0) + (16*i2) + (30*i1));
                    auto tmp8 = at::vec::Vectorized<float>::loadu(in_ptr1 + (10*i1) + (16*i2) + (100*i0));
                    auto tmp0 = at::vec::Vectorized<float>(static_cast<float>(70000.0));
                    auto tmp1 = tmp0.abs();
                    auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.5));
                    auto tmp3 = tmp1 >= tmp2;
                    auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1));
                    auto tmp5 = tmp0 - tmp4;
                    auto tmp6 = decltype(tmp5)::blendv(tmp0, tmp5, tmp3);
                    auto tmp9 = tmp7 - tmp8;
                    auto tmp10 = tmp6 * tmp9;
                    auto tmp11 = decltype(tmp7)::blendv(tmp8, tmp7, tmp3);
                    auto tmp12 = tmp10 + tmp11;
                    tmp12.store(out_ptr0 + (10*i1) + (16*i2) + (100*i0));
                }
                #pragma omp simd simdlen(8)
                for(long i2=0; i2<10; i2+=1)
                {
                    auto tmp7 = in_ptr0[i2 + (10*i0) + (30*i1)];
                    auto tmp8 = in_ptr1[i2 + (10*i1) + (100*i0)];
                    auto tmp0 = static_cast<float>(70000.0);
                    auto tmp1 = std::abs(tmp0);
                    auto tmp2 = static_cast<float>(0.5);
                    auto tmp3 = tmp1 >= tmp2;
                    auto tmp4 = static_cast<float>(1);
                    auto tmp5 = tmp0 - tmp4;
                    auto tmp6 = tmp3 ? tmp5 : tmp0;
                    auto tmp9 = tmp7 - tmp8;
                    auto tmp10 = tmp6 * tmp9;
                    auto tmp11 = tmp3 ? tmp7 : tmp8;
                    auto tmp12 = tmp10 + tmp11;
                    out_ptr0[i2 + (10*i1) + (100*i0)] = tmp12;
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    buf1 = empty_strided((3, 10, 10), (100, 10, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(arg1_1.data_ptr()), c_void_p(buf1.data_ptr()))
    del arg0_1
    del arg1_1
    return (buf1, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((10, 3, 10), (30, 10, 1), device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided((3, 10, 10), (100, 10, 1), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1, arg1_1]))

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93336
Approved by: https://github.com/jansel
2023-02-02 07:40:28 +00:00
cff4d3bb22 inductor: fix convert_shape_to_symint (#93349)
Fixes https://github.com/pytorch/pytorch/issues/93833.

When `lst` is composed of a mix of static shapes and `sympy.Expr`, convert the static shapes to ints and the `sympy.Expr` entries to `SymInt`s.
The old logic required all of the elements of `lst` to be static before converting them to ints.
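
A hypothetical sketch of the fixed behavior (not the actual inductor helper): each element of the list is handled independently.

```python
import sympy

def convert_mixed_shape(lst):
    # Static entries become plain ints; symbolic entries are kept symbolic
    # (the real code wraps them as SymInts).
    return [s if isinstance(s, sympy.Expr) else int(s) for s in lst]

s0 = sympy.Symbol("s0")
print(convert_mixed_shape([4, s0, 8]))  # [4, s0, 8]
```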

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93349
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-02 07:34:57 +00:00
e7ace1ff93 [PT-D][NamedOptimizer][6/N] Upstream init_state from keyed to NamedOptimizer (#93887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93887
Approved by: https://github.com/rohan-varma
2023-02-02 07:14:49 +00:00
f58ba553b7 [ROCm] Fix distributed tests failure and enable ROCm distributed CI (#92932)
Distributed tests fail with AttributeError: 'torch._C._distributed_c10d.ProcessGroup'
object has no attribute '_set_backend' when running distributed/test_c10d_spawn_gloo.py.
This leads to tests not progressing, resulting in a hang.
Use _register_backend instead of _set_backend.

Fixes https://github.com/pytorch/pytorch/pull/91632

More details of issue: https://github.com/pytorch/pytorch/pull/91632#issuecomment-1402831950 and https://github.com/pytorch/pytorch/pull/91632#issuecomment-1405646977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92932
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet, https://github.com/H-Huang
2023-02-02 04:29:10 +00:00
569f2e3228 Remove many untested dynamo backends (#93382)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93382
Approved by: https://github.com/mlazos, https://github.com/voznesenskym
2023-02-02 04:08:22 +00:00
653dc73df0 [SDPA] Wire up FlashAttention's backward (#92917)
# Summary
This PR creates _flash_attention_backward and _scaled_dot_product_flash_attention_backward native functions and registers them to the respective derivatives.yaml.

The goal is to replicate the torch.autograd.Function defined in the FlashAttention repo [here](33e0860c9c/flash_attn/flash_attn_interface.py (L126)) natively in PyTorch.  One thing that we don't have access to in native PyTorch is ctx.save_for_backward, so in order to save these variables I extended the objects returned from the forward functions.
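
To make that saving pattern concrete, here is a minimal plain-attention sketch (not FlashAttention and not the real ATen signatures): the forward returns the extra tensor the backward needs instead of relying on ctx.save_for_backward.

```python
import torch

def sketch_attention_forward(q, k, v):
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    attn = scores.softmax(dim=-1)
    out = attn @ v
    # "attn" plays the role of the saved state (e.g. logsumexp in the real op).
    return out, attn

def sketch_attention_backward(grad_out, q, k, v, attn):
    grad_v = attn.transpose(-2, -1) @ grad_out
    grad_attn = grad_out @ v.transpose(-2, -1)
    # softmax backward, then undo the 1/sqrt(d) scaling
    grad_scores = attn * (grad_attn - (grad_attn * attn).sum(-1, keepdim=True))
    grad_scores = grad_scores / q.shape[-1] ** 0.5
    return grad_scores @ k, grad_scores.transpose(-2, -1) @ q, grad_v

# Quick numerical check against autograd.
q, k, v = (torch.randn(2, 4, 8, dtype=torch.double, requires_grad=True) for _ in range(3))
out, attn = sketch_attention_forward(q, k, v)
out.backward(torch.ones_like(out))
gq, gk, gv = sketch_attention_backward(torch.ones_like(out), q, k, v, attn)
print(torch.allclose(gq, q.grad), torch.allclose(gk, k.grad), torch.allclose(gv, v.grad))
```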

### MetaFunctions
I also updated the FlashAttention meta functions to mirror the real outputs, and I added a meta registration for the backward. I have an XLMR training script; while eager training now works with FlashAttention, compiling this module fails with the inductor error below.

### Questions
Performance issues vs. the memory-efficient kernel when using torch.nn.mha_forward?

TorchCompile -> see the proposed solution below.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92917
Approved by: https://github.com/cpuhrsch
2023-02-02 04:02:30 +00:00
b6367c8aa4 Remove torch/_dynamo/optimizations/inference.py (#93381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93381
Approved by: https://github.com/Chillee
2023-02-02 03:42:50 +00:00
68b06ee4d4 Add torch_compile_debug/ to .gitignore (#93889)
# Summary
I have almost checked this in multiple times. Add to gitignore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93889
Approved by: https://github.com/malfet
2023-02-02 03:31:55 +00:00
61d3589e07 [vision hash update] update the pinned vision hash (#93892)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93892
Approved by: https://github.com/pytorchbot
2023-02-02 03:18:25 +00:00
489e74cf73 Fix lint after #93278 (#93902)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93902
Approved by: https://github.com/jansel
2023-02-02 03:16:29 +00:00
6c93c3b58a Save and restore functorch configuration in minified scripts (#93853)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93853
Approved by: https://github.com/williamwen42
2023-02-02 03:09:46 +00:00
caf1b27196 Fix Upsample/EmbeddingBag module printing (#93850)
The fix generalizes but I want someone else to holistically figure this out.

Fixes https://github.com/pytorch/pytorch/issues/93233
Fixes https://github.com/pytorch/pytorch/issues/93512

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93850
Approved by: https://github.com/albanD
2023-02-02 02:50:29 +00:00
306dc2ed1a Make ShapeEnv deepcopy'able (#93403)
We sometimes put ShapeEnv on a GraphModule, and code in our testing
utils assumes that you can deepcopy a GraphModule, so it's good
for ShapeEnv to be deepcopy'able too.  This is done by making the
TLS module-wide rather than per-ShapeEnv.  We never really have
multiple ShapeEnvs, so this is a good trade-off.
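
As a rough sketch of the difference (illustrative names, not the real ShapeEnv):

```python
import copy
import threading

_tls = threading.local()  # module-wide thread-local state shared by all envs

class ShapeEnvSketch:
    def __init__(self):
        self.guards = []  # instances hold only plain, copyable data

    @property
    def suppress_guards(self):
        # The per-thread flag lives on the module-level TLS, so deepcopying
        # an env never has to copy a thread-local object.
        return getattr(_tls, "suppress_guards", False)

env = ShapeEnvSketch()
env_copy = copy.deepcopy(env)
print(env_copy.guards, env_copy.suppress_guards)  # [] False
```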

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93403
Approved by: https://github.com/jbschlosser
2023-02-02 02:50:23 +00:00
54eedf6fa6 Fix test_jit_cuda_archflags on Windows (#93332)
Fixes https://github.com/pytorch/pytorch/issues/61655

The test is flaky and fails whenever `test_jit_cuda_archflags` is run.  The latter, `test_jit_cuda_archflags`, was a slow test in the old Windows runner.  It's currently running again on trunk due to the problem with populating the slow-test JSON file. ~Interestingly, its performance is getting better on the new Windows G5 runner and it becomes a borderline slow test that only runs sometimes.~  Whenever it runs, the next test `test_jit_cuda_extension` will fail.

* Build and load different CUDA arch modules from `test_jit_cuda_archflags` in separate processes to avoid importing them into the current one.  The test only checks the build artifacts.  Importing them cause `test_jit_cuda_extension` to fail as describe in https://github.com/pytorch/pytorch/issues/61655
* Clean up the temp build dir on Windows.  Windows CUDA runner is non-ephemeral, so it's better to clean thing up properly to avoid any funny business the next time the runner is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93332
Approved by: https://github.com/davidberard98
2023-02-02 02:49:27 +00:00
d7b39b17ab Remove torch/_dynamo/optimizations/{analysis,log_args}.py (#93279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93279
Approved by: https://github.com/voznesenskym
2023-02-02 02:34:36 +00:00
d37bc6d04e Revert "[fx] add SymPy assumptions to FloorDiv (#93185)"
This reverts commit c4ccf7e12147671fdc3535a222260d687c2128a2.

Reverted https://github.com/pytorch/pytorch/pull/93185 on behalf of https://github.com/ezyang due to appears to be breaking people outside of ci
2023-02-02 02:26:11 +00:00
57d74aae55 Remove torch/_dynamo/optimizations/normalize.py (#93278)
This file was largely made obsolete by dispatcher level functionalization,
and has been disabled by config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93278
Approved by: https://github.com/voznesenskym
2023-02-02 02:02:54 +00:00
6a4bf3b71b feat(fx): make_fx should be aware of functions wrapped with @fx.wrap (#93273)
Fixes https://github.com/pytorch/pytorch/issues/89421

The strategy is to patch the given function wrapped with `@torch.fx.wrap` so that if a tensor tracer is active, we will `proxy_call` the function.

`proxy_call` will also skip certain checks if the function to proxy-call is not a torch op (checked with `isinstance(.., OpOverload)`).

@IvanYashchuk @ezyang @Chillee
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93273
Approved by: https://github.com/ezyang
2023-02-02 01:57:52 +00:00
dd8662d5c8 [BE] Migrate Anaconda Prune jobs from CircleCI to GHA (#93876)
We need periodical anaconda prune jobs to remove older packages (e.g. pytorch, torchvision, torchaudio, torchtext etc) from channels like pytorch-nightly and pytorch-test.
Currently it is done in circleci (e.g. https://app.circleci.com/pipelines/github/pytorch/pytorch/647201/workflows/72e5af30-0d54-44c1-8d9b-4c5502d27c9d/jobs/17260775) and triggered by postnightly update (https://github.com/pytorch/pytorch/tree/postnightly)

However, this postnightly branch triggers many useless jobs (dozens of them failed because the docker command was too long. Why? Because the change history was part of the docker command and it exceeded max INT).

<img width="756" alt="image" src="https://user-images.githubusercontent.com/109318740/216139179-3c913094-82cb-4605-99b7-23a21b4cbb36.png">

Therefore, we should stop the postnightly jobs (waste of resources) but save anaconda prune jobs.
This PR attempts to achieve this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93876
Approved by: https://github.com/atalman
2023-02-02 01:56:13 +00:00
ca9ebf9e2b Delete dynamo_import and inductor_import (#93851)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93851
Approved by: https://github.com/albanD, https://github.com/jansel
2023-02-02 01:51:29 +00:00
74592a43d0 Update tests to use ConfigModule.patch (#93254)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93254
Approved by: https://github.com/voznesenskym
2023-02-02 00:56:55 +00:00
31d466f925 [BE][ez] Move hardcoded constants to function args (#93874)
Also use tail-recursion instead of for loop to dismantle pyramid of doom

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93874
Approved by: https://github.com/clee2000
2023-02-02 00:47:18 +00:00
23d58fedb1 Use ConfigModule for _functorch.config (#93375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93375
Approved by: https://github.com/Chillee
2023-02-02 00:31:24 +00:00
0485bf5398 Avoid saving pointwise intermediate to global memory if followed by a reduction (#93810)
Should fix https://github.com/pytorch/pytorch/issues/91880 and maybe https://github.com/pytorch/pytorch/issues/91799

For this code:
```
@torch.compile
def f(a, b):
    return (a-b).sum(dim=-1).amax(dim=-1)

N = 2**14
K = 5

A = torch.randn(N, 1, K, device='cuda')
B = torch.randn(1, N, K, device='cuda')
bench(lambda: f(A, B), name=f"K={K}")
print(f"peak Mem: {torch.cuda.max_memory_allocated()/1e9}GB")
```

Before my change, we generated (simplified versions)
```
def triton_(in_ptr0, in_ptr1, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr):
    ...
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r1 = rindex
        tmp1 = tl.load(in_ptr1 + (5*r1), rmask, eviction_policy='evict_last')
       ...
        tmp18 = tmp14 + tmp17
        tl.store(out_ptr0 + (r1 + (16384*x0) + tl.zeros([XBLOCK, RBLOCK], tl.int32)), tmp18, rmask & xmask)
    _tmp20 = tl.zeros([XBLOCK, RBLOCK], tl.float32) + float("-inf")
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r1 = rindex
        tmp19 = tl.load(out_ptr0 + (r1 + (16384*x0)), rmask & xmask, eviction_policy='evict_last')
        _tmp20 = tl.where(rmask & xmask & (_tmp20 < tmp19), tmp19, _tmp20)
    tmp20 = tl.max(_tmp20, 1)[:, None]
    tl.store(out_ptr1 + x0, tmp20, xmask)
```
and after
```
def triton_(in_ptr0, in_ptr1, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr):
   ...
    _tmp19 = tl.zeros([XBLOCK, RBLOCK], tl.float32) + float("-inf")
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r1 = rindex
        tmp1 = tl.load(in_ptr1 + (5*r1), rmask, eviction_policy='evict_last')
        ...
        tmp18 = tmp14 + tmp17
        _tmp19 = tl.where(rmask & xmask & (_tmp19 < tmp18), tmp18, _tmp19)
    tmp19 = tl.max(_tmp19, 1)[:, None]
    tl.store(out_ptr1 + x0, tmp19, xmask)
```
<details>
  <summary>full kernels here
</summary>
Before:
  ```
def triton_(in_ptr0, in_ptr1, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr):
    xnumel = 16384
    rnumel = 16384
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rbase = tl.arange(0, RBLOCK)[None, :]
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (5*x0), xmask)
    tmp3 = tl.load(in_ptr0 + (1 + (5*x0)), xmask)
    tmp7 = tl.load(in_ptr0 + (2 + (5*x0)), xmask)
    tmp11 = tl.load(in_ptr0 + (3 + (5*x0)), xmask)
    tmp15 = tl.load(in_ptr0 + (4 + (5*x0)), xmask)
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r1 = rindex
        tmp1 = tl.load(in_ptr1 + (5*r1), rmask, eviction_policy='evict_last')
        tmp4 = tl.load(in_ptr1 + (1 + (5*r1)), rmask, eviction_policy='evict_last')
        tmp8 = tl.load(in_ptr1 + (2 + (5*r1)), rmask, eviction_policy='evict_last')
        tmp12 = tl.load(in_ptr1 + (3 + (5*r1)), rmask, eviction_policy='evict_last')
        tmp16 = tl.load(in_ptr1 + (4 + (5*r1)), rmask, eviction_policy='evict_last')
        tmp2 = tmp0 - tmp1
        tmp5 = tmp3 - tmp4
        tmp6 = tmp2 + tmp5
        tmp9 = tmp7 - tmp8
        tmp10 = tmp6 + tmp9
        tmp13 = tmp11 - tmp12
        tmp14 = tmp10 + tmp13
        tmp17 = tmp15 - tmp16
        tmp18 = tmp14 + tmp17
        tl.store(out_ptr0 + (r1 + (16384*x0) + tl.zeros([XBLOCK, RBLOCK], tl.int32)), tmp18, rmask & xmask)
    _tmp20 = tl.zeros([XBLOCK, RBLOCK], tl.float32) + float("-inf")
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r1 = rindex
        tmp19 = tl.load(out_ptr0 + (r1 + (16384*x0)), rmask & xmask, eviction_policy='evict_last')
        _tmp20 = tl.where(rmask & xmask & (_tmp20 < tmp19), tmp19, _tmp20)
    tmp20 = tl.max(_tmp20, 1)[:, None]
    tl.store(out_ptr1 + x0, tmp20, xmask)
```
After:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr):
    xnumel = 16384
    rnumel = 16384
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rbase = tl.arange(0, RBLOCK)[None, :]
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (5*x0), xmask)
    tmp3 = tl.load(in_ptr0 + (1 + (5*x0)), xmask)
    tmp7 = tl.load(in_ptr0 + (2 + (5*x0)), xmask)
    tmp11 = tl.load(in_ptr0 + (3 + (5*x0)), xmask)
    tmp15 = tl.load(in_ptr0 + (4 + (5*x0)), xmask)
    _tmp19 = tl.zeros([XBLOCK, RBLOCK], tl.float32) + float("-inf")
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r1 = rindex
        tmp1 = tl.load(in_ptr1 + (5*r1), rmask, eviction_policy='evict_last')
        tmp4 = tl.load(in_ptr1 + (1 + (5*r1)), rmask, eviction_policy='evict_last')
        tmp8 = tl.load(in_ptr1 + (2 + (5*r1)), rmask, eviction_policy='evict_last')
        tmp12 = tl.load(in_ptr1 + (3 + (5*r1)), rmask, eviction_policy='evict_last')
        tmp16 = tl.load(in_ptr1 + (4 + (5*r1)), rmask, eviction_policy='evict_last')
        tmp2 = tmp0 - tmp1
        tmp5 = tmp3 - tmp4
        tmp6 = tmp2 + tmp5
        tmp9 = tmp7 - tmp8
        tmp10 = tmp6 + tmp9
        tmp13 = tmp11 - tmp12
        tmp14 = tmp10 + tmp13
        tmp17 = tmp15 - tmp16
        tmp18 = tmp14 + tmp17
        _tmp19 = tl.where(rmask & xmask & (_tmp19 < tmp18), tmp18, _tmp19)
    tmp19 = tl.max(_tmp19, 1)[:, None]
    tl.store(out_ptr1 + x0, tmp19, xmask)
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93810
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-02 00:02:14 +00:00
8594529c2e Run ASAN in 4xlarge in all shards (#93879)
We used to have ASAN shards 4 and 5 running on 4xlarge because they timed out.  With the current issue with test-time collection, I guess the shard allocation has changed, and there are now timeouts from shards 1 to 3.  It's better to just have all shards use the same runner for consistency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93879
Approved by: https://github.com/clee2000
2023-02-01 23:37:23 +00:00
3e6978172e [dynamo] Handle general tensor attributes with a getattr proxy node (#91840)
**Background:** Before this PR, support in dynamo for tensor attributes (e.g. `x.H`, `x.T`, ...) need to be individually implemented one-by-one. This could potentially lead to errors, e.g. if the implementation in [variables/tensor.py](21c7c7c72f/torch/_dynamo/variables/tensor.py (L160)) differs from the implementation from a direct call to the attribute. For attributes that were not special-cased in tensor.py, dynamo tracing would fail. This PR adds generic support for tensor attributes that return tensors without needing to specially handle them. (Notably, for x.real and x.imag, which previously weren't supported).

**In this PR:** This directly creates a proxy node for a `"call_function"` node with `target=getattr`, and feeds it into wrap_fx_proxy. This will produce a TensorVariable for the attribute returned.

This also removes the implementations for H, T, mH, mT which were broken (previously `torch.relu(x.T)` would fail). They now fall back to this default implementation (for which `torch.relu(x.T)` passes).

**Further context**:

* Ed's original suggestion in [90463](https://github.com/pytorch/pytorch/pull/90463#discussion_r1043398340) is to use `torch.Tensor.H.__get__(x)`. I wasn't able to get this to work; fx compilation fails with `getset_descriptor does not have attribute __module__`. Basically, the `__module__` attribute which is available on most python attributes, is not available on `getset_descriptor` objects. (i.e., these are implemented in C++ as attributes on torch.Tensor, so they don't obey some assumptions made by fx)
* Although both tensor attributes and methods (like `x.relu()`) both go through this, this PR should only handle attributes (e.g. see the `"getset_descriptor"` in variables/tensor.py). Methods are handled already by by GetAttrVariable.
* Prior to this PR, we already returned GetAttrVariables for unsupported attrs: the parent caller would catch the NotImplementedError and fall back to returning a GetAttrVariable. But if this GetAttrVariable was ever passed into a torch.\* function (as it could quite possibly be, since most of these attrs are tensors), it would fail because its proxy node would be missing an [example_value](https://github.com/pytorch/pytorch/blob/master/torch/_dynamo/utils.py#L1017). So: before, for some tensor x, `x.real` would work fine; but `torch.relu(x.real)` would fail.

**Testing**: added tests in test_misc.py for x.real, x.imag, x.T, x.real.T.
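
A quick sketch of the behavior this enables (assuming a build where `torch.compile` is available):

```python
import torch

def f(x):
    # x.real and x.T are tensor attributes that now trace through a generic
    # getattr proxy node instead of per-attribute special cases.
    return torch.relu(x.real.T)

x = torch.randn(4, 3, dtype=torch.complex64)
compiled = torch.compile(f)
print(torch.allclose(compiled(x), f(x)))  # True
```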

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91840
Approved by: https://github.com/ezyang
2023-02-01 22:34:03 +00:00
8c1ee89f19 Added super init to Module (#91819)
Added a super().__init__() call to Module for complex user modules derived from multiple Python classes.
The call is added at the end of Module.__init__ so it doesn't change any functionality of the Module class.

I am working on building a module for simulating analog neural networks in PyTorch, and this small change is really useful for that. We can definitely think of many other useful cases, especially for more complex module or MRO hierarchies.
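
A minimal sketch of why this matters for cooperative multiple inheritance (`TrackingMixin` is a made-up mixin, not part of PyTorch):

```python
import torch.nn as nn

class TrackingMixin:
    # Hypothetical mixin relying on cooperative __init__ via the MRO.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.call_count = 0

class MyLayer(nn.Module, TrackingMixin):
    def __init__(self):
        # nn.Module.__init__ now ends with super().__init__(), so
        # TrackingMixin.__init__ is reached even though nn.Module sits
        # earlier in the MRO.
        super().__init__()
        self.linear = nn.Linear(4, 4)

layer = MyLayer()
print(layer.call_count)  # 0 -- the mixin's __init__ actually ran
```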

Issues: https://github.com/pytorch/pytorch/issues/28746, https://github.com/pytorch/pytorch/issues/48626, https://github.com/pytorch/pytorch/issues/61662, https://github.com/pytorch/pytorch/issues/74036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91819
Approved by: https://github.com/albanD
2023-02-01 22:17:59 +00:00
207399cf5f Add repro_forward_only for inference debugging (#93856)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93856
Approved by: https://github.com/williamwen42
2023-02-01 22:03:13 +00:00
03b465a6d0 Add --iterations to benchmark script (#93858)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93858
Approved by: https://github.com/williamwen42
2023-02-01 21:56:49 +00:00
3fb6e119e2 [PT-D][TP] Fix the module registration in TP API (#93412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93412
Approved by: https://github.com/XilunWu
2023-02-01 21:03:56 +00:00
498c6ed8d8 Add missing format string (#93866)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93866
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-02-01 20:56:46 +00:00
87b9ab4870 [CI] Add Py-3.11 wheels for all platforms (#93400)
As python-3.11 is now available on Conda for both MacOS and Windows

Disable dimtorch for Python-3.11 on Windows as its current implementation relies on internal symbols which are not exposed on Windows runtime (and to be frank, not sure why they are exposed on Linux/Mac), see https://github.com/pytorch/pytorch/issues/93854

As with the previous PR, most of the changes are not in PyTorch repo, but in builder, namely:
b71049dcbc
ece340ef7e
b0071ac366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93400
Approved by: https://github.com/weiwangmeta, https://github.com/atalman
2023-02-01 19:51:19 +00:00
2ea3036d8b Disable cudagraphs by default (#93253)
`torch.compile` used to disable cudagraphs by default (removed one PR up in this stack), which was a bit confusing because it caused the config setting to be ignored.
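
To opt back in after this change, something like the following should work (the config attribute name is my assumption; check `torch/_inductor/config.py`):

```python
import torch
import torch._inductor.config as inductor_config

# Assumed flag name, for illustration only.
inductor_config.triton.cudagraphs = True

@torch.compile
def f(x):
    return x.sin() + x.cos()
```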

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93253
Approved by: https://github.com/ngimel
2023-02-01 19:38:05 +00:00
45eadc2c4d ConfigModule for _{dynamo,inductor}.config (#93252)
This refactors the way dynamo/inductor configs are handled to check for invalid configs and add options like patching and serialization.
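
As an example of the patching this enables, roughly (a sketch; `verbose` stands in for any key actually defined in `torch/_dynamo/config.py`):

```python
import torch._dynamo.config as dynamo_config

# Temporarily override a setting; the original value is restored on exit.
with dynamo_config.patch(verbose=True):
    ...  # run compilation with the patched setting

# patch() also works as a decorator, which is handy in tests:
@dynamo_config.patch(verbose=True)
def test_something():
    ...
```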

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93252
Approved by: https://github.com/voznesenskym
2023-02-01 19:38:05 +00:00
a23ed38f9a [mta][foreach] Implement fused adamw (#88015)
related: https://github.com/pytorch/pytorch/issues/68041, https://github.com/pytorch/pytorch/issues/71274, https://github.com/pytorch/pytorch/issues/80167
possibly related to https://github.com/pytorch/pytorch/issues/80595#issuecomment-1178519436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88015
Approved by: https://github.com/albanD, https://github.com/ngimel
2023-02-01 19:32:29 +00:00
86ab4d49d4 [pruning][core][feature] LSTM Structured Pruning prune_functions + pattern (#90801)
Summary:

This PR adds in support for LSTM Structured Pruning.

- Adds in LSTMSaliencyPruner, an implemented pruner that splits the packed weights, finds the appropriate mask for each piece individually based on saliency, and then combines to create an overall mask for the LSTM.
- Adds in pruning functions for LSTM pruning, which will split the weights, apply the masks, and then recombine the pruned weights. Works for both single and multiple-layer LSTMs.

Also added a basic pattern to the default set of of patterns for
LSTM -> Linear pruning
LSTM -> LayerNorm -> Linear pruning

Adds in test to check that LSTM pruning works, as well as for LSTMSaliencyPruner
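
A rough sketch of the per-gate saliency idea described above (a hypothetical helper, not the actual `LSTMSaliencyPruner`):

```python
import torch

def lstm_gate_saliency_mask(weight_ih, sparsity=0.5):
    # Split the packed (4*hidden, input) weight into its 4 gate blocks, score
    # rows by L1 norm, build a per-block row mask, then recombine into one
    # packed mask of the original shape.
    gates = torch.chunk(weight_ih, 4, dim=0)
    masks = []
    for w in gates:
        saliency = w.abs().sum(dim=1)            # one score per output row
        k = max(int(sparsity * w.shape[0]), 1)
        threshold = saliency.kthvalue(k).values
        masks.append((saliency > threshold).float().unsqueeze(1).expand_as(w))
    return torch.cat(masks, dim=0)

lstm = torch.nn.LSTM(input_size=8, hidden_size=16)
mask = lstm_gate_saliency_mask(lstm.weight_ih_l0)
print(mask.shape)  # torch.Size([64, 8])
```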

Test Plan:
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_linear_single_layer`
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_linear_multiple_layer`
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_layernorm_linear_single_layer`
`python test/test_ao_sparsity.py -- TestBaseStructuredSparsifier.test_prune_lstm_layernorm_linear_multiple_layer`
`python test/test_ao_sparsity.py -- TestSaliencyPruner.test_lstm_saliency_pruner_update_mask`

Differential Revision: [D42199001](https://our.internmc.facebook.com/intern/diff/D42199001)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90801
Approved by: https://github.com/jerryzh168
2023-02-01 19:29:03 +00:00
f577a5279b Enable USE_CUDA (#92640)
Summary: `USE_CUDA` is needed in the bazel definitions to ensure that `USE_CUDA` is applied everywhere it should be.

We also fix some test code to use the correct properties.

Test Plan: Sandcastle

Differential Revision: D42616147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92640
Approved by: https://github.com/ezyang
2023-02-01 19:00:26 +00:00
e80af53bf0 Move bazel back to pull (#93867)
Revert of https://github.com/pytorch/pytorch/pull/93296 but in a new PR b/c xla was already put back in https://github.com/pytorch/pytorch/pull/93334
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93867
Approved by: https://github.com/huydhn
2023-02-01 18:58:31 +00:00
6fe234ecc4 pnp: move shadow loggers to parent module (#91428)
Summary:

Before this PR, PNP added shadow loggers inside the shadow wrapper modules.

This PR moves those loggers to the parent module.

There are a couple of benefits:
1. this will unbreak features of quantization API which don't support loggers (such as hardcoding model output to be quantized)
2. this makes it easier to look at the parent graph and visualize what is logged, since now all the logging is in the same graph
3. this will make it easier to implement features such as propagation error calculation in the future

Test plan:

```
python test/test_quantization.py -k NShadows
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91428
Approved by: https://github.com/jerryzh168
2023-02-01 18:34:04 +00:00
56f9475625 ns: change PNP testing to use QNNPACK (#91421)
Summary:

Changes the PNP test cases to use QNNPACK. The only reason is because
I'm switching to Mac M1 as my primary machine, which supports QNNPACK
but not fbgemm, and it's convenient for me to be able to run these
locally.

PNP itself is not backend specific, so it does not matter which backend
the functionality is tested on.

Test plan:

```
python test/test_quantization.py -k NShadows
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91421
Approved by: https://github.com/jerryzh168
2023-02-01 18:34:04 +00:00
1dcd2609b5 Add retries for get_workflow_job_id and try catch in upload_test_stats (#93401)
upload_test_stats keeps failing because it can't handle the case where the id is workflow-<workflow_id>, so add a try/except for this.

Add retries to get_workflow_job_id to try and reduce the number of times the id can't be found

Failure to upload test stats and inability to get the job id cause our sharding infra and slow test infra (probably also flaky test detection) to be less effective.  This does not completely resolve the issue since we do rely on the job id

Failure to get the workflow job id happens tragically often; hopefully retries will help.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93401
Approved by: https://github.com/huydhn
2023-02-01 18:33:32 +00:00
eb987abd24 Clean up leftover processes on non-ephemeral Windows runner (#93414)
In some rare cases, checking out PyTorch on non-ephemeral Windows G5 runner could fail because of leftover processes from the previous workflow.  For example, https://github.com/pytorch/pytorch/actions/runs/4058503816/jobs/6986773162
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93414
Approved by: https://github.com/clee2000
2023-02-01 17:52:56 +00:00
77cbaedd5c [docs] Add section about tensor hooks on in-place in autograd note (#93116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93116
Approved by: https://github.com/albanD
2023-02-01 17:35:21 +00:00
76b999803a add filelock as a dependency (#91607)
`filelock` is a dependency now for inductor's caching mechanism and CPU backend.

Add `filelock` as a dependency

Fixes https://github.com/pytorch/pytorch/issues/93499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91607
Approved by: https://github.com/anijain2305, https://github.com/jansel
2023-02-01 17:30:55 +00:00
d5901fcc80 fix(fx): make all make_fx invocations isolated (opaque to higher make_fx invocations) by default (#93290)
Fixes https://github.com/pytorch/pytorch/issues/88996#issuecomment-1409174554

Example code:
```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx, wrapper_and_args_for_make_fx

@torch.fx.wrap
def func(a, b):
    return b.expand([1, a.shape[0], b.shape[-1]])

a = torch.randn(3, 4)
b = torch.randn(4)

class TestMode(torch.overrides.TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs={}):
        if torch.overrides.resolve_name(func) in ["torch.Tensor.expand"]:
            print(f"TestMode: {func} {args} {kwargs}")
            wrapped, all_args = wrapper_and_args_for_make_fx(func, args, kwargs)
            gm = make_fx(wrapped, tracing_mode="real")(all_args)

        return func(*args, **kwargs)

with TestMode():
    gm = make_fx(func, tracing_mode="symbolic")(a, b)

gm.graph.print_tabular()
```
Before:
```
opcode         name        target               args                              kwargs
-------------  ----------  -------------------  --------------------------------  --------
placeholder    a_1         a_1                  ()                                {}
placeholder    b_1         b_1                  ()                                {}
call_function  detach      aten.detach.default  (b_1,)                            {}
call_function  detach_1    aten.detach.default  (detach,)                         {}
call_function  sym_size    aten.sym_size        (a_1, 0)                          {}
call_function  sym_size_1  aten.sym_size        (b_1, 0)                          {}
call_function  expand      aten.expand.default  (b_1, [1, sym_size, sym_size_1])  {}
call_function  detach_2    aten.detach.default  (expand,)                         {}
call_function  expand_1    aten.expand.default  (b_1, [1, sym_size, sym_size_1])  {}
output         output      output               (expand_1,)                       {}
```

After:
```
opcode         name        target               args                              kwargs
-------------  ----------  -------------------  --------------------------------  --------
placeholder    a_1         a_1                  ()                                {}
placeholder    b_1         b_1                  ()                                {}
call_function  sym_size    aten.sym_size        (a_1, 0)                          {}
call_function  sym_size_1  aten.sym_size        (b_1, 0)                          {}
call_function  expand      aten.expand.default  (b_1, [1, sym_size, sym_size_1])  {}
output         output      output               (expand_1,)                       {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93290
Approved by: https://github.com/ezyang
2023-02-01 17:28:48 +00:00
2fc2ca7652 [BE]: Fix CMake LTO policy on pytorch (#93388)
Not this is a non-functional change since non of our CIs actually build with LTO.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93388
Approved by: https://github.com/albanD
2023-02-01 17:06:53 +00:00
bf2e2fea41 [dynamo] getattr for EnumVariables (#93397)
I'm not sure if this is the correct fix, but it allowed me to enable the test case I added which I encountered in an internal model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93397
Approved by: https://github.com/yanboliang
2023-02-01 16:29:39 +00:00
cyy
37f7c00a8a More fixes and improved clang-tidy checkers (#93213)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93213
Approved by: https://github.com/Skylion007
2023-02-01 14:44:17 +00:00
679e869af0 [inductor] only check mutations attr for TritonKernel (#92277)
Fixes https://github.com/pytorch/pytorch/issues/93506.

In https://github.com/pytorch/pytorch/pull/91575, for in-place buffers reuse, a check has been added on the `mutations` attr of the kernel:
5e0d3458eb/torch/_inductor/scheduler.py (L300)

Since `mutations` are not tracked in cpp kernels, `getattr(V.kernel, "mutations", None) is not None` will always be `False` for them.
This PR only checks the `mutations` attr for TritonKernel.

UT is added to guarantee that `in_out_ptr` is in the generated code.
#### Cpp code before this fix:
```python
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_chunyuan/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long i0=0; i0<8; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
                auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(8.0));
                auto tmp2 = tmp0 / tmp1;
                tmp2.store(out_ptr0 + 16*i0);
            }
            #pragma omp for simd simdlen(8)
            for(long i0=128; i0<128; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = static_cast<float>(8.0);
                auto tmp2 = tmp0 / tmp1;
                out_ptr0[i0] = tmp2;
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    buf0 = empty_strided((2, 8, 8), (64, 8, 1), device='cpu', dtype=torch.float32)
    extern_kernels.bmm(as_strided(arg0_1, (2, 8, 4), (32, 4, 1)), as_strided(arg1_1, (2, 4, 8), (32, 1, 4)), out=buf0)
    del arg0_1
    del arg1_1
    buf1 = empty_strided((1, 2, 8, 8), (128, 64, 8, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(buf0.data_ptr()), c_void_p(buf1.data_ptr()))
    return (buf1, )
```
#### Cpp code after this fix:
```python
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_chunyuan/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(float* __restrict__ in_out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for
            for(long i0=0; i0<8; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + 16*i0);
                auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(8.0));
                auto tmp2 = tmp0 / tmp1;
                tmp2.store(in_out_ptr0 + 16*i0);
            }
            #pragma omp for simd simdlen(8)
            for(long i0=128; i0<128; i0+=1)
            {
                auto tmp0 = in_out_ptr0[i0];
                auto tmp1 = static_cast<float>(8.0);
                auto tmp2 = tmp0 / tmp1;
                in_out_ptr0[i0] = tmp2;
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    buf0 = empty_strided((2, 8, 8), (64, 8, 1), device='cpu', dtype=torch.float32)
    extern_kernels.bmm(as_strided(arg0_1, (2, 8, 4), (32, 4, 1)), as_strided(arg1_1, (2, 4, 8), (32, 1, 4)), out=buf0)
    del arg0_1
    del arg1_1
    buf1 = as_strided(buf0, (1, 2, 8, 8), (128, 64, 8, 1)); del buf0  # reuse
    kernel_cpp_0(c_void_p(buf1.data_ptr()))
    return (buf1, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92277
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-02-01 14:12:33 +00:00
c4ccf7e121 [fx] add SymPy assumptions to FloorDiv (#93185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93185
Approved by: https://github.com/ezyang
2023-02-01 13:50:59 +00:00
f1030dcc6d [Re-open 90267] [inductor] weight prepack for single conv_transpose2d (#91956)
Re-open https://github.com/pytorch/pytorch/pull/90267 since earlier pr on that stack got reverted.
Depend on internal ideep upgrade.
[Update]: internal ideep upgrade issue is resolved in https://github.com/pytorch/pytorch/pull/92239.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91956
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-02-01 12:36:52 +00:00
66fd99cc09 Use symbolic tracing_mode for aot repro with dynamic_shapes (#93393)
This is by no means a complete fix for broken aot symbolic
tracing, but it is definitely better than what we have right now.

More context: https://github.com/pytorch/pytorch/issues/93367

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93393
Approved by: https://github.com/SherlockNoMad, https://github.com/bdhirsh
2023-02-01 11:51:00 +00:00
298075e183 use aten parallel on lu factor (#93037)
https://github.com/pytorch/pytorch/issues/91536 reports that torch.inv is pretty slow for large batches of small matrices on CUDA.

I checked the CPU implementation and found an optimization opportunity.
For torch.inv, the CPU path solves it with `lu_factor` + `lu_solve`.
`lu_factor` loops over the `batch_size` dimension and relies on the parallelism inside LAPACK:
 - For small matrices, the computational complexity is too small to parallelize effectively inside LAPACK.
 - Even for large matrices, LAPACK's parallelization efficiency is poor (it performs worse than using at::parallel outside).
 - Only for small batch size + small matrix size does the overhead of OpenMP parallelism outside become too large.

Based on the above observations, using at::parallel outside of lu_factor gives a pretty large benefit.

Here is the code/data collected on a 32-core ICX system.
```python
import torch
import time

def bench(bs, r):
    x = torch.randn(int(bs), r, r)
    start = time.time()
    for i in range(100):
        y1 = torch.linalg.lu_factor(x)
    end = time.time()
    print(r, bs)
    print(end - start)
    print((end - start)/(r**3))

for r in (4, 16, 64):
    for bs in (1e2, 1e4, 1e6):
        bench(bs, r)
```

| bs/rank | 100/4 |  10000/4 |  1000000/4 | 100/16 |  10000/16|  1000000/16| 100/64|  10000/64|  1000000/64|
| ---- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| parallel inside lapack | 0.0028 |1.077 | 11.99|0.0163 | 1.5260|153.17 |0.2021|20.93 | 1877|
| parallel outside lapack | 0.0087 | 0.0247 | 1.566| 0.0044|0.1678 |17.63|0.038|2.311 | 208.6|
|speed up ratio| 0.32x | 43.6x  | 7.65x|3.70x |9.09x |8.69x |5.32x |9.06x |9x |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93037
Approved by: https://github.com/lezcano
2023-02-01 10:05:59 +00:00
bdca5fcd43 cherry-picking autodiff support for gather/index_select (#93333)
added gather & index_select in autodiff;
test coverage should be handled by opinfo;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93333
Approved by: https://github.com/ngimel
2023-02-01 09:47:40 +00:00
b484d17c24 _sparse_coo_tensor_with_dims_and_tensors backward: simplify and optimize (#91704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91704
Approved by: https://github.com/albanD, https://github.com/cpuhrsch
2023-02-01 09:02:25 +00:00
6a2838eec5 [jit] jit._drop fun modifier to allow in jit class non-jit decl funs (#93012)
`@torch.jit.unused` and `@torch.jit.ignore` do not allow keeping a member function with a non-scriptable declaration (e.g. return type) in a torch-scripted class.

This adds a FunctionModifier `_DROP` to fully skip such functions during scripting while keeping them in the code of the scripted class.

E.g. it can be used for:

```
@torch.jit._drop
def __fx_create_arg__(self, tracer: torch.fx.Tracer) -> torch.fx.node.Argument:
    # torch.fx classes are not scriptable
    return tracer.create_node(
        "call_function",
        CFX,
        args=(tracer.create_arg(self.features),),
        kwargs={},
    )

def __iter__(self) -> Iterator[torch.Tensor]:
    return iter(self.a)
```

Testing:
Added test case in `test/jit/test_types.py` with non-scriptable type annotations (fx.* classes) that fails before fix and passes after.

```
python test/test_jit.py
```

Differential Revision: [D42774830](https://our.internmc.facebook.com/intern/diff/D42774830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93012
Approved by: https://github.com/davidberard98
2023-02-01 09:02:05 +00:00
994f85d639 sparse_mask: extend lhs to sparse COO tensors (#92248)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92248
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-02-01 09:00:07 +00:00
6a7d6cc30d Introduce core_aten_decompositions (#93131)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93131
Approved by: https://github.com/ngimel
2023-02-01 08:35:46 +00:00
f77f88fbc7 [Quant] X86 qengine always uses fbgemm kernels on OS other than Linux (#93218)
**Summary**
The X86 quantization backend (qengine) with oneDNN kernels has not been validated on OSes other than Linux. So, let it fall back to fbgemm if the OS is not Linux. This makes sure the behavior on Windows/Mac is the same as with the previous default fbgemm qengine on x86 CPUs.
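For reference, a minimal sketch of the user-facing qengine selection this affects (assuming an x86 CPU build where the 'x86' engine is listed):

```python
import torch

print(torch.backends.quantized.supported_engines)  # e.g. ['none', 'onednn', 'x86', 'fbgemm']
torch.backends.quantized.engine = "x86"
# On Linux the 'x86' engine may use oneDNN kernels; on Windows/macOS it now
# falls back to fbgemm kernels, matching the previous default behavior.
```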

**Test plan**
CI checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93218
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-02-01 08:12:39 +00:00
776079b5bc Fix test_file_system_checkpoint_cpu.py temp directory usage (#93302)
Fixes https://github.com/pytorch/pytorch/issues/93245

This failure started happening recently. `tempfile.mkdtemp()` has already created the temporary directory, so removing it with `shutil.rmtree` and then recreating it with `os.makedirs` doesn't make much sense to me.  The flakiness comes from `shutil.rmtree` sometimes failing to remove the temporary directory.  Here is the error:

```
======================================================================
ERROR [1.814s]: test_load_rowwise_to_colwise_thread_count_2 (__main__.TestDistributedReshardOnLoad)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 539, in wrapper
    self._join_processes(fn)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 765, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 810, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 663, in run_test
    getattr(self, test_name)()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 541, in wrapper
    fn()
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 252, in instantiated_test
    test(self, **param_kwargs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/distributed/_shard/sharded_tensor/__init__.py", line 94, in wrapper
    func(self, *args, **kwargs)
  File "/var/lib/jenkins/workspace/test/distributed/checkpoint/test_file_system_checkpoint_cpu.py", line 364, in test_load_rowwise_to_colwise
    os.makedirs(path)
  File "/opt/conda/envs/py_3.8/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/tmp/tmps5rxw4hb'
```

If the temporary directory really needs to be cleaned up, another way would be to remove everything underneath it, but leave the folder alone.
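A hypothetical helper along those lines (not the code in this PR), which clears the directory's contents but keeps the directory that `tempfile.mkdtemp()` created:

```python
import os
import shutil
import tempfile

def clear_dir(d: str) -> None:
    for entry in os.listdir(d):
        full = os.path.join(d, entry)
        if os.path.isdir(full):
            shutil.rmtree(full)
        else:
            os.remove(full)

path = tempfile.mkdtemp()
open(os.path.join(path, "ckpt.bin"), "wb").close()
clear_dir(path)  # no rmtree/makedirs race: the directory itself survives
assert os.path.isdir(path) and not os.listdir(path)
```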
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93302
Approved by: https://github.com/kumpera
2023-02-01 07:52:45 +00:00
eea752f853 [Quant][ONEDNN] Fix weight reorder issue for grouped convolution (#91934)
**Summary**
For onednn quant backend only.
QConv weight may be reordered to another blocked format if the input shape changes at runtime. It's a bug that the group info is not retained for such reordering, which may lead to a wrong weight shape after reordering. This PR fixes that bug.

**Test plan**
python test/test_quantization.py -k test_conv_reorder_issue_onednn

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91934
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-02-01 07:43:53 +00:00
2457d0ef4f [Dynamo][Easy] Remove duplicated code in builder.py (#93809)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93809
Approved by: https://github.com/williamwen42
2023-02-01 07:26:19 +00:00
9daca46dc4 [jit][await] Apply review comments (#93284)
Differential Revision: [D42849920](https://our.internmc.facebook.com/intern/diff/D42849920)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93284
Approved by: https://github.com/malfet
2023-02-01 07:22:06 +00:00
feb6c9ae9b Partial revert of autogen view_copy ops which return lists (#93411)
Differential Revision: D42898313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93411
Approved by: https://github.com/larryliu0820
2023-02-01 06:31:58 +00:00
9d1263a88d [ONNX] Fix Gather replacement in RNN peephole (#93120)
Since PR https://github.com/pytorch/pytorch/pull/58691, replacing the second input of `Gather` from 0 to 1 affects other innocent nodes. In issue #91526, onnx::Range starts from 0, and that 0 gets changed by this mechanism because it is shared with onnx::Gather. This PR creates a fully independent Constant 0 for the replacement. NOTE: The PR passes all existing RNN tests locally in case CI doesn't include the RNN tests.

~~TODO: test~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93120
Approved by: https://github.com/BowenBao
2023-02-01 06:29:17 +00:00
2cd8cb02a1 [inductor] Don't skip realize heuristics with dynamic shapes (#93814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93814
Approved by: https://github.com/Chillee, https://github.com/ngimel
2023-02-01 06:27:45 +00:00
ac791bddce Refactor dynamo distributed test helpers to be reusable (#93187)
The point is to let test helpers previously defined and used in `test_dynamo_distributed.py` be reused from a new file, `test_traceable_collectives.py`, later in this stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93187
Approved by: https://github.com/kumpera
2023-02-01 06:09:42 +00:00
60e503d468 [dtensor][6/N] change to a better/safer op registration (#90735)
This PR changes the op registration to a better mechanism: we now
require registration with the OpOverload directly instead of the op
key string. This has several benefits (see the sketch after this list):
1. We ensure that the registration registers the correct op, which
  means it fails if the registration is wrong (this PR already fixes
  several op registration errors found by switching to direct
  OpOverload registration).
2. If the overload name gets changed or deleted, we know immediately at
  the source-code compilation level, which is safer.
3. This also keeps it consistent with the op registration mechanism of
  other tensor subclasses within PyTorch.
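A small illustrative sketch of the difference (hypothetical names, not the actual DTensor registration code):

```python
import torch

def add_sharding_rule(op_schema):
    # stand-in for a real sharding propagation rule
    return op_schema

# Before (sketch): keyed by an op-name string; a typo or a renamed overload is
# only discovered when that op is actually dispatched at runtime.
rules_by_str = {"aten.add.Tensor": add_sharding_rule}

# After (sketch): keyed by the OpOverload object itself, so a wrong or deleted
# overload fails as soon as this registration code is imported.
rules_by_overload = {torch.ops.aten.add.Tensor: add_sharding_rule}
print(list(rules_by_overload))
```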

Differential Revision: [D42876250](https://our.internmc.facebook.com/intern/diff/D42876250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90735
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-02-01 05:06:33 +00:00
42633cf5f9 Inductor cpp wrapper: cache the loading of the kernel (#89742)
### Pitch
Cache the loaded kernel to reduce the overhead.

#### Code before:
```cpp
std::vector<at::Tensor> call_0(std::tuple<at::Tensor&, at::Tensor&> args) {
    ...
    auto kernel_cpp_0_lib = dlopen("/tmp/torchinductor_xxx/yr/cyr3uymlc6pgvnimx3fnynaa4t7ldafeqzhe5zpizmvorisx4hb2.so", RTLD_NOW);
    assert(kernel_cpp_0_lib != nullptr);
    void (*kernel_cpp_0)(const float*,const float*,float*,float*);
    *(void **) (&kernel_cpp_0) = dlsym(kernel_cpp_0_lib, "kernel");
    kernel_cpp_0((float*)(arg0_1.data_ptr()), (float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()), (float*)(buf1.data_ptr()));
    ...
}
```

#### Code after:
```cpp
template <typename KernelFunc>
KernelFunc load_cpp_kernel(const char* so_filename) {
    KernelFunc kernel_cpp;
    auto kernel_cpp_lib = dlopen(so_filename, RTLD_NOW);
    assert(kernel_cpp_lib != nullptr);
    *(void **) (&kernel_cpp) = dlsym(kernel_cpp_lib, "kernel");
    return kernel_cpp;
}

std::vector<at::Tensor> call_0(std::tuple<at::Tensor&, at::Tensor&> args) {
    ...
    static auto kernel_cpp_0 = load_cpp_kernel<void (*)(const float*,const float*,float*,float*)>("/tmp/torchinductor_xxx/yr/cyr3uymlc6pgvnimx3fnynaa4t7ldafeqzhe5zpizmvorisx4hb2.so");
    kernel_cpp_0((float*)(arg0_1.data_ptr()), (float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()), (float*)(buf1.data_ptr()));
    ...
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89742
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-02-01 05:05:50 +00:00
9a56997fe1 [dtensor][5/N] add cached propagator for TP (#90734)
This PR adds a cached propagator for TP use: it caches the sharding
propagation decision for the same input sharding on an operator, which
could improve eager-mode performance.
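A rough sketch of the caching idea (hypothetical signature, not the DTensor internals): memoize the decision per (operator, input sharding) key so repeated eager calls with the same layout skip re-running propagation.

```python
from functools import lru_cache

def propagate(op_name: str, input_shardings: tuple) -> tuple:
    # stand-in for the (relatively expensive) sharding propagation
    return ("replicate",) * len(input_shardings)

@lru_cache(maxsize=None)
def cached_propagate(op_name: str, input_shardings: tuple) -> tuple:
    return propagate(op_name, input_shardings)

print(cached_propagate("aten.add.Tensor", ("shard(0)", "replicate")))
print(cached_propagate("aten.add.Tensor", ("shard(0)", "replicate")))
print(cached_propagate.cache_info())  # the second call is a cache hit
```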

Differential Revision: [D42876249](https://our.internmc.facebook.com/intern/diff/D42876249)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90734
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-02-01 05:04:08 +00:00
b072245178 [dtensor][4/N] refactor dispatching logic and add propagator (#90733)
This PR refactors the dispatching logic to make it cleaner and
isolates the sharding propagation logic into a separate class.

This is so that we can implement more complicated propagation features
later.

Differential Revision: [D42876251](https://our.internmc.facebook.com/intern/diff/D42876251)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90733
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-02-01 05:02:11 +00:00
965f4ea3ba [Reland] Add sym_size/stride/numel/storage_offset to native_function.yaml (#91… (#92402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91919
Approved by: https://github.com/ezyang

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92402
Approved by: https://github.com/ezyang
2023-02-01 04:47:49 +00:00
79db5bcc9d [vision hash update] update the pinned vision hash (#93323)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93323
Approved by: https://github.com/pytorchbot
2023-02-01 03:41:20 +00:00
e752ec6dea Re-enable xla workflow (#93334)
Re-enables xla workflow after addressing https://github.com/pytorch/xla/issues/4535. The pytorch/xla repo is [green](https://app.circleci.com/pipelines/github/pytorch/xla/16130/workflows/aabf6879-b510-47e1-8abb-b3cf8398957a/jobs/38162) again after GitHub resolved the outage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93334
Approved by: https://github.com/malfet
2023-02-01 02:41:27 +00:00
10910758f4 Make dynamo tests work under pytest (#93251)
This now runs without error:
```
pytest test/dynamo
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93251
Approved by: https://github.com/ezyang, https://github.com/voznesenskym, https://github.com/mlazos
2023-02-01 02:11:52 +00:00
08041c5264 Configurable repro_tolerance for same_two_models (#93398)
Fixes https://github.com/pytorch/pytorch/issues/93293

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93398
Approved by: https://github.com/SherlockNoMad
2023-02-01 01:41:48 +00:00
3bae5484d0 Typofix (#93402)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93402
Approved by: https://github.com/albanD
2023-02-01 01:39:49 +00:00
0f802eedc2 [Quant][FX] Lower QConvAddReLU2d for onednn backend (#91155)
**Summary**
Add quantization mappings for QConvAddReLU2d for int8 inference for onednn backend. The fusion and lowering is supported only in FX mode.

**Test plan**
```
python -m pytest test_quantization.py -k test_fuse_conv_bn_add_relu_onednn
python -m pytest test_quantization.py -k test_fuse_conv_bn_add_relu_by_default
python -m pytest test_quantization.py -k test_fuse_conv_bn_add_relu_lowering
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91155
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-02-01 01:18:52 +00:00
e77f28a03d [Quant] Add fused ConvAddReLU2d module for onednn backend (#91154)
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused ConvAddReLU2d module for the onednn backend, which will be used for int8 inference with the onednn backend. This module cannot be called with other quantization backends; otherwise an error is thrown.

**Test plan**
```
python -m pytest test_quantization.py -k test_conv2d_add_relu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91154
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-02-01 01:16:23 +00:00
ef4118e435 [Quant][FX] Lower QConvAdd2d for onednn backend (#91153)
**Summary**
Add quantization mappings for QConvAdd2d for int8 inference for onednn backend. The fusion and lowering is supported only in FX mode.

**Test plan**
```
python -m pytest test_quantization.py -k test_fuse_conv_bn_add_relu_onednn
python -m pytest test_quantization.py -k test_fuse_conv_bn_add_relu_by_default
python -m pytest test_quantization.py -k test_fuse_conv_bn_add_relu_lowering
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91153
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-02-01 01:14:12 +00:00
eb9c4c8929 [ONNX] Properly skip tests by onnx version via 'unittest.skipIf' (#93316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93316
Approved by: https://github.com/justinchuby
2023-02-01 01:14:07 +00:00
53c3555a6a [Quant] Add fused ConvAdd2d module for onednn backend (#91152)
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused `ConvAdd2d` module for the onednn backend, which will be used for int8 inference with the onednn backend. This module cannot be called with other quantization backends; otherwise an error is thrown.

**Test plan**
```
python -m pytest test_quantization.py -k test_conv2d_add
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91152
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-02-01 01:11:25 +00:00
7bcc446ede [Vulkan][Optimize for Mobile] Avoid dereferencing element [0] if the vector is empty (#92918)
Summary:
Avoid dereferencing element [0] if the vector is empty.
___

In ```transferInputOutputBackends```, one of the rewrite passes for Vulkan ```optimize_for_mobile```, an out of bounds access happens when trying to insert a backend transfer for an input if that input's ```uses()``` is empty. This diff corrects that issue.

Test Plan:
Run tests
___

Phabricator + CI Tests

Reviewed By: SS-JIA

Differential Revision: D41296037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92918
Approved by: https://github.com/SS-JIA, https://github.com/kirklandsign
2023-02-01 01:09:19 +00:00
e83f473bb7 [BE] Don't use six in torch.utils.tensorboard (#93383)
As PyTorch is Python-3.8+ project only

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93383
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/ZainRizvi
2023-02-01 00:22:23 +00:00
218d4eac56 Remove submission form (#93287)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93287
Approved by: https://github.com/orionr
2023-01-31 23:41:16 +00:00
8dfcb59d66 Update version of Python to 3.8 in the prerequisites (#93399)
With support for Python 3.7 being deprecated, update the prerequisites to list Python 3.8 or later.

Fixes #93256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93399
Approved by: https://github.com/atalman, https://github.com/Skylion007
2023-01-31 23:38:19 +00:00
129a1bc715 Minor error in docs regarding execution time (#93258)
The previous sentence seemed to imply that sparse may not always be helpful, i.e., your execution time may increase when using sparse, but the docs said otherwise.

A simple re-ordering of two words in the documentation better aligns it with the intended meaning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93258
Approved by: https://github.com/cpuhrsch
2023-01-31 23:32:42 +00:00
7d7c4d9c1f [inductor] Minor fix of addmm shape padding (#93320)
Summary: Minor fix of addmm shape padding

Test Plan: CI

Differential Revision: D42855212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93320
Approved by: https://github.com/jansel
2023-01-31 23:21:22 +00:00
b179a097ea Add platform markers for linux x86_64 only extra_install_requires (#93066)
Like #89924 #91083

#85097 added new extra dependencies on nvidia-*. They are linux x86_64 (GPU) only packages, but were not marked as such, causing issues installing pytorch 1.13 via Poetry (and possibly other tools that follow PyPI's metadata API) on Linux aarch64 systems. This "fixes" the issue by adding the `and platform_machine == 'x86_64'` marker on these dependencies.
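For illustration, a PEP 508 marker of the shape described above (not the literal setup.py contents; the package name is just one of the nvidia-* dependencies):

```python
extra_install_requires = [
    "nvidia-cuda-runtime-cu11; platform_system == 'Linux' and platform_machine == 'x86_64'",
]
print(extra_install_requires[0])
```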

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93066
Approved by: https://github.com/malfet
2023-01-31 22:23:51 +00:00
18c6ca1ee1 Add release matrix to release.md (#93392)
Add Release Compatibility Matrix
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93392
Approved by: https://github.com/weiwangmeta, https://github.com/albanD, https://github.com/seemethere
2023-01-31 21:28:02 +00:00
902b4dba75 Change capture_scalar_outputs to use SymInt/SymFloat rather than Tensor to model scalars (#93150)
Previously, Dynamo faked support for item() when `capture_scalar_outputs` was True by representing it internally as a Tensor. With dynamic shapes, this is no longer necessary; we can represent it directly as a SymInt/SymFloat. Do so. Doing this requires you to use dynamic shapes; in principle we could support scalar outputs WITHOUT dynamic shapes but I won't do this unless someone hollers for it.
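A minimal sketch of the usage pattern this enables (assuming dynamic shapes are turned on and the existing `capture_scalar_outputs` config flag is set):

```python
import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(backend="eager", dynamic=True)
def f(x):
    n = x.sum().item()  # now traced as a SymFloat rather than a faked Tensor
    return x + n

print(f(torch.ones(4)))
```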

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D42885775](https://our.internmc.facebook.com/intern/diff/D42885775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93150
Approved by: https://github.com/voznesenskym
2023-01-31 21:23:23 +00:00
76b683b008 Correctly propagate compiler kwargs to aot minifier (#93308)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93308
Approved by: https://github.com/Chillee, https://github.com/voznesenskym
2023-01-31 20:25:27 +00:00
295fd20eb5 [CI] Add Python-3.11 Linux conda builds (#93186)
This PR is almost a no-op, as most of the logic resides in the builder repo, namely:
6342242c50
8f361d91e1

Remove `conda-forge` channel dependency for test job, but add `malfet` channel for 3.11 testing (as numpy is not in default channel yet)
Build and upload following dependencies to `pytorch-nightly` channel:
```
anaconda copy --to-owner pytorch-nightly malfet/numpy/1.23.5
anaconda copy --to-owner pytorch-nightly malfet/numpy-base/1.23.5
anaconda copy --to-owner pytorch-nightly malfet/mkl-service/2.4.0
anaconda copy --to-owner pytorch-nightly malfet/mkl_random/1.2.2
anaconda copy --to-owner pytorch-nightly malfet/mkl_fft/1.3.1

anaconda copy --to-owner pytorch-nightly malfet/sympy/1.11.1
anaconda copy --to-owner pytorch-nightly malfet/mpmath/1.2.1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93186
Approved by: https://github.com/atalman, https://github.com/ZainRizvi
2023-01-31 20:24:03 +00:00
811e95a15e --dynamic-ci-skips now works for all backends (#93369)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93369
Approved by: https://github.com/albanD
2023-01-31 20:07:58 +00:00
4d504a9ce8 Fix Windows python3 path (#93387)
If a Windows runner is re-used, python3 should have already been setup.  We will just need to make it available in `GITHUB_PATH`, so subsequent actions can use it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93387
Approved by: https://github.com/clee2000, https://github.com/malfet, https://github.com/seemethere
2023-01-31 19:52:30 +00:00
2a31c3589b Report suppressed exception in minifier (#93368)
Suppressing exceptions is bad!  If you're debugging PyTorch itself
you want to see the exception so you can do something about it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93368
Approved by: https://github.com/Skylion007, https://github.com/mlazos, https://github.com/bdhirsh
2023-01-31 19:31:50 +00:00
e5235fb62c Convert GuardOnDataDependentSymNode into graph break (#93373)
Extracted from https://github.com/pytorch/pytorch/pull/93150 because
I need it earlier in trunk.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93373
Approved by: https://github.com/Skylion007
2023-01-31 19:31:44 +00:00
44a948c820 Fix MSVC compiler error in basic_ops.h (#93322)
https://github.com/pytorch/pytorch/pull/93069 introduces a compiler error in some internal Windows builds using MSVC:

```
stderr: d:\full-fbsource\xplat\caffe2\torch\csrc\autograd\functions\basic_ops.h(43): fatal error C1001: An internal error has occurred in the compiler.
```
This may be related to older versions of MSVC not recognizing the `[[maybe_unused]]` attribute: https://developercommunity.visualstudio.com/t/compiler-bug-on-parsing-maybe-unused-in-range-base/209488. This PR reverts the changes in `basic_ops.h`, which resolves those errors.

Verified this fixes the internal jobs, and landed as [D42854205](https://www.internalfb.com/diff/D42854205).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93322
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-01-31 19:14:48 +00:00
5b2afaaca8 Fix Vulkan compiling issues on Windows (#92207)
PR based on #61431
Fix USE_VULKAN=1 and USE_VULKAN_WRAPPER=0 not compiling on Windows.
Change designated initializers since they require C++20.
Rename Hasher typename since it's not compiling due to https://developercommunity.visualstudio.com/t/1397858

Fixes #59519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92207
Approved by: https://github.com/ezyang
2023-01-31 18:58:15 +00:00
438f12d91a Rewrite some decomps to allow producing aten ops (#93099)
This introduces a new stop to the decomposition train.
Before reaching prims.view_of, it will stop at aten.alias. Export path wants to get off the train at aten ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93099
Approved by: https://github.com/ngimel
2023-01-31 17:46:20 +00:00
332d55d3df [Dynamo] UserDefinedClassVariable supports python type (#93310)
Fixes #93260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93310
Approved by: https://github.com/mlazos
2023-01-31 17:41:51 +00:00
7b426e8da2 Remove fake tensor cache clearing in dynamo (#93304)
Summary: We originally cleared the cache of the converter to avoid memory leaks; now that the cache uses a weak map this is no longer necessary. Clearing of the cache caused an error in an interaction with the minifier because the minifier uses delayed compilation, so the cleanup had occurred before inductor was invoked.

Test Plan: Memory regression is being checked via dashboard and on master.

Differential Revision: D42858624

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93304
Approved by: https://github.com/ezyang
2023-01-31 17:40:15 +00:00
cfff440614 [inductor] Lower fallback kernel warnings from WARNING to INFO (#93330)
Summary:
These are useful to us as developers, or maybe folks working really
closely with us, but they seem kind of unnecessarily alarming to others, even
ML/Torch experts.  E.g.: https://github.com/karpathy/nanoGPT/pull/102

Test Plan: debate

Differential Revision: D42876146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93330
Approved by: https://github.com/soumith, https://github.com/jansel
2023-01-31 17:34:17 +00:00
46c05a7ae3 [ez] Update base branch when updating python docs (#93305)
Every now and then, the python docs push will fail because the base branch (pytorchbot/base) is too old and accumulates commits that might cause the cla check to fail.  Pushing to the base branch will prevent it from being old.

The site branch cannot be used because the following push to site will cause the pr to be closed, preventing us from getting the cla check the next day, which is what happened to https://github.com/pytorch/pytorch.github.io/pull/1157 when I was trying to figure this out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93305
Approved by: https://github.com/huydhn
2023-01-31 17:29:16 +00:00
d72db37c4a Remove a redundant check from code. (#93025)
In file: combinatorics.py, the comparison of Collection length creates a logical short circuit.

   if isinstance(self.sampler, Sized) and len(self.sampler) >= 0:

Here, the right side of the comparison always evaluates to true.

I suggest removing the collection length check since it is redundant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93025
Approved by: https://github.com/albanD
2023-01-31 16:45:32 +00:00
bb6af061a0 torch.triangular_solve for CSR: materialize diagonal elements when unitriangular=True. (#93352)
Fixes https://github.com/pytorch/pytorch/issues/88890

A temporary fix until MKL is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93352
Approved by: https://github.com/cpuhrsch
2023-01-31 16:33:57 +00:00
d9117b93fb unsqueeze only when dim = 3 (#91052)
`unsqueeze` is not necessary if `view` is used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91052
Approved by: https://github.com/albanD
2023-01-31 16:28:23 +00:00
bd4a5b400a [Re-open 90266] [inductor] weight prepack for _convolution_transpose_pointwise (#91955)
Re-open https://github.com/pytorch/pytorch/pull/90266 since earlier pr on that stack got reverted.
Depend on internal ideep upgrade.
[Update]: internal ideep upgrade issue is resolved in https://github.com/pytorch/pytorch/pull/92239.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91955
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-01-31 13:28:57 +00:00
cc49f5abd3 [Re-land 90265] [inductor] add conv_transpose2d unary fusion for cpu in inference mode (#91954)
Re-land https://github.com/pytorch/pytorch/pull/90265.
Depend on internal ideep upgrade.
[Update]: internal ideep upgrade issue is resolved in https://github.com/pytorch/pytorch/pull/92239.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91954
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-01-31 13:17:53 +00:00
3870fdabfb [Re-land 90264] add conv_transpose2d pointwise(unary) fusion kernel (#91953)
Re-land https://github.com/pytorch/pytorch/pull/90264.
Depend on internal ideep upgrade.
[Update]: internal ideep upgrade issue is resolved in https://github.com/pytorch/pytorch/pull/92239.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91953
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-01-31 12:58:05 +00:00
fba13d94a1 Remove deprecated torch.symeig (#70988)
The time has come to remove deprecated linear algebra related functions. This PR removes `torch.symeig`.

- [x] XLA PR: https://github.com/pytorch/xla/pull/4498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70988
Approved by: https://github.com/lezcano, https://github.com/kit1980, https://github.com/malfet
2023-01-31 11:59:11 +00:00
ec2461bbd8 Remove proxy tensor's check for data dependent output (#93265)
We'll rely on the underlying fake tensor to raise an error in these cases.  We only raise the error if there is an input to the data dependent operation that is a real tensor (and thus we are at risk of accidentally burning in real values)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93265
Approved by: https://github.com/albanD
2023-01-31 11:58:49 +00:00
d7a3f2128f pass None instead of False inside Adam.__setstate__ (#93289)
with a061f139dc, `fused`'s type hint is `Optional[bool]` and its default value is `None`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93289
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2023-01-31 09:41:35 +00:00
af5b01294e [Dynamo] Fix bug if module calls module with static forward function (#93299)
Fix a regression I found in the 14k GitHub models run (10+ models started failing today); it's caused by #93115.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93299
Approved by: https://github.com/williamwen42
2023-01-31 06:16:33 +00:00
91a4947e28 Populate extern_kernels on import (#93282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93282
Approved by: https://github.com/ngimel
2023-01-31 04:52:10 +00:00
8c09a005c5 [inductor] Pattern matching engine (copy) (#93291)
This is an exact duplicate of https://github.com/pytorch/pytorch/pull/90739

The fbcode workflow for landing that diff seems buggy.  The github-export-checks task is failing with credentials errors.  Plan to try to land it using GH1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93291
Approved by: https://github.com/desertfire
2023-01-31 04:51:00 +00:00
aee5f84ac3 [c++] use constexpr instead of const (#93267)
As discussed in https://github.com/pytorch/pytorch/pull/93199#discussion_r1089777684.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93267
Approved by: https://github.com/Skylion007
2023-01-31 04:33:22 +00:00
f9c08e25a1 Fix MacOS nightly builds (#93331)
By setting python_desired version to 3.8

Test Plan: Add `ciflow/binaries_libtorch` and see what will happen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93331
Approved by: https://github.com/huydhn
2023-01-31 04:31:28 +00:00
888771dc5d [FSDP][optim_state_dict] Fix _is_named_optimizer when the state is empty (#93303)
Optimizer state is not eagerly initialized -- only NamedOptimizer and KeyedOptimizer initialize it eagerly. This PR makes `_is_named_optimizer` work with regular optimizers.
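A heavily simplified sketch of the idea (hypothetical, not the actual FSDP code): when "state" is still empty because of lazy init, fall back to inspecting param_groups instead of assuming "state" has entries.

```python
def _is_named_optimizer_sketch(osd: dict) -> bool:
    state = osd.get("state", {})
    if state:
        # NamedOptimizer/KeyedOptimizer key state by FQN strings
        return isinstance(next(iter(state)), str)
    param_groups = osd.get("param_groups", [])
    return any(
        isinstance(p, str) for group in param_groups for p in group.get("params", [])
    )

# A regular optimizer with lazily-initialized (empty) state no longer breaks the check:
print(_is_named_optimizer_sketch({"state": {}, "param_groups": [{"params": [0, 1]}]}))
```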

Differential Revision: [D42858589](https://our.internmc.facebook.com/intern/diff/D42858589/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93303
Approved by: https://github.com/fduwjj
2023-01-31 03:49:26 +00:00
441b09d1b7 [CI][ez] Rename some jobs (#93327)
Periodic debug builds are actually running against Python 3.10.

Remove the Python version specifier from libtorch builds, as it is
irrelevant (libtorch is a C++-only build, so the Python version should
not matter).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93327
Approved by: https://github.com/kit1980
2023-01-31 03:02:30 +00:00
524ee07143 Fix https://github.com/pytorch/pytorch/issues/92377 (#92379)
Fixes #92377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92379
Approved by: https://github.com/Chillee
2023-01-31 02:22:16 +00:00
782b9a9cde Use _exchange_device to reduce torch.cuda.device overhead (#91127)
This must wait for the forward compatibility period since it requires the
`cuda::_exchange_device` primitive for TorchScript. Also since TorchScript
doesn't support inheritance, we can't just inherit from `_DeviceGuard` here.

This saves around 2 us per `with` statement.
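For context, the construct whose per-entry overhead this reduces (usage is unchanged):

```python
import torch

if torch.cuda.is_available():
    with torch.cuda.device(0):  # entering/exiting this is what gets cheaper
        x = torch.ones(3, device="cuda")
        print(x.device)
```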
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91127
Approved by: https://github.com/ngimel
2023-01-31 01:56:40 +00:00
fc4e9931da [fx.GraphModule] Populate memo in deepcopy BEFORE copying children. (#93295)
Summary:
Apparently, if we don't, we might at some point lose fields when the submodules have circular references.
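A generic illustration of the deepcopy-memo ordering issue (plain Python objects, not GraphModule itself): registering the new object in `memo` before recursing lets circular references resolve to the same copy.

```python
import copy

class Node:
    def __init__(self):
        self.parent = None
        self.children = []

    def __deepcopy__(self, memo):
        new = Node.__new__(Node)
        memo[id(self)] = new  # populate memo BEFORE copying children
        new.parent = copy.deepcopy(self.parent, memo)
        new.children = copy.deepcopy(self.children, memo)
        return new

root = Node()
child = Node()
child.parent = root
root.children.append(child)
dup = copy.deepcopy(root)
assert dup.children[0].parent is dup  # circular reference preserved
```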


Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93295
Approved by: https://github.com/jerryzh168
2023-01-31 01:45:35 +00:00
21c7c7c72f [Quant] Use the true src zero point to query and create conv pd (#90818)
**Summary**
Previously, we used `DNNL_RUNTIME_S32_VAL` as the `zero point` for `src` in both weight prepack and convolution forward to ensure the same weight block format is used. The problem is that `DNNL_RUNTIME_S32_VAL` may query a different weight block format than the true `zero point` for `src` would, which sends oneDNN convolution down the `jit` path instead of the `brgconv` path. Here we use the true `zero point` for `src` to create the primitive descriptor, and reorder the weight if its block format differs from the one generated by weight prepack.

**Test Plan**
```
python -m pytest quantization/core/test_quantized_op.py::TestQuantizedConv::test_conv_transpose_reorder_issue_onednn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90818
Approved by: https://github.com/Xia-Weiwen, https://github.com/jgong5, https://github.com/jerryzh168
2023-01-31 01:23:41 +00:00
a71d9a928f [Quant] Add fused conv2d_add_relu op for onednn backend (#90364)
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused conv2d_add_relu op for the onednn backend, which will be used for int8 inference with the onednn backend. This op cannot be called with other quantization backends; otherwise an error is thrown.

**Test Plan**
```
python -m pytest test_quantization.py::TestQuantizedConv
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90364
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-01-31 01:20:50 +00:00
01687a6bad Revert "add numpy typing plugin to mypy config (#92930)"
This reverts commit 5f1ac188f8dd01a81d0ddeebdbc4d22e25311b72.

Reverted https://github.com/pytorch/pytorch/pull/92930 on behalf of https://github.com/clee2000 due to causing test_doc_examples (main.TestTypeHints) to fail https://github.com/pytorch/pytorch/actions/runs/4049393005/jobs/6965869223 5f1ac188f8, note for revert review: PR was forced merged after first failure, which was flaky
2023-01-31 01:13:01 +00:00
1a454310b9 Update SECURITY.MD (#93313)
To recommend reporting issues via advisories

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93313
Approved by: https://github.com/atalman, https://github.com/seemethere
2023-01-31 00:36:47 +00:00
aeac7f4203 [bazel] Fix gloo.BUILD (#92858)
After the recent gloo submodule bump, bazel build that uses gloo needs a slight update.

Tested that now I was able to build :torch with gloo (on our internal build)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92858
Approved by: https://github.com/dagitses, https://github.com/malfet
2023-01-31 00:22:28 +00:00
5f1ac188f8 add numpy typing plugin to mypy config (#92930)
This added the numpy typing plugin to mypy config so that we could
use it for DeviceMesh typing annotations

Please see https://github.com/pytorch/pytorch/pull/92931 about why we need this. For example, we are currently saving the DeviceMesh's mesh field as torch.Tensor, where when we do sth like:
```python
with FakeTensorMode():
    device_mesh = DeviceMesh("cuda", torch.arange(4))
```
It would throw error because FakeTensorMode or any TorchDispatchMode tracks every tensor creation and interactions. While DeviceMesh just want to save a nd-array to record the mesh topology, and would like to avoid the interaction with subsystems like FakeTensor, so we want to support saving `mesh` as numpy array instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92930
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-01-31 00:13:12 +00:00
2a6e085704 Update custom backend docs (#92721)
Title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92721
Approved by: https://github.com/jansel
2023-01-30 23:54:49 +00:00
c499e760f5 [XNNPACK] Enable Memopt for OSS (#93097)
Summary:
D38543798

Enabled Memopt previously to fix a bug with memory planner

Mirroring the changes we made Internally to OSS

Test Plan: OSS CI

Reviewed By: digantdesai

Differential Revision: D42782958

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93097
Approved by: https://github.com/digantdesai
2023-01-30 23:36:41 +00:00
24b501903c Minor sympy usage fix in fbcode (#93171)
Summary: To support older versions of sympy.

Test Plan:
```
buck2 run @//mode/opt @//mode/inplace -c python.package_style=inplace -c fbcode.enable_gpu_sections=true //caffe2/benchmarks/dynamo:torchbench -- -dcuda --performance --inductor --only hf_T5
```

Differential Revision: D42812188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93171
Approved by: https://github.com/eellison
2023-01-30 23:34:22 +00:00
36fe31f537 [Reland] Refactor stack_trace preservation for node meta preservation (#90803) (#92400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90803
Approved by: https://github.com/jerryzh168, https://github.com/albanD
ghstack-source-id: 5848cca08ef5d6f8868f4f79d8bc29711e9a52c2

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92400
Approved by: https://github.com/jerryzh168
2023-01-30 23:30:43 +00:00
1fa68d40b8 [pytorch] fix backend_type for backend/PG plugin (#93129)
Summary: For backend/PG plugin, use `ProcessGroup.BackendType.CUSTOM` to avoid uninitialized variable during `pg._register_backend` later

Test Plan: CI/CD and internal tests

Differential Revision: D42793222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93129
Approved by: https://github.com/H-Huang
2023-01-30 23:16:08 +00:00
2e9107ec1e [Pytorch][Executorch] Handwritten view copy out ops should resize out (#91194)
Summary: Handwritten out ops should have feature parity with the codegen'd ones. This means they should resize `out` to the appropriate size. Q: Why are these handwritten instead of codegen'd anyway? Q2: Where's a good spot to put the resize and copy helpers, since they are reused in the codegen'd out kernels?
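A generic sketch of the out-variant contract being referred to (illustrative Python, not the Executorch C++ kernels): the op resizes `out` to the result shape before writing.

```python
import torch

def view_copy_out(self: torch.Tensor, size, *, out: torch.Tensor) -> torch.Tensor:
    result = self.reshape(size)
    out.resize_(result.shape)  # resize out to the appropriate size first
    out.copy_(result)
    return out

out = torch.empty(0)
view_copy_out(torch.arange(6.0), (2, 3), out=out)
print(out.shape)  # torch.Size([2, 3])
```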

Test Plan: ci.

Differential Revision: D42177051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91194
Approved by: https://github.com/ezyang
2023-01-30 23:07:14 +00:00
7dabb8b53b [vulkan] Enable command buffer reuse and add keys to Tensor/StorageBuffer objects (#92993)
Differential Revision: [D42614180](https://our.internmc.facebook.com/intern/diff/D42614180/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92993
Approved by: https://github.com/salilsdesai
2023-01-30 23:03:07 +00:00
ae79f95cb8 [quant][fx][pt2e][refactor] Refactor prepare.py for upcoming quantize_pt2e changes (#92641)
Summary:
Changes node.meta["target_dtype_info"] to store observer/fake_quant constructors instead of (dtype, is_dynamic),
so that in the future users can configure this themselves. Follow-up refactors:
(1) Generalize the structure of "target_dtype_info": right now we have "input_act_obs_or_fq_ctr", "weight_obs_or_fq_ctr", "bias_obs_or_fq_ctr", "output_obs_or_fq_ctr".
This works OK for current use cases, and users use a separate config to specify which input is the weight and which input is the bias. To generalize it,
we should expose an API that allows users to specify a dictionary from input_index to obs_or_fq_ctr and from output_index to obs_or_fq_ctr:
e.g. out1, (out2, out3) = op(arg0, (arg1, arg2))
"input_act_obs_or_fq_ctr" = {0: obs1, 1: obs2}
"output_act_obs_or_fq_ctr" = {0: obs3, 1: obs4}
note that this would not allow configuring obs/fq for nested structures

or have a config that mimics the structure of arguments and output, e.g. out1, (out2, out3) = op(arg0, (arg1, arg2)), we can have
"input_act_obs_or_fq_ctr" = (obs1, (obs2, obs3))
"output_act_obs_or_fq_ctr" = (obs4, (obs5, obs6))

(2) Use these observers/fake-quants directly for inserting observers instead of using qconfig.
(3) Clean up the TODOs in the code base.

Test Plan:
python test/test_quantization.py TestQuantizeFx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92641
Approved by: https://github.com/jcaip
2023-01-30 22:57:20 +00:00
dd0ba2076a return clone in case of 1 input cat (#93294)
Fixes #93283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93294
Approved by: https://github.com/ezyang, https://github.com/eellison
2023-01-30 22:55:26 +00:00
286cca8929 Add cudnn install 8.7.0.84 for CUDA 11.8 (#93086)
Add cudnn install 8.7.0.84 for CUDA 11.8 .

Same as: https://github.com/pytorch/pytorch/pull/84964
Related to https://github.com/pytorch/builder/pull/1271
Test PR: https://github.com/pytorch/pytorch/pull/92971
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93086
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-01-30 22:53:20 +00:00
0ecb071fc4 [BE][CI] change references from .jenkins to .ci (#92624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92624
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-01-30 22:50:07 +00:00
2b267fa7f2 [inductor] Check memory compression ratio in model tests (#89305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89305
Approved by: https://github.com/weiwangmeta
2023-01-30 22:01:06 +00:00
53a669869c Remove checks for refs/prims (#93250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93250
Approved by: https://github.com/voznesenskym
2023-01-30 21:42:10 +00:00
e17bfde622 [vulkan] Create separate BUCK target for command buffer recording (#92157)
Differential Revision: [D42502843](https://our.internmc.facebook.com/intern/diff/D42502843/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D42502843/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92157
Approved by: https://github.com/salilsdesai
2023-01-30 21:34:23 +00:00
710fe40597 [Export] Introduce as_none in ex.Argument union type (#93210)
This design has two implications:
- We are **NOT** modeling nullable argument types, e.g. `Tensor?`, `int?`, `int[]?`, as a special argument type.
- Python None is treated as a special argument type; downstream executors/runtimes need to know how to handle this.

aten.convolution's schema accepts an optional input, `Tensor? bias`:
```
convolution(Tensor input, Tensor weight, Tensor? bias, int[] stride, SymInt[] padding, int[] dilation, bool transposed, SymInt[] output_padding, int groups) -> Tensor
```

Example: notice the **None** argument in the following fx.node

```
convolution_default = torch.ops.aten.convolution.default(arg0, _param_constant0, None, [2, 2], [3, 3], [1, 1], False, [0, 0], 1)
```

would be exported as
```
            Node(
                op='call_function',
                target='aten.convolution.default',
                args=[
                    Argument(as_tensor=TensorArgument(name='arg0')),
                    Argument(
                        as_tensor=TensorArgument(name='_param_constant0')
                    ),
                    Argument(as_none=True),
                    Argument(as_ints=[2, 2]),
                    Argument(as_ints=[3, 3]),
                    Argument(as_ints=[1, 1]),
                    Argument(as_bool=False),
                    Argument(as_ints=[0, 0]),
                    Argument(as_int=1)
                ],
                kwargs={},
                outputs=[
                    ReturnArgument(
                        as_tensor=TensorArgument(name='convolution_default')
                    )
                ],
                metadata='Skipped'
            ),
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93210
Approved by: https://github.com/suo
2023-01-30 21:32:49 +00:00
1d25070949 [Export] Refine design around TensorValue (renamed IValue) (#93217)
See discussion in my comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93217
Approved by: https://github.com/suo
2023-01-30 21:32:32 +00:00
845e4b8a47 [fix] legacybatching: getPhysicalDims (#93261)
Fixes #92985

Minimum Repro:
```python
import torch
from torch._vmap_internals import vmap

input = torch.randn(2, 2)

def fn(x):
    return x.sum(())

o = vmap(fn)(input)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93261
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-01-30 21:06:32 +00:00
7a621c443b [GHF] Fix ghstack branches in sync logic (#93298)
Test plan:
```python
from git_utils import are_ghstack_branches_in_sync,GitRepo
repo=GitRepo("/Users/nshulga/git/pytorch/pytorch")
are_ghstack_branches_in_sync(repo, "gh/SS-JIA/206/head")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93298
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
2023-01-30 21:00:51 +00:00
54056c1705 Update cudnn_frontend to 0.7.3 (#93272)
Updating cudnn_frontend to 0.7.3 To enable CUDNN 8.7 integration

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93272
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-01-30 20:45:00 +00:00
c516e5488e Move bazel and xla to unstable (#93296)
Fixes #ISSUE_NUMBER
Currently they are failing due to things like
```

ERROR: An error occurred during the fetch of repository 'tf_runtime':
   Traceback (most recent call last):
	File "/var/lib/jenkins/workspace/xla/third_party/tensorflow/third_party/repo.bzl", line 73, column 33, in _tf_http_archive_impl
		ctx.download_and_extract(
Error in download_and_extract: java.io.IOException: Error downloading [3367783466.tar.gz, 3367783466.tar.gz] to /home/jenkins/.cache/bazel/_bazel_jenkins/b463291cb8b07b4bfde1e3a43733cd1a/external/tf_runtime/temp17509854002229755553/3367783466dff91b8b283d61c7fe8abc9e7bbb80.tar.gz: Checksum was 4d2fc38d8b6edd1a478ea2fcb88491eeaf7378e5ffe9f4e3eb3b821df1d1c5ba but wanted 5e6bab71ce31b4b56105ac4567f8bffa5f5b3de7ad3064638297249e69375623
```
so I am moving them to unstable until we investigate and fix the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93296
Approved by: https://github.com/huydhn
2023-01-30 20:15:41 +00:00
4fc19e1a71 [optim][adam] use fastest impl whenever possible, add util (#93184)
This makes it so that ONLY when the user doesn't set anything for foreach or fused do we switch the default, cascading Adam so that we default to fused, then foreach, then single-tensor.

To clarify:
* if the user puts True in foreach _only_, it will run the foreach implementation.
* if the user puts True in fused _only_, it will run the fused implementation.
* if the user puts True in foreach AND for fused, it will run the fused implementation.

And:
* if the user puts False in foreach _only_, it will run the single tensor implementation.
* if the user puts False in fused _only_, it will still run the single tensor implementation.
* if the user puts False in foreach AND for fused, it will run the single tensor implementation.

I also didn't trust myself that much with the helper function, so I ran some local asserts on _default_to_fused_or_foreach. The only point left to really test is the `type(p) == torch.Tensor` check, but I think the distributed tests will catch that in CI.
```
cuda_only_fp_list = [
    torch.rand((1, 2), device="cuda", dtype=torch.float32),
    torch.rand((1, 2), device="cuda", dtype=torch.float64),
    torch.rand((1, 2), device="cuda", dtype=torch.float16),
    torch.rand((1, 2), device="cuda", dtype=torch.bfloat16),
]

cuda_only_int_list = [
    torch.randint(1024, (1, 2), device="cuda", dtype=torch.int64),
]

cpu_list = [
    torch.rand((1, 2), device="cpu", dtype=torch.float32),
    torch.rand((1, 2), device="cpu", dtype=torch.float64),
    torch.rand((1, 2), device="cpu", dtype=torch.float16),
]

none_list = [None]

# differentiable should always make it return false for both
assert _default_to_fused_or_foreach([cuda_only_fp_list], True, True) == (False, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list], True, False) == (False, False)

# cpu lists should always make it return false for both
assert _default_to_fused_or_foreach([cuda_only_fp_list, cpu_list], False, True) == (False, False)
assert _default_to_fused_or_foreach([cpu_list], False, True) == (False, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list, cpu_list], False, False) == (False, False)
assert _default_to_fused_or_foreach([cpu_list], False, False) == (False, False)

# has fused triggers correctly
assert _default_to_fused_or_foreach([cuda_only_fp_list], False, True) == (True, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list], False, False) == (False, True)

# ints always goes to foreach
assert _default_to_fused_or_foreach([cuda_only_fp_list, cuda_only_int_list], False, True) == (False, True)
assert _default_to_fused_or_foreach([cuda_only_fp_list, cuda_only_int_list], False, False) == (False, True)

# Nones don't error
assert _default_to_fused_or_foreach([cuda_only_fp_list, none_list], False, True) == (True, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list, cuda_only_int_list, none_list], False, True) == (False, True)
assert _default_to_fused_or_foreach([none_list], False, True) == (True, False)
assert _default_to_fused_or_foreach([none_list], False, False) == (False, True)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93184
Approved by: https://github.com/albanD
2023-01-30 19:58:55 +00:00
efee879695 Don't suppress warnings in CI. (#93269)
Warnings are an important clue that something bad is going on.
You want to see them in logs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93269
Approved by: https://github.com/voznesenskym
2023-01-30 19:21:09 +00:00
5d9902cbcd Beef up error when converting sympy expr to int/float/bool fails (#93198)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93198
Approved by: https://github.com/albanD
2023-01-30 18:35:52 +00:00
2fc73622f8 [jit] Support Awaitable type (#90863)
We want to make TorchRec sharded models TorchScriptable.

TorchRec sharded models use the generic types Awaitable[W] and LazyAwaitable[W] (https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/types.py#L212).
In a sharded model those types are used instead of the contained type W, carrying an initialization function that produces an object of type W.

At the moment the first attribute of W is requested, `LazyAwaitable[W]` will call its initialization function (on the same stack), cache the result inside, and work transparently as an object of W. So we can think of it as delayed object initialization.

To support this behavior in TorchScript - we propose a new type to TorchScript - `Await`.
In eager mode it works the same as `LazyAwaitable[W]` in TorchRec, being dynamically typed - acting as a type `W` while it is `Await[W]`.

Within TorchScript it is `Await[W]` and can only be explicitly converted to W using the special function `torch.jit.awaitable_wait(aw)`.
Creation of an `Await[W]` is done via another special function, `torch.jit.awaitable(func, *args)`.

The semantic is close to `torch.jit.Future`, fork, wait and uses the same jit mechanics (inline fork Closures) with the difference that it does not start this function in parallel on fork. It only stores as a lambda inside IValue that will be called on the same thread when `torch.jit.awaitable_wait` is called.

For example (more examples in this PR `test/jit/test_await.py`)
```
      def delayed(z: int) -> int:
          return z * 3

      @torch.jit.script
      def fn(x: Tensor):
          aw: Await[int] = torch.jit._awaitable(delayed, 99)
          a = torch.eye(2)
          b = torch.jit._awaitable_wait(aw)
          return a + b + x
```

Function semantics:

`_awaitable(func -> Callable[Tuple[...], W], *args, **kwargs) -> Await[W]`

Creates an Await object that owns args and kwargs. On the first `_awaitable_wait` call it executes `func` and owns the result. Subsequent `_awaitable_wait` calls return the result of that first call.

`_awaitable_wait(Await[W]) -> W`
Returns the cached result of type W if this is not the first `_awaitable_wait` call on this Await object; otherwise calls the stored function and returns its result.

`_awaitable_nowait(W) -> Await[W]`

Creates a trivial Await[W] wrapper around the specified object, to stay type compliant in corner cases.
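
For illustration, a minimal eager-mode sketch of these semantics (the helper `make_tensor` and its values are made up, not taken from the PR's tests):

```
import torch

def make_tensor(n: int) -> torch.Tensor:
    return torch.ones(n) * 2

aw = torch.jit._awaitable(make_tensor, 3)   # nothing is executed yet
t1 = torch.jit._awaitable_wait(aw)          # runs make_tensor(3) on this thread and caches the result
t2 = torch.jit._awaitable_wait(aw)          # returns the cached result, no second call
assert torch.equal(t1, t2)
```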

Differential Revision: [D42502706](https://our.internmc.facebook.com/intern/diff/D42502706)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90863
Approved by: https://github.com/davidberard98
2023-01-30 17:38:59 +00:00
53f7fb9a22 Add CSC->BSC conversion (#92307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92307
Approved by: https://github.com/cpuhrsch
2023-01-30 17:03:36 +00:00
434eb16deb Correctly restore pybind11 error_already_set (#93238)
We would handle py::error_already_set correctly from pybind11 bindings,
but not from our regular TH bindings, which meant that anything from
an inner pybind11 function call was getting unconditionally transformed
into a RuntimeError.  Not too many cases where we do this, but
PySymNodeImpl was one of them.

To test this, I need to raise a non-RuntimeError from a function which
is invoked from pybind11 and then propagated to a non-pybind11 call
site.  I introduce GuardOnDataDependentSymNode for expressly this
purpose (this is how I discovered the bug anyway.)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93238
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-01-30 16:43:01 +00:00
3e4d0e8d82 [Reland][FSDP] Do not clean FQNs for use_orig_params=True (#92662)
The last PR (https://github.com/pytorch/pytorch/pull/91767/) had a land race relating to `_NamedOptimizer` + FSDP and got reverted. This is a re-land.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92662
Approved by: https://github.com/rohan-varma
2023-01-30 16:07:44 +00:00
c7b03010ec Split the aot/dynamo TORCHDYNAMO_REPRO_AFTER cases (#93226)
I often copy-paste this line, and it is annoying to have to modify
the inside to select aot/dynamo.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93226
Approved by: https://github.com/desertfire
2023-01-30 14:23:16 +00:00
9eb402d18e Update dynamic benchmark skips (#93228)
Data from https://github.com/pytorch/pytorch/pull/93223

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93228
Approved by: https://github.com/desertfire
2023-01-30 14:22:53 +00:00
04082fc042 [inductor] enable more dynamic shapes tests (#93216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93216
Approved by: https://github.com/ezyang
2023-01-30 09:05:45 +00:00
5112f44dc4 Add vmap support for torch.index_fill (#91364)
Fixes #91177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91364
Approved by: https://github.com/zou3519
2023-01-30 08:08:33 +00:00
08035b1eb9 inductor: support more conv+unary fusion (#92518)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92518
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-30 07:21:50 +00:00
cyy
4d51c8532c Some simple fixes (#93221)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93221
Approved by: https://github.com/Skylion007
2023-01-30 05:14:03 +00:00
e790281a85 SymInt'ify view_as (#93242)
Follow up to #93241
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93242
Approved by: https://github.com/ezyang
2023-01-30 01:56:50 +00:00
3c570a2be3 SymInt'ify reshape_as (#93241)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93241
Approved by: https://github.com/Skylion007
2023-01-30 01:46:16 +00:00
0247ed27cc Apply Clang-Tidy readability-container-size-empty (#93236)
Not only is this change usually shorter and more readable, it can also yield better performance: size() is not always a constant-time operation (e.g. on linked lists), whereas empty() always is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
2023-01-29 23:28:19 +00:00
239afa0e43 Revert accidental change to libkineto version (#93237)
Introduced by https://github.com/pytorch/pytorch/pull/93155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93237
Approved by: https://github.com/Skylion007
2023-01-29 23:14:14 +00:00
b3e422948d [Dynamo] Support out variants of ops mutate the tensors out of the function frame (#93177)
Fixes #93136

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93177
Approved by: https://github.com/jansel
2023-01-29 22:22:58 +00:00
129f136179 Move Sherlock to snooping dynamic shapes (#93239)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93239
Approved by: https://github.com/kit1980
2023-01-29 20:22:56 +00:00
5976f0bdfe Set min supported Python version to 3.8 (#93155)
Also, grep for `if sys.version_info .cond. (3, 8)` checks and replace them with the appropriate action.

This is the last in a series of PRs that moved CI/CD away from testing PyTorch behavior against Python-3.7.

Fixes https://github.com/pytorch/pytorch/issues/80513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93155
Approved by: https://github.com/huydhn
2023-01-29 18:28:46 +00:00
0dceaf07cd Add two decomps for optimizer fusion (#93193)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93193
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-01-29 10:36:43 +00:00
878f4f09d2 Warn about deprecation of private decoder builtins (#93181)
Summary: Warn about deprecation of private decoder builtins

Test Plan: sandcastle & github CI

Differential Revision: D42816960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93181
Approved by: https://github.com/drisspg
2023-01-29 09:34:20 +00:00
304d8dd6c8 [Dynamo] Support enum.Enum type as dict key (#93026)
Fixes a Meta-internal use case of using the `enum.Enum` type as a dict key; please refer to the added test case for details.
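
For illustration, a small sketch of the pattern this enables (the enum, dict, and function below are hypothetical, not the added test case):

```
import enum
import torch
import torch._dynamo

class Color(enum.Enum):
    RED = 0
    BLUE = 1

scales = {Color.RED: 1.0, Color.BLUE: 2.0}

def f(x, c):
    # dict lookup keyed on an enum.Enum member
    return x * scales[c]

opt_f = torch._dynamo.optimize("eager")(f)
print(opt_f(torch.randn(2), Color.RED))
```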

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93026
Approved by: https://github.com/mlazos
2023-01-29 06:37:10 +00:00
9a2becf60a inductor: fix inplace op's wrong lowering issue when preop is NopKernel (#92247)
For TIMM ghostnet_100, there is such a case, concat + in-place add:

```
import torch
from torch._inductor import config
config.debug = True
torch._dynamo.config.verbose=True

class MockModule(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, y, z):
        out = torch.cat([x, y], dim=1)
        out+=z
        return out

mod = MockModule().eval()
inputs = (
                torch.randn([1, 64, 16, 16]),
                torch.randn([1, 64, 16, 16]),
                torch.randn([1, 128, 16, 16]),
            )
ref = mod(*inputs)

with torch.no_grad():
    opt_model = torch._dynamo.optimize('inductor')(mod)
    out = opt_model(*inputs)
    out = opt_model(*inputs)
    out = opt_model(*inputs)
print(torch.equal(ref, out))
```

the inductor always gets a wrong result; I find that inductor generates wrong code:

```

from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       const float* __restrict__ in_ptr2,
                       const float* __restrict__ in_ptr3,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1,
                       float* __restrict__ out_ptr2)
{
    {
        for(long i0=0; i0<1024; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
            tmp0.store(out_ptr0 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=16384; i0<16384; i0+=1)
        {
            auto tmp0 = in_ptr0[i0];
            out_ptr0[i0] = tmp0;
        }
    }
    {
        for(long i0=0; i0<1024; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
            tmp0.store(out_ptr1 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=16384; i0<16384; i0+=1)
        {
            auto tmp0 = in_ptr1[i0];
            out_ptr1[i0] = tmp0;
        }
    }
    {
        for(long i0=0; i0<2048; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr2 + 16*i0);
            auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr3 + 16*i0);
            auto tmp2 = tmp0 + tmp1;
            tmp2.store(out_ptr2 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=32768; i0<32768; i0+=1)
        {
            auto tmp0 = in_ptr2[i0];
            auto tmp1 = in_ptr3[i0];
            auto tmp2 = tmp0 + tmp1;
            out_ptr2[i0] = tmp2;
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1 = args
    args.clear()
    buf3 = empty_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    buf0 = as_strided(buf3, (1, 64, 16, 16), (32768, 256, 16, 1))  # alias
    buf1 = as_strided(buf3, (1, 64, 16, 16), (32768, 256, 16, 1), 16384)  # alias
    buf2 = empty_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(arg1_1.data_ptr()), c_void_p(buf2.data_ptr()), c_void_p(arg2_1.data_ptr()), c_void_p(buf0.data_ptr()), c_void_p(buf1.data_ptr()), c_void_p(buf3.data_ptr()))
    del arg0_1
    del arg1_1
    del arg2_1
    return (buf3, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((1, 64, 16, 16), (16384, 256, 16, 1), device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided((1, 64, 16, 16), (16384, 256, 16, 1), device='cpu', dtype=torch.float32)
    arg2_1 = rand_strided((1, 128, 16, 16), (32768, 256, 16, 1), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1, arg1_1, arg2_1]))

```
You can see that the add operation always adds a random value; see the IR code:

1. **ir_pre_fusion.txt**
```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf3']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf3']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,))]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store

```
2. **ir_post_fusion.txt**
```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf3']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf3']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,))]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store
```

From the IR code, you can see that buf3 always adds an empty buf2 which is never written. The root cause is a potential issue when doing the mutation for an in-place add whose input is a NopKernel.

After this PR, the IR looks like this (**ir_pre_fusion.txt**):

```
buf0: SchedulerNode(ComputedBuffer)
buf0.writes = [MemoryDep(name='buf0', index=c0, size=(16384,))]
buf0.unmet_dependencies = []
buf0.met_dependencies = [MemoryDep(name='arg0_1', index=c0, size=(16384,))]
buf0.group.device = cpu
buf0.group.iteration = ((16384,), ())
buf0.sizes = ([16384], [])
buf0.aliases = ['buf2']
class buf0_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf0', get_index_1, load, None)
        return store

buf1: SchedulerNode(ComputedBuffer)
buf1.writes = [MemoryDep(name='buf1', index=c0, size=(16384,))]
buf1.unmet_dependencies = []
buf1.met_dependencies = [MemoryDep(name='arg1_1', index=c0, size=(16384,))]
buf1.group.device = cpu
buf1.group.iteration = ((16384,), ())
buf1.sizes = ([16384], [])
buf1.aliases = ['buf2']
class buf1_loop_body:
    var_ranges = {z0: 16384}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg1_1', get_index)
        get_index_1 = self.get_index('index0')
        store = ops.store('buf1', get_index_1, load, None)
        return store

buf2: NopKernelSchedulerNode(ConcatKernel)
buf2.writes = [StarDep(name='buf2')]
buf2.unmet_dependencies = [StarDep(name='buf0'), StarDep(name='buf1')]
buf2.met_dependencies = []

buf3: SchedulerNode(ComputedBuffer)
buf3.writes = [MemoryDep(name='buf3', index=c0, size=(32768,))]
buf3.unmet_dependencies = [MemoryDep(name='buf2', index=c0, size=(32768,)), StarDep(name='buf2')]
buf3.met_dependencies = [MemoryDep(name='arg2_1', index=c0, size=(32768,))]
buf3.group.device = cpu
buf3.group.iteration = ((32768,), ())
buf3.sizes = ([32768], [])
buf3.mutations = ['buf2']
class buf3_loop_body:
    var_ranges = {z0: 32768}
    index0 = z0
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('buf2', get_index)
        get_index_1 = self.get_index('index0')
        load_1 = ops.load('arg2_1', get_index_1)
        add = ops.add(load, load_1)
        get_index_2 = self.get_index('index0')
        store = ops.store('buf3', get_index_2, add, None)
        return store

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92247
Approved by: https://github.com/ngimel, https://github.com/desertfire, https://github.com/jansel
2023-01-29 05:35:21 +00:00
900f8886e2 inductor: make as_strided support non-contiguous input and always fix it's input layout using eager stride (#92063)
Given the following small case:

```
import torch
import torch._dynamo

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        return torch.as_strided(x + 1, (8, 384, 2, 20, 12), (153600, 1, 61440, 384, 7680))+ 2

x = torch.randn(8, 384, 20, 20).to(memory_format=torch.channels_last)
model= Model().eval()
model = model.to(memory_format=torch.channels_last)
ref = model(x)

with torch.no_grad():
    opt_model = torch._dynamo.optimize('inductor')(model)

with torch.no_grad():
    for i in range(2):
        y1 = opt_model(x)

print(torch.equal(ref, y1))

```

inductor always gets a wrong result:

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<8; i0+=1)
            {
                #pragma GCC ivdep
                for(long i1=0; i1<384; i1+=1)
                {
                    #pragma GCC ivdep
                    for(long i2=0; i2<400; i2+=1)
                    {
                        auto tmp0 = in_ptr0[i1 + (384*i2) + (153600*i0)];
                        auto tmp1 = static_cast<float>(1);
                        auto tmp2 = tmp0 + tmp1;
                        out_ptr0[i2 + (400*i1) + (153600*i0)] = tmp2;
                    }
                }
            }
        }
        {
            #pragma omp for  collapse(2)
            for(long i0=0; i0<8; i0+=1)
            {
                for(long i1=0; i1<2; i1+=1)
                {
                    for(long i2=0; i2<5760; i2+=1)
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(out_ptr0 + (16*i2) + (61440*i1) + (153600*i0));
                        auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(2));
                        auto tmp2 = tmp0 + tmp1;
                        tmp2.store(out_ptr1 + (16*i2) + (92160*i1) + (184320*i0));
                    }
                    #pragma omp simd simdlen(8)
                    for(long i2=92160; i2<92160; i2+=1)
                    {
                        auto tmp0 = out_ptr0[i2 + (61440*i1) + (153600*i0)];
                        auto tmp1 = static_cast<float>(2);
                        auto tmp2 = tmp0 + tmp1;
                        out_ptr1[i2 + (92160*i1) + (184320*i0)] = tmp2;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf0 = empty_strided((8, 384, 20, 20), (153600, 400, 20, 1), device='cpu', dtype=torch.float32)
    buf1 = empty_strided((8, 384, 2, 20, 12), (184320, 1, 92160, 384, 7680), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()), c_void_p(buf1.data_ptr()))
    del arg0_1
    return (buf1, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((8, 384, 20, 20), (153600, 1, 7680, 384), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1]))

```

The reason is that the input is always converted to a contiguous layout at the **as_strided** lowering step, which is not aligned with the eager-mode input stride.

```
class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[8, 384, 20, 20]):
        # File: model_test.py:52, code: return torch.as_strided(x + 1, (8, 384, 2, 20, 12), (153600, 1, 61440, 384, 7680))+ 2
        add: f32[8, 384, 20, 20] = torch.ops.aten.add.Tensor(arg0_1, 1);  arg0_1 = None
        as_strided: f32[8, 384, 2, 20, 12] = torch.ops.aten.as_strided.default(add, [8, 384, 2, 20, 12], [153600, 1, 61440, 384, 7680]);  add = None
        add_1: f32[8, 384, 2, 20, 12] = torch.ops.aten.add.Tensor(as_strided, 2);  as_strided = None
        return (add_1,)

```

This PR always fixes the **as_strided** input stride to the eager-mode stride, and also makes **as_strided** support channels_last input:

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<76800; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
                auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(1));
                auto tmp2 = tmp0 + tmp1;
                tmp2.store(out_ptr0 + 16*i0);
            }
            #pragma omp for simd simdlen(8)
            for(long i0=1228800; i0<1228800; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = static_cast<float>(1);
                auto tmp2 = tmp0 + tmp1;
                out_ptr0[i0] = tmp2;
            }
        }
        {
            #pragma omp for  collapse(2)
            for(long i0=0; i0<8; i0+=1)
            {
                for(long i1=0; i1<2; i1+=1)
                {
                    for(long i2=0; i2<5760; i2+=1)
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(out_ptr0 + (16*i2) + (61440*i1) + (153600*i0));
                        auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(2));
                        auto tmp2 = tmp0 + tmp1;
                        tmp2.store(out_ptr1 + (16*i2) + (92160*i1) + (184320*i0));
                    }
                    #pragma omp simd simdlen(8)
                    for(long i2=92160; i2<92160; i2+=1)
                    {
                        auto tmp0 = out_ptr0[i2 + (61440*i1) + (153600*i0)];
                        auto tmp1 = static_cast<float>(2);
                        auto tmp2 = tmp0 + tmp1;
                        out_ptr1[i2 + (92160*i1) + (184320*i0)] = tmp2;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf0 = empty_strided((8, 384, 20, 20), (153600, 1, 7680, 384), device='cpu', dtype=torch.float32)
    buf1 = empty_strided((8, 384, 2, 20, 12), (184320, 1, 92160, 384, 7680), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()), c_void_p(buf1.data_ptr()))
    del arg0_1
    return (buf1, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((8, 384, 20, 20), (153600, 1, 7680, 384), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1]))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92063
Approved by: https://github.com/jansel
2023-01-29 05:30:59 +00:00
cac1912bfb Add some more missing moves to aten functorch (#93098)
Add a couple of additional moves to aten functorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93098
Approved by: https://github.com/ezyang
2023-01-29 04:50:57 +00:00
61fd1188ba [Export] Remove the concept of Scalar in export schema (#93211)
Scalar is a union type of [int, float, bool]; it is only needed for representing the operator schema.

During export, we always have the concrete argument. As ex.Argument is already a union type, we don't need the Scalar type anymore.

Example
Here's the schema for aten.add.Scalar
```
add.Scalar(Tensor self, Scalar other, Scalar alpha=1) -> Tensor
```
A fx.node
```
add_tensor: f32[s0, s0] = torch.ops.aten.add.Scalar(arg0, 1.1)
```

would be exported as
```
            Node(
                op='call_function',
                target='aten.add.Tensor',
                args=[
                    Argument(as_tensor=TensorArgument(name='arg0')),
                    Argument(as_float=1.1)
                ],
                outputs=[
                    ReturnArgument(as_tensor=TensorArgument(name='add_tensor'))
                ]
            )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93211
Approved by: https://github.com/suo
2023-01-29 04:50:32 +00:00
68a1065bd7 [Export] Remove op filed from ex.Node schema (#93208)
Node can only be 'call_function' ops:
* 'placeholder' and 'output' are serialized as inputs and outputs of the Graph
* 'get_attr' is not needed anymore, as it's an implicit lookup from GraphModule's parameters/buffers
* 'call_method' and 'call_module' are not supported, as they are not used in the canonical FX Graph
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93208
Approved by: https://github.com/suo, https://github.com/Neilblaze
2023-01-29 04:35:46 +00:00
7cc91f4002 [vision hash update] update the pinned vision hash (#93189)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93189
Approved by: https://github.com/pytorchbot
2023-01-29 03:33:34 +00:00
cb817d6176 Fix endian handling in THPStorage_fromBuffer (#92834)
Fixes #92831

This PR fixes a test failure of `TestTorch.test_from_buffer` on a big-endian machine. The root cause of this failure is that the current `THPStorage_fromBuffer` does not perform endian handling correctly on big-endian machines.

In `THPStorage_fromBuffer`, the given buffer is stored as machine native-endian. Thus, if the specified byte order (e.g. `big`) is equal to the machine's native endianness, elements should not be swapped. However, in the current implementation, [`decode*BE()`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/utils/byte_order.cpp#L72-L109) always swaps elements regardless of the machine's native endianness (i.e. these methods assume the buffer is stored as little-endian).

Thus, this PR uses the following approaches:
- if the specified byte order (e.g. `big`) is equal to machine native-endian, call `decode*LE()` that does not swap elements by passing `torch::utils::THP_LITTLE_ENDIAN` to `THP_decode*Buffer()`.
- if the specified byte order (e.g. `big`) is not equal to machine native-endian, call `decode*BE()` that always swap elements by passing `torch::utils::THP_BIG_ENDIAN` to `THP_decode*Buffer()`.

After applying this PR to the master branch, I confirmed that the test passes on a big-endian machine.

```
% python test/test_torch.py TestTorch.test_from_buffer
/home/ishizaki/PyTorch/master/test/test_torch.py:6367: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  self.assertEqual(torch.ByteStorage.from_buffer(a).tolist(), [1, 2, 3, 4])
...
/home/ishizaki/PyTorch/master/test/test_torch.py:6396: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  self.assertEqual(bytes.tolist(), [1, 2, 3, 4])
.
----------------------------------------------------------------------
Ran 1 test in 0.021s

OK
```
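
For illustration, a small Python-level sketch of what the byte-order argument controls (the buffer contents below are made-up example values):

```
import torch

buf = bytearray([0x00, 0x01, 0x00, 0x02])
# Interpreted as big-endian 16-bit values -> [1, 2] on any machine
big = torch.ShortStorage.from_buffer(buf, 'big')
# Interpreted as little-endian 16-bit values -> [256, 512] on any machine
little = torch.ShortStorage.from_buffer(buf, 'little')
print(big.tolist(), little.tolist())
```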

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92834
Approved by: https://github.com/ezyang
2023-01-29 00:55:54 +00:00
cyy
1e0c57b645 More fixes found in tidy and libc++ (#93138)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93138
Approved by: https://github.com/Skylion007
2023-01-28 20:55:16 +00:00
4ca511c69e Fix positional issues in dedup guards (#93137)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93137
Approved by: https://github.com/bertmaher, https://github.com/wconstab, https://github.com/bdhirsh
2023-01-28 19:21:32 +00:00
ef988c2b37 Add post cleanup step for MacOS (#93126)
This goes together with https://github.com/pytorch/test-infra/pull/1548 to clean up the MacOS M1 runner after the workflow finishes.  I'm referring to my test branch here to test https://github.com/pytorch/test-infra/pull/1548.  Once that PR is merged, I will switch to the main branch, i.e. `pytorch/test-infra/.github/actions/setup-miniconda@main` and `pytorch/test-infra/.github/actions/check-disk-space@main`

In the future, if there are more steps that need to be done after the MacOS workflow finishes, this can also be refactored into a separate action like `teardown-linux`.  There is only one step at the moment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93126
Approved by: https://github.com/ZainRizvi
2023-01-28 17:53:20 +00:00
cfb160185e Update ROCm CI builds to 5.4.2 (#93163)
PR https://github.com/pytorch/pytorch/pull/92972 was meant to upgrade to ROCm5.4.2, not ROCm5.4. This PR rectifies that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93163
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2023-01-28 17:34:51 +00:00
648202ceb9 Improve DDPOptimizer by avoiding small preamble graph (#93162)
This optimizes an edge case where some compute-only ops (e.g. add)
could end up in an orphan graph at the input side due to the bucket
for the next graph being full already.  The fix is to fuse this
graph (which is "empty" in parameter count) together with the adjoining
"full" bucket.

Note: I encountered this when trying to repro some suspected duplicate
argument errors, but this is unrelated and I have not yet repro'd
a duplicate arg issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93162
Approved by: https://github.com/davidberard98
2023-01-28 15:33:53 +00:00
f40183d374 Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally (#93192)
Fix C10_CUDA_CHECK for failing to capture last cuda error occasionally

This error was accidentally introduced by #92227, which was trying to fix #91758 as introduced in #85256.

The unit test `TestCuda.test_events_multi_gpu_elapsed_time` has been failing since that PR got merged (in cuda 11.8 and cuda 12.0). That test requires >=2 GPUs, so it's probably not tested in the OSS CI?
```
python test/test_cuda.py -v -k TestCuda.test_events_multi_gpu_elapsed_time
```

E.g. in https://github.com/pytorch/pytorch/actions/runs/4026926691/jobs/6922406192
```
2023-01-27T19:41:32.2312162Z   test_events_multi_gpu_elapsed_time (__main__.TestCuda) ... skip: detected only one GPU (0.001s)
```

The original C10_CUDA_CHECK before #85256 has an extra `cudaGetLastError` that captures those cuda errors, https://github.com/pytorch/pytorch/pull/85256/files#diff-0823e63e781acf56e93a5553ed7feee0db0bda05d86e2560c7b80e87e32e0024L41-L42

This extra `cudaGetLastError` was originally introduced in #17337. As commented here https://github.com/pytorch/pytorch/pull/17337/files#r259104503

> soumith on Feb 21, 2019:
Without this, a previously raised error was still lingering and falsely being triggered for a subsequent CUDA call. colesbury suggested that this is the right thing to do.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93192
Approved by: https://github.com/ezyang
2023-01-28 09:06:10 +00:00
aac9e5288f Increase test multiprocessing waiting time (#93183)
Fixes https://github.com/pytorch/pytorch/issues/67002

This is a follow-up from https://github.com/pytorch/pytorch/pull/91459 which fixed the flaky test everywhere excepts ROCm and MacOS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93183
Approved by: https://github.com/clee2000
2023-01-28 07:59:59 +00:00
72502b94f3 correct use of torch.backends.cudnn.flags() (#93182)
Fixes #77467.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93182
Approved by: https://github.com/ngimel
2023-01-28 06:50:06 +00:00
a62fc09a1f [Quant] Add fused conv2d_add op for onednn backend (#90262)
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused `conv2d_add` op for the onednn backend, which will be used for int8 inference with the onednn backend. This op cannot be called with other quantization backends; otherwise an error is thrown.

**Test Plan**
```
python -m pytest test_quantization.py::TestQuantizedConv
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90262
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-01-28 06:30:29 +00:00
00b3f22210 Add missing scalar example in docs of torch.where (#93145)
[`torch.where(condition, x, y)`](https://pytorch.org/docs/stable/generated/torch.where.html) accepts `x` and `y` as either `Tensor` or Scalar, but the Scalar example is missing in the docs. I simply add the example.
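
For reference, a minimal sketch of the scalar form (values are illustrative):

```
import torch

x = torch.randn(3, 2)
# 'other' passed as a Python scalar instead of a tensor
y = torch.where(x > 0, x, 0.0)
```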

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93145
Approved by: https://github.com/ngimel
2023-01-28 03:46:44 +00:00
ca8f5e177a Use the old aten underscored function for Predictor (#93096)
Summary:
Errors reported via https://fb.prod.workplace.com/groups/1405155842844877/permalink/6644919482201794/

The problem is that the scriptable op set between predictor and the latest build of master is different.

Test Plan: Sandcastle testing

Differential Revision: D42786069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93096
Approved by: https://github.com/mikekgfb
2023-01-28 03:14:18 +00:00
189ae948d3 [CI] Move XLA to Python-3.8 (#93178)
Depends on https://github.com/pytorch/xla/pull/4527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93178
Approved by: https://github.com/huydhn
2023-01-28 02:58:18 +00:00
2f0b0c5dd7 exponential_ few fixes (1) lambda > 0 (2) mkl kernel to continuous (3) better error log on dtype (#92891)
Exponential distribution is continuous. Fixes CPU MKL exponential implementation to exclude integer dtypes.

```python
import torch
dtypes = [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64]

for dtype in dtypes:
    x = torch.empty(10000, dtype=dtype).exponential_() # should fail !
    print("dtype: ", x.dtype, "sum: ", x.sum())
```

### Additional Context

Related to #92709. This issue propagates to OpInfo of exponential.

```
AssertionError: The supported dtypes for exponential on device type cpu are incorrect!
The following dtypes worked in forward but are not listed by the OpInfo: {torch.int64, torch.uint8, torch.int8, torch.int16, torch.int32}.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92891
Approved by: https://github.com/CaoE, https://github.com/jgong5, https://github.com/ngimel
2023-01-28 02:27:16 +00:00
42d4eca796 Update submodule kineto fix bazel1 (#92318)
Update kineto submodule and fix bazel build issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92318
Approved by: https://github.com/aaronenyeshi
2023-01-28 02:26:28 +00:00
b74a0fc486 Mark aten.flip and aten.alias as core aten op (#93130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93130
Approved by: https://github.com/qihqi, https://github.com/zhxchen17
2023-01-28 00:41:35 +00:00
4d107e3426 torch.export Logical Schema V1 (#93135)
This PR is for landing the initial version of logical schema.

See previous discussions in https://github.com/pytorch/pytorch/pull/91287

This is a starting point for iterations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93135
Approved by: https://github.com/suo
2023-01-28 00:35:06 +00:00
1ff292abe0 Make CPU inductor work with dynamic shapes (#93077)
These errors were found by looking at wav2vec2

See https://github.com/pytorch/pytorch/issues/91719

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93077
Approved by: https://github.com/voznesenskym, https://github.com/ngimel
2023-01-27 23:18:55 +00:00
a0ca9dc8ca [torchgen] Small fix for empty yaml file edge case (#92938)
Rely on CI.

Avoid issues such as:

```
Traceback (most recent call last):
  File "<string>", line 38, in <module>
  File "<string>", line 36, in __run
  File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 690, in <module>
    main()
  File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 626, in main
    parsed_yaml, custom_ops_parsed_yaml = parse_yaml_files(
  File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 505, in parse_yaml_files
    translate_native_yaml(
  File "/re_cwd/buck-out/v2/gen/fbcode/2841b324ed9b88dd/caffe2/torchgen/__gen_executorch__/gen_executorch#link-tree/torchgen/gen_executorch.py", line 448, in translate_native_yaml
    for e in native_es:
TypeError: 'NoneType' object is not iterable
```
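
A minimal sketch of the kind of guard that avoids this (the file name is illustrative, and this assumes an empty YAML file parses to `None`; it is not the exact change):

```
import yaml

with open("native_functions.yaml") as f:
    # yaml returns None for an empty file; fall back to an empty list so the
    # later "for e in native_es" loop has something to iterate over
    native_es = yaml.load(f, Loader=yaml.SafeLoader) or []

for e in native_es:
    print(e)
```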

Differential Revision: [D42729435](https://our.internmc.facebook.com/intern/diff/D42729435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92938
Approved by: https://github.com/JacobSzwejbka
2023-01-27 22:45:21 +00:00
75cfc0be21 Logcumsumexp for CPU (#93153)
Partial work from #90847, in the direction of solving #89205.
Most of the content is from #90847, but this is only for CPU, so hopefully it does not increase the build time by a lot.

tag: @albanD, @malfet
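
For reference, a small sketch of what the op computes (purely illustrative; unrelated to the kernel changes themselves):

```
import torch

x = torch.randn(5)
y = torch.logcumsumexp(x, dim=0)
# Numerically stable equivalent of log(cumsum(exp(x)))
ref = torch.log(torch.cumsum(torch.exp(x), dim=0))
torch.testing.assert_close(y, ref)
```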

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93153
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-01-27 22:29:33 +00:00
61457671a5 [quant][fx][be] Remove _input_output_observed from backend_config (#92589)
Summary:
This is no longer needed, we can use dtype to decide whether an observer is needed or not

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92589
Approved by: https://github.com/jcaip
2023-01-27 22:17:05 +00:00
58acab4616 [dynamo] support [tensor].type(torch.FloatTensor) (#93043)
for some tensor x, x.type(torch.FloatTensor) will essentially do the same thing as x.to(torch.float). x.type can be called with at least 3 types of inputs:
* a string "torch.FloatTensor"
* a dtype torch.float
* a tensor type torch.FloatTensor

The third option (torch.FloatTensor) fails in fx, because fx cannot trace torch.FloatTensor objects.  So this PR replaces the torch.FloatTensor type with the string "torch.FloatTensor".
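
A small sketch of the three call forms (illustrative):

```
import torch

x = torch.randn(4)
a = x.type("torch.FloatTensor")   # a string
b = x.type(torch.float)           # a dtype
c = x.type(torch.FloatTensor)     # a tensor type; dynamo now rewrites this to the string form
```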

Why not fix this in fx? It's possible, but I'm not sure of a nice way to do it. We would want to update [torch.fx.node.BaseArgumentTypes](d88bc38b0c/torch/fx/node.py (L17)) to contain torch.FloatTensor etc. We could hard-code a list of tensor types there (the types vary depending on build type, e.g. whether or not cuda tensors are available), but that's not great in case our hardcoded list differs from the actual list registered by python_tensor.cpp. Another option is to dynamically populate the list of types with `Union[tuple(...)]` and fill the tuple with `torch._tensor_classes` (which is directly populated by python_tensor.cpp), but apparently this breaks most typecheckers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93043
Approved by: https://github.com/jansel
2023-01-27 21:27:13 +00:00
35ea82541b Send float32 to a different GitHub issue (#93168)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93168
Approved by: https://github.com/Chillee, https://github.com/jansel
2023-01-27 19:55:06 +00:00
65d6802e2f Improve error messages for sparse methods on tensors with unsupported backends/layouts. (#93149)
Fixes https://github.com/pytorch/pytorch/issues/92790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93149
Approved by: https://github.com/cpuhrsch
2023-01-27 19:50:23 +00:00
27ab1dfc28 Remove print_test_stats, test_history, s3_stat_parser (#92841)
Pritam Damania no longer uses it (and is no longer with FB), and I don't know who else has interest in this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92841
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/seemethere
2023-01-27 18:11:42 +00:00
975feb606e [DDP][Easy] Remove unused var (#93128)
Removes this unused var. The overall buffer comm hook feature is also not being used; we should deprecate/remove it since it is still a private API.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93128
Approved by: https://github.com/awgu
2023-01-27 18:08:29 +00:00
4eb69af5af Upgrade CI to ROCm 5.4.2 (#92972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92972
Approved by: https://github.com/malfet
2023-01-27 17:57:33 +00:00
00f3e0d8c9 [ci] Set step level timeout (#93084)
Not super important, but it is nice for the logs because the logs now say "the action timed out" instead of "the action was cancelled".  It also makes the job status "failure" instead of "cancelled"

also adds timeout minutes as an input for rocm and mac tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93084
Approved by: https://github.com/huydhn
2023-01-27 17:52:33 +00:00
62aa4e096b Revert "Add cudnn install 8.7.0.84 for CUDA 11.8 (#93086)"
This reverts commit 3a10bf791f53c65e4c38c29e366b45504425832a.

Reverted https://github.com/pytorch/pytorch/pull/93086 on behalf of https://github.com/malfet due to Failures are related
2023-01-27 16:22:14 +00:00
d3049378be Repair the path to jni.h for libtorch windows build (#93057)
Fixes #86536

It seems like the file is not found when the environment is populated, so the BUILD_JNI flag is false.

To mark it as true, I had to add a `/pytorch/` when adding paths in `POSSIBLE_JAVA_HOMES`. This way, the file is found and the flag is true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93057
Approved by: https://github.com/malfet, https://github.com/Blackhex
2023-01-27 15:20:30 +00:00
64d0624cee Explicit Name needed to run with buck test (#93035)
Summary: Explicit Name needed to run with buck test

Test Plan: sandcastle

Differential Revision: D42763774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93035
Approved by: https://github.com/cpuhrsch
2023-01-27 14:36:46 +00:00
3a10bf791f Add cudnn install 8.7.0.84 for CUDA 11.8 (#93086)
Add cudnn install 8.7.0.84 for CUDA 11.8 .

Same as: https://github.com/pytorch/pytorch/pull/84964
Related to https://github.com/pytorch/builder/pull/1271
Test PR: https://github.com/pytorch/pytorch/pull/92971
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93086
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-01-27 13:13:31 +00:00
68a98537d5 [fix] nn c++ : segfault in modulelist and moduledict (#93074)
Fixes https://github.com/pytorch/pytorch/issues/73565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93074
Approved by: https://github.com/albanD
2023-01-27 12:20:19 +00:00
219e9533f0 Improve autograd doc on complex numbers (#93065)
A tiny change to fix formatting and clarify a bit in [this section](https://pytorch.org/docs/stable/notes/autograd.html#what-are-complex-derivatives).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93065
Approved by: https://github.com/albanD
2023-01-27 09:36:38 +00:00
5105a8d3fc Enable Kineto in OSS builds by fixing build condition (resubmit) (#93033)
Resubmit of https://github.com/pytorch/pytorch/pull/89174 . I think I fixed underlying issues back then, but only CI would tell.

Context: This PR enables Kineto on OSS builds because of how the flags were misconfigured before. I think generally having global observer in OSS is nice. There's some work to release on demand profiling with dynolog, and right now its build instructions start with "go change pytorch's CMake": https://github.com/facebookincubator/dynolog/blob/main/docs/pytorch_profiler.md#pytorch-setup

The previous PR was reverted because of the bug in Kineto that got fixed in https://github.com/pytorch/kineto/pull/696 (and the submodule was updated since)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93033
Approved by: https://github.com/kimishpatel
2023-01-27 08:58:03 +00:00
070163fb53 [inductor] Clean up TRITON_CACHE_DIR (#92879)
Summary:
As a follow up in https://github.com/pytorch/pytorch/pull/92664 (D42619405 (e6a8267cf5)), clean up the TRITON_CACHE_DIR settings. There are a few places touching TRITON_CACHE_DIR:

1. triton/fb/triton_util.py: when import triton
2. caffe2/torch/_inductor/codecache.py
3. caffe2/torch/_inductor/triton_ops/autotune.py
4. triton/triton/python/triton/compiler.py

IIUC there are two entry points:
* kernel.run(args): 1 -> 3 -> 4
* async_compile(kernel): 1 -> 2 -> 3 -> 4
* calling a triton jit-annotated func directly: 4

I'm removing the TRITON_CACHE_DIR in 1 and 2.

Test Plan: Run local repro

Differential Revision: D42694374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92879
Approved by: https://github.com/jansel
2023-01-27 08:08:27 +00:00
6fa84fdea2 [FX][Quant] Enable FX quant for patterns like x.view(x.size(...), ...) (#90001)
**Summary**
This work continues with https://github.com/pytorch/pytorch/pull/83784 by @vkuzo and includes all the changes in that PR.
Quote from https://github.com/pytorch/pytorch/pull/83784:
> Issue #83658 reports that ops followed by a certain pattern of `view` and `size` ops were not quantized correctly by FX graph mode quantization.
Before this PR, the "size" op was in the "op shares qparams with input" category, and the code assumed that the input of this op has the same dtype as its output. This led to incorrectly propagating the `int` dtype as the output of whichever op was preceding the `view` op, which in turn made that op blocklisted from quantization.

> The fix is to create a new category of ops which work on different dtypes of tensors but are not observed. This PR does so for `size`, and also for `shape` since it works the same way.
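
For illustration, a minimal module exhibiting the `x.view(x.size(...), ...)` pattern (the shapes and the linear layer are made up):

```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        x = self.linear(x)
        # 'size' produces ints that feed 'view'; this must not block
        # quantization of the preceding linear
        return x.view(x.size(0), -1)
```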

**Note**: This PR needs https://github.com/pytorch/pytorch/pull/91297 to be landed first otherwise there is a UT failure.

**Test plan**
```
python test/test_quantization.py -k test_linear_size_view
python test/test_quantization.py -k test_linear_shape_view
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90001
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-01-27 07:56:29 +00:00
a4238976a8 [FSDP][optim_state_dict] Ensure correct devices for tensors when doing all_gather (#92992)
When doing `_all_gather_optim_state`, we need to ensure that `step` tensors are on CPU and other tensors are on GPUs. This PR adds the logic to ensure this placement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92992
Approved by: https://github.com/fduwjj
2023-01-27 06:50:36 +00:00
8b1b47c36a [FSDP][optim_state_dict] Use all_gather to deal with uneven size tensors (#92991)
The current `_all_gather_optim_state` pads the uneven tensors, which is not necessary as `all_gather` supports uneven tensors. This PR removes the padding logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92991
Approved by: https://github.com/rohan-varma, https://github.com/awgu
2023-01-27 06:46:44 +00:00
cyy
f172feae0d More tidy fixes (#93069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93069
Approved by: https://github.com/Skylion007
2023-01-27 06:40:50 +00:00
5bae580502 Don't graph break on patched module methods (#93115)
Fix one case for https://github.com/pytorch/pytorch/pull/91018 since it's needed soon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93115
Approved by: https://github.com/angelayi
2023-01-27 06:14:44 +00:00
a2e0f8e529 [ FL-gradient quantization] Adding QNN unpack feature (#92714)
Summary: We are trying to add a new feature for quantized gradient computation which enables backward() function for QNNPACK

Test Plan: buck2 test //caffe2/test/quantization:quantization -- test_qlinear_qnnpack_free_memory_and_unpack

Differential Revision: D40927291

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92714
Approved by: https://github.com/digantdesai, https://github.com/jianyuh
2023-01-27 05:37:03 +00:00
661800a2cf Fix BC-breaking change introduced by #91499 (#93091)
This fixes BC-breaking changes introduced by https://github.com/pytorch/pytorch/pull/91499:
* Make the enum accept both `min` and `amin` values
* Reinstate testing

To reiterate
454361435c/torch/masked/_ops.py (L786)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93091
Approved by: https://github.com/ngimel
2023-01-27 03:58:35 +00:00
7fade4f771 fixing flag to skip nvfuser_tests build (#93080)
Slowly pushing cmake cleanup to upstream.

Avoids building nvfuser_tests when BUILD_TEST is disabled.
nvfuser_tests uses googletest from pytorch, which is only pulled in when BUILD_TEST is enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93080
Approved by: https://github.com/davidberard98, https://github.com/huydhn, https://github.com/malfet
2023-01-27 03:48:31 +00:00
e2739372eb [vision hash update] update the pinned vision hash (#93114)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93114
Approved by: https://github.com/pytorchbot
2023-01-27 03:29:39 +00:00
074f5ce0b7 Install Torchvision in all Linux shards (#93108)
Also skip `test_roi_align_dynamic_shapes` for cuda as introduced by https://github.com/pytorch/pytorch/pull/92667.  With Torchvision properly installed, the test fails with the following error:

```
2023-01-26T04:46:58.1532060Z   test_roi_align_dynamic_shapes_cuda (__main__.CudaTests) ... /var/lib/jenkins/workspace/test/inductor/test_torchinductor.py:266: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
2023-01-26T04:46:58.1532195Z   buffer = torch.as_strided(x, (x.storage().size(),), (1,), 0).clone()
2023-01-26T04:46:58.1532383Z     test_roi_align_dynamic_shapes_cuda errored - num_retries_left: 3
2023-01-26T04:46:58.1532479Z Traceback (most recent call last):
2023-01-26T04:46:58.1532725Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 1155, in run_node
2023-01-26T04:46:58.1532821Z     return node.target(*args, **kwargs)
2023-01-26T04:46:58.1533056Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_ops.py", line 499, in __call__
2023-01-26T04:46:58.1533160Z     return self._op(*args, **kwargs or {})
2023-01-26T04:46:58.1533304Z RuntimeError: Cannot call sizes() on tensor with symbolic sizes/strides
```

https://github.com/pytorch/pytorch/issues/93054 reveals a blindspot in the CI where Torchvision was only installed in the first and second shard.  The above test should show that failure as part of https://github.com/pytorch/pytorch/pull/92667, but then it was skipped because Torchvision was not installed (in the 3rd shard) for `test_roi_align` to run.  The test is still skipped here, but in a more explicit way.

Fixes https://github.com/pytorch/pytorch/issues/93054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93108
Approved by: https://github.com/clee2000, https://github.com/jjsjann123, https://github.com/nkaretnikov
2023-01-27 03:15:18 +00:00
025ef99ddf Get rid of dedicated inductor dynamic_shapes config (#93076)
Instead, use Dynamo dynamic_shapes config

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93076
Approved by: https://github.com/voznesenskym
2023-01-27 02:58:16 +00:00
f3fcc80622 [dtensor][7/N] remove backend in with_comms (#93040)
backend is not actually used anywhere, so we remove the backend option.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93040
Approved by: https://github.com/wz337
2023-01-27 02:53:27 +00:00
8b3e01cd30 [DTensor] implement dist_cat as a sharding prop rule (#92677)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92677
Approved by: https://github.com/wanchaol
2023-01-27 02:14:17 +00:00
24172eebac [ONNX] Export 'aten::index_put(self, mask, v)' when rank(mask) < rank(self) (#92862)
Fix #92540
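
For reference, a minimal sketch of the pattern this export path now covers; the module, shapes, and file name below are illustrative and not taken from the PR:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x, mask, v):
        # mask is rank 1 while x is rank 2: the row mask is broadcast over x
        out = x.clone()
        out[mask] = v
        return out

x = torch.zeros(3, 4)
mask = torch.tensor([True, False, True])   # rank(mask) < rank(x)
v = torch.ones(4)
torch.onnx.export(M(), (x, mask, v), "index_put_lower_rank_mask.onnx")
```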

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92862
Approved by: https://github.com/justinchuby
2023-01-27 02:00:56 +00:00
95dfad9d93 Add kwargs support to torch.export() API (#92013)
Fixes [#1997](https://github.com/pytorch/torchdynamo/issues/1997)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92013
Approved by: https://github.com/jansel
2023-01-27 01:58:51 +00:00
ae171cf623 [ci] Move sm86 from trunk to pull (#93085)
Experiment on capacity
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93085
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/ZainRizvi
2023-01-27 01:44:36 +00:00
d1807dc1f4 Fix topk IMA (#93095)
Hopefully, this will fix https://github.com/pytorch/pytorch/issues/93006. ~I cannot reproduce that issue: I can catch the IMA with compute sanitizer on a nightly build, but not on a source build of master. So there is no way for me to validate whether my fix is correct.~ Edit: Thanks to the help of @ptrblck, this fix is validated.

But by reading the code, I believe this is a similar issue to https://github.com/pytorch/pytorch/pull/83042, so I apply the same fix to `mbtopk::gatherTopK`. We can wait until tomorrow's nightly build to see if #93006 disappears.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93095
Approved by: https://github.com/ngimel
2023-01-27 01:39:49 +00:00
8d7f9e2f79 Make __deepcopy__ of GraphModule able to handle circular reference. (#93038)
Summary:
One place where such a circular reference can occur: `_load_state_dict_pre_hooks` contains a `_WrappedHook`, and the `_WrappedHook` holds a weakref to the same module.
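
As a generic illustration (not the GraphModule code itself), a custom `__deepcopy__` typically avoids infinite recursion on such cycles by registering the copy in the `memo` dict before copying its fields:

```python
import copy

class Node:
    def __init__(self):
        self.ref = None

    def __deepcopy__(self, memo):
        # Register the new object in memo *before* copying fields, so cycles
        # back to `self` resolve to the copy instead of recursing forever.
        new = self.__class__.__new__(self.__class__)
        memo[id(self)] = new
        new.ref = copy.deepcopy(self.ref, memo)
        return new

a = Node()
a.ref = a                      # circular reference
b = copy.deepcopy(a)
assert b.ref is b              # the cycle is preserved in the copy
```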

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93038
Approved by: https://github.com/jerryzh168
2023-01-27 01:19:59 +00:00
ceb44350cf [CI] Move parallel native builds to 3.8 (#93103)
As well as nightly docs builds
Followup after https://github.com/pytorch/pytorch/pull/92928
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93103
Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/kit1980
2023-01-26 23:29:48 +00:00
f6f46ba3bb [Reland] aot autograd explicitly errors on double backward (#92893)
This reverts commit fb980581a7b41a5ea570fcb03829463b806b3bbc.

Testing: `python benchmarks/dynamo/timm_models.py  --float32 --training --only=mobilevit_s --performance --inductor --disable-cudagraphs`

```
main:               memory: eager: 12.30 GB, dynamo: 12.28 GB, ratio: 1.00
+ #90896 reverted:  memory: eager: 12.30 GB, dynamo: 8.81 GB, ratio: 1.40
+ this PR:          memory: eager: 12.30 GB, dynamo: 8.81 GB, ratio: 1.40
```

For comparison, if we apply old version of this PR instead:
```
main:
+ #90896 reverted:         memory: eager: 12.30 GB, dynamo: 8.81 GB, ratio: 1.40
+ old version of this PR   memory: eager: 12.30 GB, dynamo: 10.36 GB, ratio: 1.19
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92893
Approved by: https://github.com/bdhirsh
2023-01-26 23:19:27 +00:00
913cf2908e Revert "Disable torch_jit_fuser_te for dynamo CI (#92945)"
This reverts commit 0fc2f9febb8147183bcf8321ea80ab8e48ced875.

Reverted https://github.com/pytorch/pytorch/pull/92945 on behalf of https://github.com/huydhn due to The test looks ok now after moving dynamo shard to 3.8 https://github.com/pytorch/pytorch/issues/92942, so trying to re-enable it
2023-01-26 21:41:17 +00:00
340811bf8d Torchinductor randn_like lowering (#93005)
Add lowering for randn_like; fixes https://github.com/pytorch/pytorch/issues/92368 by virtue of not taking a fallback path, although the 0-element prim stride is still incorrect. It would be nice to submit this as a decomposition, but that is blocked by https://github.com/pytorch/pytorch/issues/92920.
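
For intuition, the decomposition alluded to would look roughly like the sketch below (an illustrative Python equivalent, not the actual Inductor lowering):

```python
import torch

def randn_like_decomp(x: torch.Tensor) -> torch.Tensor:
    # new normal-random tensor matching the input's shape/dtype/device
    return torch.randn(x.shape, dtype=x.dtype, device=x.device)

x = torch.empty(4, 5, dtype=torch.float32)
y = randn_like_decomp(x)
assert y.shape == x.shape and y.dtype == x.dtype
```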

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93005
Approved by: https://github.com/ngimel
2023-01-26 21:35:27 +00:00
1b5bfe9dd1 Properly compute device for elementwise operations with CPU scalar tensor (#93073)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93073
Approved by: https://github.com/eellison, https://github.com/bdhirsh
2023-01-26 21:27:57 +00:00
1f352f7c1f Update flatbuffer test models to match pkl models (#93022)
Also regenerate upgrader with

```
python torchgen/operator_versions/gen_mobile_upgraders.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93022
Approved by: https://github.com/tugsbayasgalan
2023-01-26 21:17:57 +00:00
68a49322e7 [MacOS] Explicitly use cmake from cloned conda environment (#92737)
My first attempt to fix the `Library not loaded: @rpath/libzstd.1.dylib` issue on MacOS M1 in https://github.com/pytorch/pytorch/pull/91142 added some additional logging around the flaky error but didn't fix the issue, as I still see occurrences recently, for example

* e4d83d54a6

Looking at the log, I can see that:

* CMAKE_EXEC correctly points to `CMAKE_EXEC=/Users/ec2-user/runner/_work/_temp/conda_environment_3971491892/bin/cmake`
* The library is there under the executable rpath
```
ls -la /Users/ec2-user/runner/_work/_temp/conda_environment_3971491892/bin/../lib
...
2023-01-20T23:22:03.9761370Z -rwxr-xr-x    2 ec2-user  staff    737776 Apr 22  2022 libzstd.1.5.2.dylib
2023-01-20T23:22:03.9761630Z lrwxr-xr-x    1 ec2-user  staff        19 Jan 20 22:47 libzstd.1.dylib -> libzstd.1.5.2.dylib
...
```

Then calling cmake after that suddenly uses the wrong cmake from miniconda package cache:

```
2023-01-20T23:22:04.0636880Z + cmake ..
2023-01-20T23:22:04.1924790Z dyld[85763]: Library not loaded: @rpath/libzstd.1.dylib
2023-01-20T23:22:04.1925540Z   Referenced from: /Users/ec2-user/runner/_work/_temp/miniconda/pkgs/cmake-3.22.1-hae769c0_0/bin/cmake
```

This is weird, so my second attempt is more explicit and uses the correct cmake executable via `CMAKE_EXEC`. Maybe something manipulates the global path in between, making `/Users/ec2-user/runner/_work/_temp/miniconda/pkgs/cmake-3.22.1-hae769c0_0/bin/cmake` come first in the PATH.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92737
Approved by: https://github.com/ZainRizvi
2023-01-26 21:07:41 +00:00
15c46eb89b Remove try catch in test_torchinductor (#93004)
I think this was holdover compat code from multiple repros. We should error on failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93004
Approved by: https://github.com/ngimel
2023-01-26 20:57:36 +00:00
17803fb36e Make meshgrid support symbolic shapes (#93075)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93075
Approved by: https://github.com/Skylion007
2023-01-26 20:57:29 +00:00
5de19dd348 Don't copy name_to_input in OutputGraph (#93034)
This copy isn't necessary and regressed tracing Adam by ~10s with a 1000 parameter model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93034
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-01-26 20:53:46 +00:00
f30787e52d Update XLA docker image to v0.8 (#93041)
Given the context in https://github.com/pytorch/xla/pull/4489, we now have a new XLA Docker image `v0.8`. This should fix the flaky sccache initialization failures with XLA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93041
Approved by: https://github.com/malfet
2023-01-26 20:18:37 +00:00
d9f0d14835 Update RELEASE.md with pinning xla and builder PRs (#93079)
Provide example PRs necessary for pinning xla and builder repos for release

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93079
Approved by: https://github.com/malfet, https://github.com/kit1980
2023-01-26 20:11:30 +00:00
0e92bbe5b1 Add sparse COO tensor support to torch.sum(dim=..., keepdim=...) (#92979)
Fixes #92757, #86232
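
A quick illustration of the newly supported call (values are illustrative; whether the result is returned as sparse or dense depends on the implementation, so the check below handles both):

```python
import torch

dense = torch.tensor([[0., 2., 0.],
                      [3., 0., 4.]])
sparse = dense.to_sparse()            # sparse COO layout

s = torch.sum(sparse, dim=1, keepdim=True)   # dim/keepdim now supported for COO
assert s.shape == (2, 1)
expected = dense.sum(dim=1, keepdim=True)
assert torch.equal(s.to_dense() if s.is_sparse else s, expected)
```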

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92979
Approved by: https://github.com/cpuhrsch
2023-01-26 18:42:51 +00:00
ca2a23c243 [BE][CI] Move more builds from 3.7 to 3.8 (#92928)
Part of https://github.com/pytorch/pytorch/issues/80513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92928
Approved by: https://github.com/weiwangmeta, https://github.com/ZainRizvi
2023-01-26 18:13:16 +00:00
729f1a8ef2 Setup shebang and set -x on generated runner script (#93007)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93007
Approved by: https://github.com/williamwen42
2023-01-26 16:52:38 +00:00
7012d985fa Revert "Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)"
This reverts commit 46f16b93636615a81242b0d5cded84c5a57fd2e2.

Reverted https://github.com/pytorch/pytorch/pull/88078 on behalf of https://github.com/ZainRizvi due to Causing a test to fail consistently: test_decomp.py::HasDecompTest::test_has_decomposition
2023-01-26 16:22:29 +00:00
3888555fa1 Apply some more missing moves in aten native (#92983)
Add some additional missing moves to further improve vmap and related operators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92983
Approved by: https://github.com/ezyang
2023-01-26 15:52:16 +00:00
7e449e8ba7 Fix some silly Inductor bugs (#92997)
We should probably figure out how to get type checking going; it would have
caught these cases.

Discovered in pursuit of https://github.com/pytorch/pytorch/issues/91719
though this is not enough.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92997
Approved by: https://github.com/Chillee
2023-01-26 15:31:54 +00:00
abcaa05f55 Revert spurious submodule change from #92107 (#93067)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93067
Approved by: https://github.com/DanilBaibak, https://github.com/Skylion007, https://github.com/malfet
2023-01-26 14:57:36 +00:00
5e9fa0a8fc Mark crossvit_9_240 as passing dynamic=True (#92981)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92981
Approved by: https://github.com/Chillee
2023-01-26 13:05:37 +00:00
1d03a6a901 [Quant][Fx] Fix issue: qconfig_mappings of onednn backend are not correctly set for fused modules (#91297)
**Summary**
For onednn quantization backend only.
Currently, FX fusion requires that all separate ops in a fused module/op have the same `qconfig`. To support `linear - leaky_relu` and `linear - tanh` fusion with onednn backend, we previously explicitly set the same `qconfig` to `linear`, `leaky_relu` and `tanh`. However, this brings two problems:
- It breaks fusion of `linear - relu` since `relu` does not have the same `qconfig` as `linear` does. And it does not look good if we set `qconfig` to all these ops. They should use a global `qconfig` by default.
- `Tanh` requires `fixed_qparams_qconfig` otherwise it is not quantized. So, we cannot set another `qconfig` to `tanh`.

Looks like there is not a straightforward way to solve the problems. This PR fixes them by the following:
- Do not set `qconfig` to these ops so that these ops use a global `qconfig` and `linear - relu` and `linear - leaky_relu` can be fused correctly.
- Set the same `qconfig` to `linear` and `tanh` manually by users when they want to fuse `linear - tanh` with onednn backend.

A known issue still exists: users cannot fuse `linear - tanh` and quantize standalone `tanh` at the same time.
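
A hedged sketch of the user-side workaround described above, assuming the FX `QConfigMapping` API; the backend string and qconfig choice are illustrative and the prepare/convert steps are elided:

```python
import torch
from torch.ao.quantization import QConfigMapping, get_default_qconfig

qconfig = get_default_qconfig("onednn")

# Keep the global qconfig so linear - relu / linear - leaky_relu fuse as usual,
# and manually pin the same qconfig on Linear and Tanh when fusing linear - tanh.
qconfig_mapping = (
    QConfigMapping()
    .set_global(qconfig)
    .set_object_type(torch.nn.Linear, qconfig)
    .set_object_type(torch.nn.Tanh, qconfig)
)
```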

**Test plan**
python test/test_quantization.py -k test_qconfig_dict_with_fused_modules

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91297
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-01-26 09:55:34 +00:00
913866efbf [PT-D][TP] Fix TP API for FQN path based parallelization (#93029)
We had not tested dict-based parallelize_module, and it turns out we had mistakes here.

1. Fix the error.
2. Add unit test cases for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93029
Approved by: https://github.com/wz337
2023-01-26 09:10:21 +00:00
46f16b9363 Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)
As per title.

Additionally we also introduce support for:
- Rectangular block sizes which are powers of 2 and at least 16 (triton's `dot` limitation).
- Batch support with broadcasting for either of the arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88078
Approved by: https://github.com/cpuhrsch
2023-01-26 07:58:27 +00:00
4c074ddfd2 [functorch][reland] vmap: bitwise operators (#92836)
Previous PR: #91971

Fixes: https://github.com/pytorch/functorch/issues/1069
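
A small usage example of what now works (tensor values are arbitrary):

```python
import torch

a = torch.randint(0, 16, (4, 3))
b = torch.randint(0, 16, (4, 3))

# bitwise ops now have batching rules, so vmap no longer needs a fallback
out = torch.func.vmap(torch.bitwise_and)(a, b)
assert torch.equal(out, a & b)
```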

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92836
Approved by: https://github.com/Chillee
2023-01-26 06:12:47 +00:00
ccad2e5000 Include cublasLt as an option in max_autotune mode (#92915)
Differential Revision: D42720376 (has some internal results)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92915
Approved by: https://github.com/Chillee
2023-01-26 06:08:17 +00:00
d88bc38b0c [functorch] fix batching rule for dropout (#92975)
Fixes https://github.com/pytorch/pytorch/issues/92283

The repro now works:
```python
import torch
import torch.func
import torch.nn as nn

x = torch.randn(3, device='cuda')
y = torch.randn(1, 3, device='cuda')

def fn(x, y):
    # previously output of dropout used to be incorrect [B, 3] (B=1) and thus `mean(1)` used to fail
    # post the fix output of dropout is [B, 1, 3] and `mean(1)` works.
    return x + nn.functional.dropout(y, 0.3).mean(1)

o = torch.func.vmap(fn, in_dims=(0, None), randomness='different')(x, y)
```

**NOTE**:
`native_dropout_batching_rule(const Tensor& tensor, double p, c10::optional<bool> train)` was called only for CUDA tensors. Hence this issue only affected CUDA tensors and not CPU tensors.

Ref:
a6ac922eab/aten/src/ATen/functorch/PyTorchOperatorHacks.cpp (L251-L258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92975
Approved by: https://github.com/Chillee, https://github.com/Skylion007
2023-01-26 05:07:26 +00:00
77f336600a [PT-D] Enable Meta Tensor Support for DTensor (#92652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92652
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2023-01-26 04:54:57 +00:00
e714e37a06 [optim][sgd] default to foreach when CUDA + differentiable=False (#92730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92730
Approved by: https://github.com/albanD
2023-01-26 04:52:58 +00:00
8c9f745af1 [foreach] guard default support on native tensors only (#92923)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92923
Approved by: https://github.com/ngimel, https://github.com/crcrpar
2023-01-26 04:52:58 +00:00
c9ce0e63e8 [Dynamo] Support context wrapping(e.g, torch.no_grad) on nested functions w/o closure (#92922)
Fixes 14k github models: https://github.com/jansel/pytorch-jit-paritybench/blob/master/generated/test_ELEKTRONN_elektronn3.py
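
A hedged sketch of the kind of pattern this enables, assuming `torch.compile` as the entry point; the function bodies are illustrative:

```python
import torch

def outer(x):
    @torch.no_grad()
    def inner(y):              # nested function (no closure) wrapped by a context manager
        return y * 2
    return inner(x) + 1

compiled = torch.compile(outer)
print(compiled(torch.ones(3)))
```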

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92922
Approved by: https://github.com/jansel, https://github.com/mlazos
2023-01-26 04:23:35 +00:00
a6b51448f5 [Dynamo] Supports if condition on user defined object (#90892)
Fixes Meta internal user case, see the pattern in unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90892
Approved by: https://github.com/jansel, https://github.com/mlazos
2023-01-26 04:19:32 +00:00
819bd5b77a [nn] add set_to_none flag for C++ optim endpoint (#92989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92989
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2023-01-26 04:16:52 +00:00
dbeb513192 [vision hash update] update the pinned vision hash (#92937)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92937
Approved by: https://github.com/pytorchbot
2023-01-26 04:02:28 +00:00
68f198913a Revert "Mark XLA Linux jobs as unstable temporarily (#92634)"
This reverts commit 3cc103132205820fc0c571e3e68dd5e9b5b85727.

Reverted https://github.com/pytorch/pytorch/pull/92634 on behalf of https://github.com/huydhn due to XLA has been forward fixed by 341613fc14
2023-01-26 03:59:51 +00:00
f646126ecd Running timm benchmarks no longer silently retries (#93030)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93030
Approved by: https://github.com/eellison
2023-01-26 03:44:38 +00:00
d322f82b05 Add @count util to torch, use it to track benchmark stats (#93013)
<img width="1333" alt="image" src="https://user-images.githubusercontent.com/4755252/214687911-f766f072-c162-4298-9aed-c889f1375336.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93013
Approved by: https://github.com/ezyang
2023-01-26 03:09:12 +00:00
c11b301bcd [NVFUSER] refactor nvfuser build (#89621)
This PR is the first step towards refactoring the build for nvfuser in order to have the codegen be a standalone library.

Contents inside this PR:
1. nvfuser code base has been moved to `./nvfuser`, from `./torch/csrc/jit/codegen/cuda/`, except for registration code for integration (interface.h/interface.cpp)
2. splits the build system so nvfuser is generating its own `.so` files. Currently there are:
    - `libnvfuser_codegen.so`, which contains the integration, codegen and runtime system of nvfuser
    - `nvfuser.so`, which is nvfuser's python API via pybind. Python frontend is now exposed via `nvfuser._C.XXX` instead of `torch._C._nvfuser`
3. nvfuser cpp tests is currently being compiled into `nvfuser_tests`
4. cmake is refactored so that:
    - nvfuser now has its own `CMakeLists.txt`, which is under `torch/csrc/jit/codegen/cuda/`.
    - nvfuser backend code is not compiled inside `libtorch_cuda_xxx` any more
    - nvfuser is added as a subdirectory under `./CMakeLists.txt` at the very end after torch is built.
    - since nvfuser has dependency on torch, the registration of nvfuser at runtime is done via dlopen (`at::DynamicLibrary`). This avoids circular dependency in cmake, which will be a nightmare to handle. For details, look at `torch/csrc/jit/codegen/cuda/interface.cpp::LoadingNvfuserLibrary`

Future work that's scoped in following PR:
- Currently, since nvfuser codegen has a dependency on torch, we need to refactor that out so we can move nvfuser into a submodule and not rely on dlopen to load the library. @malfet
- Since we moved nvfuser into a cmake build, we effectively disabled the bazel build for nvfuser. This could impact internal workloads at Meta, so we need to put support back. cc'ing @vors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89621
Approved by: https://github.com/davidberard98
2023-01-26 02:50:44 +00:00
0a57a20c02 [caffe2] Fix pybind11 native python link error (#92325)
Summary:
Currently, we define some C++ functions in one C++ Python extension
which are used by another.  This happens to work, but isn't guaranteed to.
This diff moves these functions to a separate C++ library rule to fix this.

Test Plan: CI

Differential Revision: D42552515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92325
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2023-01-26 02:33:17 +00:00
341613fc14 Move the pin to latest to unbreak the xla CI (#93000)
This should unbreak the XLA CI since we disabled the failing test on our end.

@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93000
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2023-01-26 02:13:49 +00:00
32bcb97c7a [package] Add better debugging for torch.package (#92939)
Summary:
Makes torch.package debugging more transparent by
1. Pointing out modules in the standard library that are not implicitly externed.
2. Creating a debug mode for users to find the source of broken modules.

Test Plan: Run package tests

Differential Revision: D42728753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92939
Approved by: https://github.com/kurman
2023-01-26 02:11:12 +00:00
22b6a5fda9 Update base docker image tags for ROCm CI (#90694)
to make them agnostic of ubuntu version, ROCm version and python minor version.

This should help avoid frequent updates to the docker image tags when upgrading ROCm version in PyTorch CI, which has creation of new ECR tags as a blocking step.

Reference: https://github.com/pytorch/pytorch/pull/88297#issuecomment-1307873280

The BUILD_ENVIRONMENT flag will continue to specify the exact versions for the above, in case it is needed for debug. @malfet @seemethere Hope that's not going away, otherwise we might have a harder time debugging issues where we need to figure out these environment details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90694
Approved by: https://github.com/malfet
2023-01-26 02:00:15 +00:00
cee5174d44 Add test tracking operators without decompositions (#90887)
This test inspects the dispatcher directly, so it captures operators without
`OpInfo`, including internal helper operators and backward operators that might
appear in a trace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90887
Approved by: https://github.com/ezyang
2023-01-26 01:44:42 +00:00
345695e8f7 Remove PY37 from binary build matrix (#92919)
Similar to https://github.com/pytorch/test-infra/pull/1416 but for binary build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92919
Approved by: https://github.com/atalman
2023-01-26 01:25:47 +00:00
1af9231c98 Replace IndexingDiv with FloorDiv in test_torchinductor (#93003)
Holdover from https://github.com/pytorch/pytorch/pull/92878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93003
Approved by: https://github.com/ngimel
2023-01-26 01:23:09 +00:00
1f55f3b0de Solving the under/overflow for complex division (#92539)
Fixes #92043.
I'm following numpy's implementation as suggested by @min-jean-cho.
I found that this implementation still overflows when working with numbers greater than `finfo.max / 2`, but this is still much better than the previous implementation, which overflows for numbers greater than `finfo.max ** 0.5`.
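
For intuition, a rough Python sketch of the scaled-division idea (Smith's algorithm) that numpy-style implementations follow; the actual change lives in the C++/CUDA kernels, and the numbers below are only chosen to show the effect:

```python
def smith_div(a, b, c, d):
    """(a + bi) / (c + di) without forming c*c + d*d directly."""
    if abs(c) >= abs(d):
        r = d / c
        denom = c + d * r
        return (a + b * r) / denom, (b - a * r) / denom
    else:
        r = c / d
        denom = c * r + d
        return (a * r + b) / denom, (b * r - a) / denom

# The naive (a*c + b*d) / (c*c + d*d) yields nan here because c*c overflows.
print(smith_div(1e308, 1e308, 1.5e308, 1e-308))   # finite: roughly (0.667, 0.667)
```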

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92539
Approved by: https://github.com/lezcano
2023-01-26 01:14:06 +00:00
b90496eef5 [nn] zero_grad() set_to_none default True (#92731)
Attempts to fix #92656

BC-breaking! This changes the default of zero_grad in optim and in nn to set grads to None instead of zero tensors. We are changing the default because there are proven perf wins and existing code has typically not regressed due to this change. (We will probably have to flesh out this note more.)
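
A minimal before/after-style check of the new default (assuming the `set_to_none` flag name as in the Python API):

```python
import torch

model = torch.nn.Linear(4, 2)
model(torch.randn(1, 4)).sum().backward()

model.zero_grad()                    # new default: grads become None
assert model.weight.grad is None

model(torch.randn(1, 4)).sum().backward()
model.zero_grad(set_to_none=False)   # old behavior: keep zero-filled grad tensors
assert torch.all(model.weight.grad == 0)
```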

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92731
Approved by: https://github.com/ngimel
2023-01-26 01:04:28 +00:00
5441f2c067 Fix DDPOptimizer fake_mode execution (#92986)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* __->__ #92986

When running compiled submods for the purpose of producing outputs to pass
to the compilation step for the next submod, we use fake parameters and
assume fake inputs, but we forgot to activate our fake_mode during execution.

This caused certain edge cases where tensors other than activations or parameters
got created during execution, such as scalar->tensor expansion in the case
of executing torch.where(tensor, scalar, scalar).

Also add a test and clarify behavior of DDPOptimizer via comments.

Fixes #92941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92986
Approved by: https://github.com/bdhirsh
2023-01-26 00:37:54 +00:00
e7b7e8dc3d [SDPA] Remove unused rng_engine_inputs (#93024)
The unused variable in `fmha_api.cpp` [here](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/transformers/cuda/flash_attn/fmha_api.cpp#L313) was causing build failures (internally) due to the `-Wunused-variable` flag being used. For example:
```
[2023-01-24T20:32:00.241-08:00] Stderr: aten/src/ATen/native/transformers/cuda/flash_attn/fmha_api.cpp:313:25: error: unused variable 'rng_engine_inputs' [-Werror,-Wunused-variable]
[CONTEXT] [2023-01-24T20:32:00.241-08:00]     at::PhiloxCudaState rng_engine_inputs;
[CONTEXT] [2023-01-24T20:32:00.241-08:00]                         ^
[2023-01-24T21:09:33.507-08:00] Stderr: aten/src/ATen/native/transformers/cuda/flash_attn/fmha_api.cpp:313:25: error: unused variable 'rng_engine_inputs' [-Werror,-Wunused-variable]
[CONTEXT] [2023-01-24T21:09:33.507-08:00]     at::PhiloxCudaState rng_engine_inputs;
[CONTEXT] [2023-01-24T21:09:33.507-08:00]
```
This PR removes that unused variable. Mirroring this same patch made by @drisspg internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93024
Approved by: https://github.com/drisspg
2023-01-26 00:10:26 +00:00
dd05f028e2 [PT-D][Checkpoint] Rename DCP storage layer init() (#92869)
Rename DCP storage layer init() and update tests accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92869
Approved by: https://github.com/kumpera
2023-01-25 23:52:45 +00:00
b0f3736fa2 [BE][CI] symlink .jenkins to .ci (#92846)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92846
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-01-25 23:47:38 +00:00
b453adc945 [BE][CI] rename .jenkins (#92845)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92845
Approved by: https://github.com/clee2000
2023-01-25 23:47:38 +00:00
67689c823f refactor: move dynamo/TorchXLA bridge to pytorch/xla repo (#92601)
This is a follow up from the previous PR: https://github.com/pytorch/pytorch/pull/88449 , to move the dynamo/TorchXLA bridge from pytorch repo to xla repo.

Overall the dynamo/TorchXLA integration has the following four layers of code
- pybind layer: This is the bottom layer containing various pybind APIs as the foundation. This part resides in the xla repo.
- bridge layer: built upon the pybind layer to implement the trace-once functionality. This layer and its corresponding unit test were previously in the pytorch repo. This PR (and the corresponding xla PR https://github.com/pytorch/xla/pull/4476) moves them to the xla repo.
- dynamo backend registration: this is a thin layer that registers 4 dynamo backends (training/inference/trace_once/trace_everytime). It remains in the pytorch repo.
- benchmark script: the torchbench.py script in dynamo is adapted so it can be used in the dynamo/TorchXLA integration. This one remains in the pytorch repo.

We think the new code organization is cleaner.

I'll wait for the xla PR to go in first before trying to merge this one.

Tests
1. run the unit tests moved to the xla repo
2. Test for inference:  `GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --backend=torchxla_trace_once --only resnet18`
3. Test for training: `GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=aot_torchxla_trace_once --only resnet18 --collect-outputs`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92601
Approved by: https://github.com/wconstab
2023-01-25 23:15:02 +00:00
b2f3ff6183 [Py3.11] Remove skip logic from vmap and forward_ad (#91825)
Depends on https://github.com/pytorch/pytorch/pull/91805

Fixes https://github.com/pytorch/pytorch/issues/85506
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91825
Approved by: https://github.com/albanD
2023-01-25 22:40:56 +00:00
f2f42e54ca Apply some std::move and param value fixups to aten (#92901)
I noticed a few perf issues in the latest ATen and decided to fixup a few other miscellaneous ones I noticed recently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92901
Approved by: https://github.com/ezyang
2023-01-25 21:06:51 +00:00
b073c09f7a Added keep_key option to Grouper (#92532)
Fixes https://github.com/pytorch/data/issues/256

The testing of this module is currently suboptimal in general. We should improve this in the future.

@ejguan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92532
Approved by: https://github.com/ejguan
2023-01-25 20:58:21 +00:00
63331a5fac Add --timing and --explain to CI runs (#92980)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92980
Approved by: https://github.com/msaroufim
2023-01-25 20:46:12 +00:00
63e47c68a6 [cpp] remove checks from embedding bag impl (#92982)
These checks incur an H2D sync on every embedding bag forward. Also, the equivalent python code for embedding_bag does not have them. Kill!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92982
Approved by: https://github.com/ezyang
2023-01-25 20:36:44 +00:00
99ced6482a Disable vml's abs and log1p (#92113)
I noticed that `torch.log1p` is ridiculously slow compared to `torch.log`
on CPU, and looking at the assembly it seems vsLog1p doesn't use any
vector instructions. I saw the same for abs, though AFAICT this is
dead code anyway as `abs` is implemented with `cpu_kernel_vec`.

Locally I see a 14x speedup in `torch.log1p`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92113
Approved by: https://github.com/jgong5
2023-01-25 20:09:40 +00:00
d4c8e37b85 Improve performance for unary kernels using vml (#91963)
This gives some speedups for kernels implemented with `at::vml`:
- Make vml ops serial and use `TensorIterator.for_each` for better parallism
with discontiguous tensors
- Reduce buffer size for discontiguous data to 8 KiB to increase chance of
fitting in L1d cache, but is still wide enough to utilize AVX-512.
- Avoid a copy if only one of input and output is discontiguous

There is no change for contiguous tensors, but I see significant speedup for
the following benchmarks:
```
import torch
a = torch.randn(2*10**6, device="cpu")
%timeit a.view(100, 20000)[:,::2].sqrt()
%timeit a.view(200, 10000)[::2].sqrt()
```
For discontiguous last dimension I see a 27x speedup and for discontiguous
batch dimension I see an 8x speedup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91963
Approved by: https://github.com/jgong5
2023-01-25 20:09:40 +00:00
0de81906cc Add get-job-id in get-workflow-job-id action (#93001)
IDs for composite workflows are really strange: both the calling step and the step in the composite workflow need an id, but when they're different, the calling step's id takes precedence.

Should fix test uploading problem
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93001
Approved by: https://github.com/huydhn
2023-01-25 19:44:52 +00:00
d354499faf adding some more missing ops to vmap (#92110)
removes some xfails that were a part of https://github.com/pytorch/functorch/issues/1009 and https://github.com/pytorch/functorch/issues/1087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92110
Approved by: https://github.com/zou3519
2023-01-25 19:43:12 +00:00
92fbb35bff Upload failures shouldn't fail a CI that passed tests (#92996)
This'll reduce some flakiness we've been seeing recently
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92996
Approved by: https://github.com/malfet, https://github.com/kit1980
2023-01-25 19:23:51 +00:00
cyy
e292ddff4e More clang-tidy fixes (#92944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92944
Approved by: https://github.com/Skylion007
2023-01-25 19:11:51 +00:00
4e67332677 Add few more tests to 3.11 smokechecks (#92946)
Namely:
- test_foreach
- test_schema_check
- test_weak

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92946
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/huydhn
2023-01-25 19:02:16 +00:00
b399007a07 Make TensorIterator give better error message for symbolic tensors (#92914)
This is one of the more common reasons to see
"RuntimeError: Cannot call sizes() on tensor with symbolic sizes/strides"

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92914
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2023-01-25 16:32:10 +00:00
c0ed0f22cd [FSDP] Fix no_sync(), use_orig_params=True, mixed precision, sharded (#92874)
When there is an original parameter with a 1D shape that is fully assigned to one rank, then `param.shape == view.shape` in `_use_unsharded_grad_views()`. In that case, we still want to check whether `param.dtype == view.dtype` and bypass as necessary.

The previous PR had an additional `and not self.uses_sharded_strategy` because the unit test did not require the check for sharded strategies, and I was conservatively adding a minimal fix. That was happenstance, because there was no 1D parameter fully assigned to one rank. Including the bias in the linear layer achieves that case, and removing the `and not self.uses_sharded_strategy` is necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92874
Approved by: https://github.com/zhaojuanmao
2023-01-25 14:47:37 +00:00
077e135ed6 add number of cuda retries into tracker (#92557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92557
Approved by: https://github.com/fegin, https://github.com/mrshenli
2023-01-25 14:44:34 +00:00
a6ac922eab Rename Canonical Aten IR to Core Aten IR (#92904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92904
Approved by: https://github.com/bdhirsh
2023-01-25 05:12:23 +00:00
e5fd7e6d8f Fix to use upsample_bicubic2d.vec decomp for dynamic shape support (#92854)
For the `crossvit_9_240` model - it works now with dynamo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92854
Approved by: https://github.com/ezyang
2023-01-25 05:08:02 +00:00
0fc2f9febb Disable torch_jit_fuser_te for dynamo CI (#92945)
Not clear what caused SIGIOT, but we need to get signal from other tests (and NNC+Dynamo is probably not the most important use case).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92945
Approved by: https://github.com/ezyang, https://github.com/huydhn
2023-01-25 05:02:59 +00:00
2ee94633a1 Change ciflow/inductor to test inductor inference with dynamic shapes (#92771)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92771
Approved by: https://github.com/voznesenskym
2023-01-25 02:21:02 +00:00
f724ecbd52 Add dynamic shapes aot_eager to periodic (#92770)
This means it overlaps with ciflow/inductor, but I'm about
to change that soon.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92770
Approved by: https://github.com/voznesenskym, https://github.com/albanD, https://github.com/desertfire
2023-01-25 02:21:02 +00:00
9c487a4b91 Fix #92814: assertion error when explicitly provide out=None (#92873)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92873
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2023-01-25 02:20:53 +00:00
f180873fd5 Revert "[CI] Disable regularly failing CUDA 11.8 windows periodic tests (#92902)"
This reverts commit bcbc522d1f76892b89d9ffb9f581a744c959fbd7.

Reverted https://github.com/pytorch/pytorch/pull/92902 on behalf of https://github.com/atalman due to Fixed by reverting https://github.com/pytorch/pytorch/pull/91727
2023-01-25 01:39:03 +00:00
e45b566018 [inductor] skip CUDA tests under ASAN (#92883)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92883
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-01-25 01:29:39 +00:00
a3715efd8b Remove windows check for cmake to build Fused kernels (#91909)
# Summary
Add support for fused attention kernels (FlashAttention and memory-efficient attention) on Windows. Previously we could not do this because the fixes required C++17, but we have since updated the PyTorch standard.

This PR:
- Changes invocations of unsigned long to the fixed width integer type
- Adds in the #define FP16_SWITCH(COND, ...) which has been added to the flash_attention main branch
- Changes some macros used within the mem-efficient attention code in order to work around the VA_ARG discrepancy between clang/gcc and msvc. An alternative would be setting the global flag Zc:preprocessor.
- Selectively applies /Zc:lambda to only the mem-efficient sources since applying this globally caused quantization files to not compile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91909
Approved by: https://github.com/cpuhrsch
2023-01-25 01:21:12 +00:00
f0d09572b0 [CI] Rename TSAN job (#92929)
Underlying docker has actually been migrated from py3_7 to py3_9 as part of https://github.com/pytorch/pytorch/pull/92712 but I forgot to update the TSAN names.

I.e. this is a no-op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92929
Approved by: https://github.com/clee2000, https://github.com/weiwangmeta, https://github.com/osalpekar
2023-01-25 00:54:36 +00:00
01f1097770 Revert "Fix to use upsample_bicubic2d.vec decomp for dynamic shape support (#92854)"
This reverts commit d49187bf8882dabfb307de4f3f6a9031426e677a.

Reverted https://github.com/pytorch/pytorch/pull/92854 on behalf of https://github.com/malfet due to Resulted in 50+% flaky failures in dynamo, reverting
2023-01-25 00:10:14 +00:00
54bbb446ca lru_cache shape expansion (20-25% speedup on local bench) (#92860)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92860
Approved by: https://github.com/ezyang, https://github.com/Chillee
2023-01-25 00:01:55 +00:00
78caa7921c [dynamo] Allow DynamicShapeVariable as predicate to cond() op. (#92864)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92864
Approved by: https://github.com/tugsbayasgalan
2023-01-24 23:26:30 +00:00
2503a4a7c6 Fix MPI backend PG initialization (#92847)
Fixes #92573

Add test to check that all default backends can be initialized to prevent the above from regressing in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92847
Approved by: https://github.com/rohan-varma
2023-01-24 23:24:41 +00:00
18d5288010 Add support for Generator=None in inductor (#92851)
Fix for https://github.com/pytorch/pytorch/issues/92633. We still don't support generators, but we no longer fail when None is passed in for the generator argument. Generators are sparsely used, so we should defer adding full support until it's necessary.
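
A hedged sketch of the case that now works under Inductor (the function body is illustrative):

```python
import torch

@torch.compile
def fn(x):
    # an explicit generator=None no longer trips up the lowering
    return x + torch.randn(x.shape, generator=None, device=x.device)

print(fn(torch.ones(4)))
```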

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92851
Approved by: https://github.com/ngimel
2023-01-24 23:22:38 +00:00
f3266015a4 Add _StorageMeta metaclass for StorageBase (#92648)
Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92648
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-01-24 23:08:23 +00:00
4d9920fa9c Move PyInterpreter code in python_variable.cpp to its own files (#92647)
Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92647
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-01-24 23:08:23 +00:00
4bc0491752 Add USE_FLASH_ATTENTION flag to setup.py (#92903)
# Summary
Adds documentation to setup.py for USE_FLASH_ATTENTION=0 disabling to decrease build times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92903
Approved by: https://github.com/cpuhrsch, https://github.com/bdhirsh
2023-01-24 22:59:51 +00:00
bf1ff4918f Fix Dockerfile conda install error for some shells (#92702)
The issue was first solved in [/pull/91371] for CI/CD, but the main Dockerfile in the repo root still has this issue for people trying to build a custom image manually for testing.
Without the fix, the build fails when installing miniconda:
```
#14 3.802 Preparing transaction: ...working... done
#14 4.087 Executing transaction: ...working... done
#14 5.713 /root/miniconda.sh: 438: /root/miniconda.sh: [[: not found
#14 5.713
#14 5.713 Installing * environment...
#14 5.713
#14 5.714 /root/miniconda.sh: 444: /root/miniconda.sh: [[: not found
#14 6.050
#14 6.050 CondaFileIOError: '/opt/conda/pkgs/envs/*/env.txt'. [Errno 2] No such
file or directory: '/opt/conda/pkgs/envs/*/env.txt'
#14 6.050
```

With the modification, I tested locally that the build succeeds with `make -f ./docker.Makefile`, as instructed in the README.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92702
Approved by: https://github.com/seemethere, https://github.com/malfet
2023-01-24 22:54:22 +00:00
b0f5e15c4c [CI] Enable Python-3.11 in smoke CPU testing (#92787)
Add bionic-py3.11-clang9 and move vulkan testing to it. Test only fx and jit for the time being (will add more in follow-up PRs).

Do not install numba, as it's not yet available for python-3.11.

Change the installed mkl version, as the one installed before was incompatible with numpy.

TODO: Remove `-c malfet` when required packages become available on default conda channel, namely `numpy`, `setuptools`, `coverage`, `mypy-exensions`, `typing-extensions`, `psutils` and `pyyaml`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92787
Approved by: https://github.com/albanD
2023-01-24 22:34:35 +00:00
6c7e6d9689 Make torch.fx compatible with Python-3.11 (#92895)
In 3.11 the bytecode size is not constant, so in order to get from `f_lasti` to an opcode index, one needs to search for the closest offset in the disassembled instructions.

Update `_patch_function` to construct code with all the properties that exist in the 3.11 runtime.
Update `_torchscript_schema_to_signature` to mark the `from` named arg as positional-only, as it is a reserved keyword in Python and is checked as such by the `inspect` package in 3.11.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92895
Approved by: https://github.com/albanD
2023-01-24 22:11:50 +00:00
a2da0a0b02 Revert "Add test tracking operators without decompositions (#90887)"
This reverts commit 2740daf7014f34e7c0305694cfb8d51cc6712d2a.

Reverted https://github.com/pytorch/pytorch/pull/90887 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. We reverted https://github.com/pytorch/pytorch/pull/70988 in acdd462b1a and this test starts to fail. There is probably a dependency between the twos
2023-01-24 21:56:58 +00:00
e665f03ad8 Fix dynamo func defaults handling for torch.device, size, dtype (#92880)
Previously, these torch types were not handled in the wrap_bound_arg
handler.

Add a unit test and verify it is fixed.

Fixes #91084
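
A hedged sketch of the kind of function this fixes, assuming `torch.compile` as the entry point; the default values below are illustrative:

```python
import torch

def fn(x, device=torch.device("cpu"), dtype=torch.float32):
    # torch.device / torch.dtype default values are now wrapped correctly by dynamo
    return x.to(device=device, dtype=dtype) + 1

compiled = torch.compile(fn)
print(compiled(torch.ones(3, dtype=torch.float64)))
```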

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92880
Approved by: https://github.com/ezyang
2023-01-24 21:50:43 +00:00
d49187bf88 Fix to use upsample_bicubic2d.vec decomp for dynamic shape support (#92854)
For the `crossvit_9_240` model - it works now with dynamo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92854
Approved by: https://github.com/ezyang
2023-01-24 21:36:17 +00:00
9b23fd378f Revert "Logcumsumexp for complex in CPU and CUDA (#90847)"
This reverts commit 64985123e48cc9a78545780b23071b445ebddc45.

Reverted https://github.com/pytorch/pytorch/pull/90847 on behalf of https://github.com/malfet due to Reverting to decrease build time, let's discuss the alternatives here
2023-01-24 20:49:08 +00:00
acdd462b1a Revert "Remove deprecated torch.symeig (#70988)"
This reverts commit d70ed68162521341060b06985620cdbef04a8fa9.

Reverted https://github.com/pytorch/pytorch/pull/70988 on behalf of https://github.com/kit1980 due to Failing XLA tests, forward fix unsuccessful
2023-01-24 19:03:40 +00:00
16f7db5287 Don't fail-fast for docs, only push on schedule and some tags (#92853)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92853
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/ZainRizvi
2023-01-24 18:48:06 +00:00
d4a35e21c0 Revert "[MacOS] Explicitly use cmake from cloned conda environment (#92737)"
This reverts commit b6f41e2bcd69e3e38109232f6684063ab828473d.

Reverted https://github.com/pytorch/pytorch/pull/92737 on behalf of https://github.com/huydhn due to This does not work abe64889b8, still have no idea why this is flaky, need rework
2023-01-24 18:34:39 +00:00
550f98332b [fix] vmap and anomaly mode interaction (#92672)
Fixes https://github.com/pytorch/functorch/issues/1049

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92672
Approved by: https://github.com/albanD
2023-01-24 18:12:52 +00:00
fb46d3e138 Run all of the timm models shards in the periodic (#92900)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92900
Approved by: https://github.com/bdhirsh, https://github.com/atalman
2023-01-24 17:56:20 +00:00
2740daf701 Add test tracking operators without decompositions (#90887)
This test inspects the dispatcher directly, so it captures operators without
`OpInfo`, including internal helper operators and backward operators that might
appear in a trace.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90887
Approved by: https://github.com/ezyang
2023-01-24 17:38:27 +00:00
5f09f76b5d Revert "Revert 61cdae0ce58bcbe048b143356fd9ded821225657 to fix CI (#92631)"
This reverts commit 0998ec1e27b9d929275d43d324dd9342409f705c.

Reverted https://github.com/pytorch/pytorch/pull/92631 on behalf of https://github.com/huydhn due to Windows G5 runner has been switched to non-ephemeral. All tests pass on https://github.com/pytorch/pytorch/pull/92876
2023-01-24 17:31:13 +00:00
a817008bb3 Fix #92108 (#92870)
You can easily test this by adding

```
@patch.object(config.triton, "convolution", "triton")
```

to test_convolution1 but it takes a long time to autotune so
I don't want to add it to the unit tests.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92870
Approved by: https://github.com/albanD
2023-01-24 17:22:52 +00:00
9e56378ef2 Add documentation for DCP. (#92813)
This populates the website with some basic documentation.

It's far from ideal, as it should include some basic usage examples.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92813
Approved by: https://github.com/wz337
2023-01-24 17:21:51 +00:00
bcbc522d1f [CI] Disable regularly failing CUDA 11.8 windows periodic tests (#92902)
These periodic tests were introduced in https://github.com/pytorch/pytorch/pull/92137

They've been consistently failing on trunk, so disabling them until they're fixed. Sample failures: d8aa68c683
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92902
Approved by: https://github.com/malfet
2023-01-24 17:20:40 +00:00
68a40a47a0 [Inductor] Lower aten.tan (#92837)
Related #92047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92837
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-01-24 16:35:40 +00:00
19c9b09449 Replace IndexingDiv with FloorDiv in Inductor (#92878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92878
Approved by: https://github.com/ezyang
2023-01-24 15:06:22 +00:00
c0327eb463 Some more inductor fixes for symbolic shapes (#92867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92867
Approved by: https://github.com/ezyang
2023-01-24 15:05:46 +00:00
0fe5367058 [Vulkan] implement abs (#87414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87414
Approved by: https://github.com/albanD
2023-01-24 14:20:34 +00:00
7265f60ad0 Regularize mask handling for attn_mask and key_padding_mask (#92733)
Summary:
Regularize mask handling for attn_mask and key_padding_mask
* Update documentation to remove reference to byte masks (which were deprecated long ago)
* Introduce check and warn about deprecation if attn_mask and key_padding_mask types mismatch
* Convert all masks to float before combining
* Combine by adding
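
A hedged usage sketch of the guidance above (shapes and values are illustrative): keep both masks the same type to avoid the new mismatch deprecation warning.

```python
import torch

mha = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
q = torch.randn(2, 5, 8)

# keep attn_mask and key_padding_mask the same type (both bool or both float)
attn_mask = torch.zeros(5, 5, dtype=torch.bool)          # True = position not allowed
key_padding_mask = torch.zeros(2, 5, dtype=torch.bool)   # True = key is padding
out, _ = mha(q, q, q, attn_mask=attn_mask, key_padding_mask=key_padding_mask)
print(out.shape)   # torch.Size([2, 5, 8])
```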

Test Plan: sandcastle & github CI

Differential Revision: D42653215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92733
Approved by: https://github.com/ngimel, https://github.com/drisspg
2023-01-24 14:12:05 +00:00
a2e1365248 [functorch] Remove not needed named member polyfill functions (#92613)
The `nn.Module` APIs already support the `remove_duplicate` argument. It's time to retire these no-longer-needed polyfill functions. They are identical to the `nn.Module.named_parameters` and `nn.Module.named_buffers` methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92613
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-01-24 13:15:32 +00:00
d8aa68c683 make sure that our error handling runs with the GIL enabled (#92848)
Fixes https://github.com/pytorch/pytorch/issues/92684

I checked the other use cases of this API and they never release the GIL.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92848
Approved by: https://github.com/ngimel
2023-01-24 09:30:42 +00:00
abe64889b8 [inductor] make conv2d tests pass (#91952)
```
TORCHDYNAMO_DYNAMIC_SHAPES=1 AOT_DYNAMIC_SHAPES=1 python -m pytest -v test/inductor/test_torchinductor.py -k test_conv2d
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91952
Approved by: https://github.com/ezyang
2023-01-24 09:08:34 +00:00
cyy
045d1de02d Fix some code issues (#92760)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92760
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-01-24 08:19:03 +00:00
3f64c96655 asarray: Add support for NumPy scalars (#90914)
Follow up from: Quansight-Labs/numpy_pytorch_interop#3

This PR adds support for NumPy scalars for `torch.asarray`.

**Before:** treats the scalar as an object that implements the buffer protocol. Thus, interprets the data as the default data type (`float32`)

```python
>>> torch.asarray(numpy.float64(0.5))
tensor([0.0000, 1.7500])
```

**After:** identifies the NumPy scalar, and does the "right" thing. i.e. creates a 0-dimensional tensor from the NumPy array that doesn't share its memory

```python
>>> torch.asarray(numpy.float64(0.5))
tensor(0.5000, dtype=torch.float64)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90914
Approved by: https://github.com/lezcano, https://github.com/mruberry
2023-01-24 08:09:30 +00:00
cc4fbd1077 remove default implementation for RoIAlignRotatedOp::RunOnDevice (#92885)
Summary: the default implementation is not needed as there are template specialization defined in the cpp and cu files.

Test Plan: CI

Differential Revision: D42697874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92885
Approved by: https://github.com/davidberard98
2023-01-24 07:20:37 +00:00
70f4b3551c Add Hook to store arbitrary python objects that are copied over in tls (#89169)
For the cudagraphs implementation, we would like to reuse objects that are defined in python across the forward and backward. The backward is run in a different thread, so to handle this we add an api for copying over arbitrary python objects in pytorch's thread local state, in the same way that C++ objects are copied over currently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89169
Approved by: https://github.com/albanD
2023-01-24 05:24:57 +00:00
118a6dd1f1 [vision hash update] update the pinned vision hash (#92875)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92875
Approved by: https://github.com/pytorchbot
2023-01-24 05:23:43 +00:00
b6f41e2bcd [MacOS] Explicitly use cmake from cloned conda environment (#92737)
My first attempt to fix the `Library not loaded: @rpath/libzstd.1.dylib` issue on MacOS M1 in https://github.com/pytorch/pytorch/pull/91142 added some additional logging around the flaky error but didn't fix the issue, as I still see occurrences recently, for example

* e4d83d54a6

Looking at the log, I can see that:

* CMAKE_EXEC correctly points to `CMAKE_EXEC=/Users/ec2-user/runner/_work/_temp/conda_environment_3971491892/bin/cmake`
* The library is there under the executable rpath
```
ls -la /Users/ec2-user/runner/_work/_temp/conda_environment_3971491892/bin/../lib
...
2023-01-20T23:22:03.9761370Z -rwxr-xr-x    2 ec2-user  staff    737776 Apr 22  2022 libzstd.1.5.2.dylib
2023-01-20T23:22:03.9761630Z lrwxr-xr-x    1 ec2-user  staff        19 Jan 20 22:47 libzstd.1.dylib -> libzstd.1.5.2.dylib
...
```

Then calling cmake after that suddenly uses the wrong cmake from miniconda package cache:

```
2023-01-20T23:22:04.0636880Z + cmake ..
2023-01-20T23:22:04.1924790Z dyld[85763]: Library not loaded: @rpath/libzstd.1.dylib
2023-01-20T23:22:04.1925540Z   Referenced from: /Users/ec2-user/runner/_work/_temp/miniconda/pkgs/cmake-3.22.1-hae769c0_0/bin/cmake
```

This is weird, so my second attempt is more explicit and uses the correct cmake executable via `CMAKE_EXEC`. Maybe something manipulates the global path in between, making `/Users/ec2-user/runner/_work/_temp/miniconda/pkgs/cmake-3.22.1-hae769c0_0/bin/cmake` come first in the PATH.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92737
Approved by: https://github.com/ZainRizvi
2023-01-24 05:14:14 +00:00
0bf7506051 [CUDA] Drop CUDA < 11.0 test flags (#92605)
Follow-up of #89582 to drop flags like `CUDA11OrLater` in tests. Note that in some places it appears that `TEST_WITH_ROCM` is _implicitly_ guarded against via the `CUDA11OrLater` version check, based on my best-guess of how `torch.version.cuda` would behave in ROCM builds, so I've added `not TEST_WITH_ROCM` in cases where ROCM wasn't previously explicitly allowed.

CC @ptrblck @malfet @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92605
Approved by: https://github.com/ngimel
2023-01-24 04:34:06 +00:00
a799acec8b Allow cublas an cudnn to be in different nvidia folders (#92122)
Fixes #92096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92122
Approved by: https://github.com/malfet
2023-01-24 04:11:44 +00:00
eb32bb2ca6 [Executorch][Quantization] Backend Config for functional embedding (#92700)
Summary: title

Test Plan: ci

Differential Revision: D42643985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92700
Approved by: https://github.com/jerryzh168
2023-01-24 03:12:56 +00:00
9613395e2f [SDPA] Integrating the main branch of flash_attn instead of cutlass (#91994)
### Background

Early on in this process of integrating the FlashAttention code into core we were speaking with Tri and we came to the conclusion that the main branch of Flash Attention wasn't suitable for integration.  We instead went with a [refactored version](https://github.com/HazyResearch/flash-attention/tree/cutlass) that more heavily depended upon cutlass.

That is the current version of FlashAttention in PyTorch. However there are some limitations with that branch.
- No backward support for SDPA
- Not as performant for some large MHA setups.

### Sumary
This PR pulls in the latest version of the main branch of  [FlashAttention](https://github.com/HazyResearch/flash-attention/tree/main). It does not register the backward for the aten function SDPA_flash_attn. That will be done in a follow up PR.

### Changeset
A few changes were made to the original code for PyTorch.
- Flattened one layer of folder structure. (This is to match the existing FlashAttention-in-core structure.)
- Remove the return_softmax param and change the mha_fwd signature. Since the in-core public SDPA function does not support need_weights, we remove this argument.
- Add a lot of `#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >=530` guards around sections of code that will not compile for architectures at or below 520. Most of these blocks of code are half-based asm or _hmul2 operations. An example update:
```cpp
    #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >=530
        float f;
        asm volatile("cvt.f32.f16 %0, %1;\n" : "=f"(f) : "h"(h));
        return f;
    #else
        assert(false);
        return 0;
    #endif
}
```
- Remove any blocksparse functions and files, and comment out utility functions that are used in the blocksparse kernels written for FlashAttention, since we did not pull in those functions.
- Update gemm_cl  in **/gemm.h to:
```  c++
#if defined(__CUDA_ARCH__) &&  __CUDA_ARCH__ >= 800
    using InstructionShape = cutlass::gemm::GemmShape<16, 8, 16>;
#elif defined(__CUDA_ARCH__)  && __CUDA_ARCH__ >= 750
    using InstructionShape = cutlass::gemm::GemmShape<16, 8, 8>;
#else
    assert(0);
    // THIS IS NOT CORRECT BUT THE ASSERT WILL STOP THIS
    using InstructionShape = cutlass::gemm::GemmShape<16, 8, 8>;
    // TD [2022-06-02] We don't support Volta (SM70) yet.
#endif
```
### Reasoning:
FlashAttention is only designed to run on GPUs that support sm 7.5 or later. However, PyTorch is generally built and released using `TORCH_CUDA_ARCH_LIST=5.2,..,8.6`. This means that the source code must be compilable for these lower archs even if it is never run. But how are we sure that it won't be run? That should be handled by the runtime dispatch mechanism, specifically here: [check_arch](d70ed68162/aten/src/ATen/native/transformers/cuda/sdp_utils.h (L308))

There is, however, one edge case for building from source:
The user specifies TORCH_CUDA_ARCH_LIST={something less than 7.5} while running on a GPU that is >= 7.5. This will cause the runtime dispatcher to think it is okay to run FlashAttention even though the compiled code is bogus.
I tested this with arch=5.3 on an A100 and got the following result: `RuntimeError: CUDA error: no kernel image is available for execution on the device`, coming from torch.rand.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91994
Approved by: https://github.com/cpuhrsch
2023-01-24 03:11:46 +00:00
1c30844eaa where() function added as a Tensor method as well (#92849)
Fixes #88470

I added the "method" keyword in `aten/src/ATen/native/native_functions.yaml` for the function `where` with Scalar Overload.
This way, you can now use `Tensor.where()` with a scalar parameter the same way `torch.where()` can.
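
A minimal sketch of the new behavior (values here are illustrative, not taken from the PR):

```python
import torch

x = torch.tensor([1.0, -2.0, 3.0])
cond = x > 0

# The method form with a scalar `other` now mirrors the functional form.
assert torch.equal(x.where(cond, 0.0), torch.where(cond, x, 0.0))
```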

I added a test in `test/test_torch.py` as requested.
It uses the `where()` method on a tensor and then checks it has the same results as the `torch.where()` function.
The test is roughly the same as the one provided by the author of the issue.

PS: this is the second PR I have made to resolve this issue; the first one was #92747. I had trouble with commit signatures there, so it was closed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92849
Approved by: https://github.com/albanD
2023-01-24 03:09:33 +00:00
fb980581a7 Revert #92688 and #92348 (aot autograd explicitly errors on double backward) (#92863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92863
Approved by: https://github.com/eellison
2023-01-24 03:04:04 +00:00
397b1a3da0 Remove unnecessary includes from python_variable.cpp (#92839)
Follow-up from #92647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92839
Approved by: https://github.com/Skylion007
2023-01-24 02:59:08 +00:00
8c8cd9539d Add missing moves to torch autograd (#92772)
Applies std::move to some additional opportunities in torch/csrc/autograd that were found via static analysis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92772
Approved by: https://github.com/ezyang
2023-01-24 02:01:52 +00:00
2a8669c54c ci: Increase timeout for linux binary builds (#92859)
Not entirely sure why conda builds would take 3 hours but failure from https://github.com/pytorch/pytorch/actions/runs/3984411372/jobs/6842256518 seems to indicate that this isn't an issue with the build itself but rather the time limit.

We should _probably_ do an investigation as to why the conda build is taking 3+ hours on a 12 core machine but that's a problem for a different day.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92859
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet
2023-01-24 01:20:21 +00:00
402c6d4299 Add Meta backend into tensor type strings (#92697)
Add Meta backend into tensor type strings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92697
Approved by: https://github.com/wconstab
2023-01-24 00:47:03 +00:00
dd4b46e010 [PT-D][Checkpoint]rename init() (#92829)
Fixes [#90346](https://github.com/pytorch/pytorch/issues/90346)

Rename init() method in planner to be set_up_planner() to avoid confusion between __init__() and init().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92829
Approved by: https://github.com/kumpera
2023-01-24 00:12:21 +00:00
7560660bd3 Update XLA pin (#92806)
This should allow re-enabling/reverting 3cc1031322
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92806
Approved by: https://github.com/kit1980, https://github.com/huydhn
2023-01-23 23:58:29 +00:00
57fe33403d [lint] clang-format register_prim_ops_fulljit.cpp (#92150)
Differential Revision: [D42502705](https://our.internmc.facebook.com/intern/diff/D42502705)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92150
Approved by: https://github.com/davidberard98
2023-01-23 23:51:13 +00:00
2cf03bbbab Revert "Run all of the timm models shards in the periodic (#92743)"
This reverts commit de69cedf98ae578f26add662c6387a43cf098066.

Reverted https://github.com/pytorch/pytorch/pull/92743 on behalf of https://github.com/atalman due to This needs to be landed after https://github.com/pytorch/pytorch/pull/92845 and https://github.com/pytorch/pytorch/pull/92846 are landed
2023-01-23 23:44:09 +00:00
d70ed68162 Remove deprecated torch.symeig (#70988)
The time has come to remove deprecated linear algebra related functions. This PR removes `torch.symeig`.
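
For readers hitting the removal, a minimal sketch of the documented replacement (`torch.linalg.eigh`):

```python
import torch

A = torch.randn(4, 4)
A = A + A.T  # symmetrize

# Old (removed): evals, evecs = torch.symeig(A, eigenvectors=True)
evals, evecs = torch.linalg.eigh(A)
```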

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70988
Approved by: https://github.com/lezcano, https://github.com/kit1980
2023-01-23 22:51:40 +00:00
dd25111250 [caffe2] Remove OperatorBase::newstyle_outputs_ (#67093)
`OperatorBase` maintains `output_tensors_` and `newstyle_outputs_`
which hold the same list of tensors except one is
`vector<caffe2::Tensor>` and the other is `List<at::Tensor>`.

This instead maintains only `output_tensors_` and handles the
conversions inside of export_caffe2_op_to_c10.

Differential Revision: [D32289811](https://our.internmc.facebook.com/intern/diff/D32289811)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67093
Approved by: https://github.com/dagitses, https://github.com/malfet
2023-01-23 22:41:59 +00:00
e137dcc2c8 Splitting #91254 into two PRs (#92748)
This one handles the xnumel=1 part, and introduces no performance
regression.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92748
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-01-23 22:02:14 +00:00
f7e1f3e8bb [PT-D][Checkpoint]Resolve issue #89501: Rename _nested_tensor.py to (#92705)
Fixes https://github.com/pytorch/pytorch/issues/90350.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92705
Approved by: https://github.com/kumpera
2023-01-23 21:45:11 +00:00
9bfd1357d5 Add CUDA 11.8 CI workflows (#92137)
Fixes #92090
CC @atalman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92137
Approved by: https://github.com/atalman
2023-01-23 21:03:53 +00:00
f333885704 Create pt2_bug_report.yml (#92773)
Moves the PT2 bug template from the dynamo repo; we want all user issues to be filed in the pytorch/pytorch repo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92773
Approved by: https://github.com/albanD
2023-01-23 21:00:49 +00:00
3643d5deed Move ASAN and ONNX to Python 3.9 and 3.8 (#92712)
As 3.7 is getting deprecated
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92712
Approved by: https://github.com/weiwangmeta, https://github.com/kit1980, https://github.com/seemethere
2023-01-23 20:55:57 +00:00
4e9539e002 [ONNX] Support ListConstruct in quantized_args (#92009)
Fixes #91303

quantized_args didn't support ListConstruct, leading to an error when a user uses a quantized op with list inputs, e.g. aten::cat. After this PR, the converter can successfully export the reported model, and the export passes the ONNX checker. However, ORT doesn't seem to support it, failing with the very same error as https://github.com/microsoft/onnxruntime/issues/12131.

Update:
I find that test_quantized_cat_when_concatinating_the_same_tensor is quite similar to the new case we have here; the only difference is whether the inputs are already quantized. Both ONNX graphs seem to be valid.
[test_quantized_cat_when_concatinating_the_same_tensor.zip](https://github.com/pytorch/pytorch/files/10396798/test_quantized_cat_when_concatinating_the_same_tensor.zip)
[test_quantized_list_of_inputs_with_cat.zip](https://github.com/pytorch/pytorch/files/10396799/test_quantized_list_of_inputs_with_cat.zip)

issue raised https://github.com/microsoft/onnxruntime/issues/14245
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92009
Approved by: https://github.com/BowenBao
2023-01-23 20:55:08 +00:00
df14650f0b [SDPA] Update SDPA API and make function Public (#92189)
# Summary
In preparation for the PT 2.0 launch, this PR updates SDPA's API and makes the function a public nn.functional function.

## Changes
### API
Previously the function signature was:
`scaled_dot_product_attention(query, key, value, attn_mask=None, need_attn_weights=False, dropout_p=0.0, is_causal=False) -> (Tensor, Tensor)`
Updated signature:
`scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False) -> Tensor`

This PR removes the need_attn_weights optional boolean variable and updates the return type to a singular tensor.
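
A minimal usage sketch of the updated signature (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 128, 64)  # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# A single attention output is returned; weights are no longer part of the output.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```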

#### Reasoning:
The main goal of this function is to provide an easy interface for users to call into fused attention kernels, e.g. FlashAttention. The fused kernels do not currently support arbitrary attn_mask or dropout, but there is a PR against mem-efficient attention to enable these. We want to have the API surface ready for when the backing kernels get updated.

The fused kernels save on memory usage by not materializing the attention weights, and it is unlikely that a fast fused implementation will enable this feature, so we are removing it.

Discussed with folks at FAIR/Xformers, who +1'd this API change.

#### Make function Public
In preparation for the PT 2.0 launch, we make the function public to start generating user feedback.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92189
Approved by: https://github.com/cpuhrsch
2023-01-23 20:50:46 +00:00
1237cf6b6c Allow direct Tensor constructor to return preexisting PyObject (#92754)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92754
Approved by: https://github.com/albanD, https://github.com/voznesenskym
2023-01-23 20:20:43 +00:00
e994e78397 Added vectorized horizontal flip path for channels last for NcHW (#91806)
## Description

- Added AVX2-only vectorization for the horizontal flip op applied to channels-last NCHW input, where **2 <= C * sizeof(dtype) <= 16**. The PR is a bit faster than Pillow and largely faster (x2 - x5) than Nightly.
- ~Still keeping `cpu_vflip_memcpy` code ([its PR](https://github.com/pytorch/pytorch/pull/89414) was reverted and is under investigation)~

## Benchmarks

```
[---------------------------------------------------------------------- Horizontal flip ----------------------------------------------------------------------]
                                                                  |  torch (2.0.0a0+gitf6d73f3) PR  |    Pillow (9.4.0)   |  torch (2.0.0a0+git4386f31) nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------
      channels=2, size=256, dtype=torch.uint8, mf=channels_last   |         31.859 (+-0.498)        |                     |          190.599 (+-7.579)
      channels=2, size=520, dtype=torch.uint8, mf=channels_last   |         60.648 (+-0.074)        |                     |          706.895 (+-11.219)
      channels=2, size=712, dtype=torch.uint8, mf=channels_last   |         95.994 (+-2.510)        |                     |         1340.685 (+-169.279)

      channels=3, size=256, dtype=torch.uint8, mf=channels_last   |         45.490 (+-0.108)        |   47.359 (+-0.942)  |          179.520 (+-2.916)
      channels=3, size=520, dtype=torch.uint8, mf=channels_last   |        146.802 (+-2.175)        |  174.201 (+-4.124)  |          707.765 (+-2.691)
      channels=3, size=712, dtype=torch.uint8, mf=channels_last   |        215.148 (+-0.925)        |  313.606 (+-3.972)  |         1346.678 (+-89.854)

      channels=3, size=256, dtype=torch.int8, mf=channels_last    |         43.618 (+-0.160)        |                     |          191.613 (+-16.252)
      channels=3, size=520, dtype=torch.int8, mf=channels_last    |        147.487 (+-0.691)        |                     |          755.020 (+-25.045)
      channels=3, size=712, dtype=torch.int8, mf=channels_last    |        216.687 (+-0.906)        |                     |         1314.854 (+-31.137)

      channels=4, size=256, dtype=torch.uint8, mf=channels_last   |         32.169 (+-0.092)        |                     |          195.415 (+-3.647)
      channels=4, size=520, dtype=torch.uint8, mf=channels_last   |         89.465 (+-0.154)        |                     |          776.459 (+-14.845)
      channels=4, size=712, dtype=torch.uint8, mf=channels_last   |        152.773 (+-0.610)        |                     |         1456.304 (+-45.280)

      channels=8, size=256, dtype=torch.uint8, mf=channels_last   |         43.444 (+-0.158)        |                     |          163.669 (+-4.580)
      channels=8, size=520, dtype=torch.uint8, mf=channels_last   |        151.285 (+-0.602)        |                     |          642.396 (+-13.500)
      channels=8, size=712, dtype=torch.uint8, mf=channels_last   |        278.471 (+-0.912)        |                     |         1205.472 (+-47.609)

      channels=16, size=256, dtype=torch.uint8, mf=channels_last  |         75.176 (+-0.188)        |                     |          181.278 (+-3.388)
      channels=16, size=520, dtype=torch.uint8, mf=channels_last  |        291.105 (+-1.163)        |                     |          716.906 (+-30.842)
      channels=16, size=712, dtype=torch.uint8, mf=channels_last  |        893.267 (+-10.899)       |                     |         1434.931 (+-40.399)

      channels=2, size=256, dtype=torch.int16, mf=channels_last   |         31.437 (+-0.143)        |                     |          195.299 (+-2.916)
      channels=2, size=520, dtype=torch.int16, mf=channels_last   |         89.834 (+-0.175)        |                     |          774.940 (+-8.638)
      channels=2, size=712, dtype=torch.int16, mf=channels_last   |        154.806 (+-0.550)        |                     |         1443.435 (+-37.799)

      channels=3, size=256, dtype=torch.int16, mf=channels_last   |         70.909 (+-0.146)        |                     |          195.347 (+-1.986)
      channels=3, size=520, dtype=torch.int16, mf=channels_last   |        212.998 (+-1.181)        |                     |          776.282 (+-15.598)
      channels=3, size=712, dtype=torch.int16, mf=channels_last   |        382.991 (+-0.968)        |                     |          1441.674 (+-9.873)

      channels=4, size=256, dtype=torch.int16, mf=channels_last   |         43.574 (+-0.157)        |                     |          163.176 (+-1.941)
      channels=4, size=520, dtype=torch.int16, mf=channels_last   |        151.289 (+-0.557)        |                     |          641.169 (+-9.457)
      channels=4, size=712, dtype=torch.int16, mf=channels_last   |        275.275 (+-0.874)        |                     |         1186.589 (+-12.063)

      channels=8, size=256, dtype=torch.int16, mf=channels_last   |         74.455 (+-0.292)        |                     |          181.191 (+-1.721)
      channels=8, size=520, dtype=torch.int16, mf=channels_last   |        289.591 (+-1.134)        |                     |          715.755 (+-2.368)
      channels=8, size=712, dtype=torch.int16, mf=channels_last   |        923.831 (+-68.807)       |                     |         1437.078 (+-14.649)

      channels=2, size=256, dtype=torch.int32, mf=channels_last   |         44.217 (+-0.203)        |                     |          163.011 (+-1.497)
      channels=2, size=520, dtype=torch.int32, mf=channels_last   |        150.920 (+-0.950)        |                     |          640.761 (+-1.882)
      channels=2, size=712, dtype=torch.int32, mf=channels_last   |        281.648 (+-1.163)        |                     |         1188.464 (+-10.374)

      channels=3, size=256, dtype=torch.int32, mf=channels_last   |        103.708 (+-0.517)        |                     |          165.001 (+-1.315)
      channels=3, size=520, dtype=torch.int32, mf=channels_last   |        409.785 (+-8.004)        |                     |          647.939 (+-11.431)
      channels=3, size=712, dtype=torch.int32, mf=channels_last   |        790.819 (+-16.471)       |                     |          1219.206 (+-9.503)

      channels=4, size=256, dtype=torch.int32, mf=channels_last   |         72.975 (+-0.155)        |                     |          181.298 (+-1.059)
      channels=4, size=520, dtype=torch.int32, mf=channels_last   |        291.584 (+-0.905)        |                     |          716.033 (+-4.824)
      channels=4, size=712, dtype=torch.int32, mf=channels_last   |        938.790 (+-15.930)       |                     |         1434.134 (+-15.060)

Times are in microseconds (us).
```

[Source](https://gist.github.com/vfdev-5/8e8c989d35835d7ab20567bff36632be#file-20230123-143303-pr_vs_nightly-md)

## Context:

Follow-up work to PRs : https://github.com/pytorch/pytorch/pull/88989, https://github.com/pytorch/pytorch/pull/89414 and https://github.com/pytorch/pytorch/pull/90013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91806
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-01-23 20:15:30 +00:00
a112814a7f Simplify retains grad hook implementation (#92604)
How the old retains_grad hooks was implemented:
- retains_grad hooks are stored on the autograd_meta, as entries in a vector
- upon registration, a wrapper hook CppFunctionTensorPreHook is created to wrap that vector, and then that wrapper hook is registered to the grad_fn, i.e., by appending it to a vector of retains_grad hooks on the grad_fn
- upon in-place, for the old grad_fn we set the retains_grad hook to nullptr, so that even though the old grad_fn still references the vector, the vector contains a single nullptr. For the new grad_fn, we create a new wrapper hook around the vector (storing the single retains_grad hook) on autograd_meta.

The new retains_grad hook implementation:
- we store std::function by value, and we store it on the grad_fn rather than the autograd_meta
- a single grad_fn can have multiple outputs, so it can potentially hold multiple retains_grad hooks. We use an unordered_map (previously a vector).
- on in-place we remove the hook from the old grad_fn and put it in the new grad_fn (a small implication of this change is that we now need access to both the old grad_fn and the new grad_fn; this isn't a problem)

Other details:
- CppFunctionTensorPreHook took a shared_ptr to vector of std::function. In our new implementation, we add a new wrapper hook CppFunctionSingleTensorPreHook, which takes a single std::function.
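
For context, a minimal sketch of the user-facing behavior these hooks implement (not part of the PR itself):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2          # non-leaf tensor
y.retain_grad()    # registers a retains_grad hook on y's grad_fn
y.sum().backward()
print(y.grad)      # populated thanks to the retained hook
```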

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92604
Approved by: https://github.com/albanD
2023-01-23 20:10:46 +00:00
71b1051230 [Docker] Factor GHCR push into its own step (#92832)
As I had a really hard time figuring out what is failing in https://github.com/pytorch/pytorch/actions/runs/3987520975/jobs/6837450121

Together with https://github.com/pytorch/pytorch/pull/92816, it will ensure that even if the GHCR upload fails, CI will continue to work.

Per @ZainRizvi suggestion added retry logic for the upload step

Test plan: push a temp change (0fe7f8c2ed) to validate that this portion of the workflow is actually doing the job.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92832
Approved by: https://github.com/weiwangmeta, https://github.com/ZainRizvi
2023-01-23 19:43:52 +00:00
9f381c9b7f sparse_sparse_matmul: simplify backward (#91712)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91712
Approved by: https://github.com/albanD
2023-01-23 19:24:28 +00:00
36ba2ce546 [BE]: remove old dataclasses install from CI (#92763)
Saw some places we missed some old requirements that are no longer necessary (dataclasses and future). Testing to see if all the CIs still work. We don't need dataclasses anymore now that we are on Python >= 3.7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92763
Approved by: https://github.com/ezyang
2023-01-23 18:23:44 +00:00
a43b55e135 A few usability improvements for the dynamo benchmarks. (#92713)
- `--diff_main` renamed to `--diff-branch BRANCH` and now works again.
- Summary table splits results per branch.
- CSV output now has a column with the branch name when run in this mode.

Added --progress flag so you can track how many models are going to be
run.

Example output:
```
$ python benchmarks/dynamo/torchbench.py  --quiet --performance --backend inductor --float16 --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt)   --filter 'alexnet|vgg16' --progress  --diff viable/strict
Running model 1/2
batch size: 1024
cuda eval  alexnet                             dynamo_bench_diff_branch   1.251x p=0.00
cuda eval  alexnet                             viable/strict              1.251x p=0.00
Running model 2/2
batch size: 128
cuda eval  vgg16                               dynamo_bench_diff_branch   1.344x p=0.00
cuda eval  vgg16                               viable/strict              1.342x p=0.00

Summary for tag=dynamo_bench_diff_branch:
speedup             gmean=1.30x mean=1.30x
abs_latency         gmean=24.09x mean=25.26x
compilation_latency mean=2.0 seconds
compression_ratio   mean=0.9x

Summary for tag=viable/strict:
speedup             gmean=1.30x mean=1.30x
abs_latency         gmean=24.11x mean=25.29x
compilation_latency mean=0.5 seconds
compression_ratio   mean=1.0x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92713
Approved by: https://github.com/jansel
2023-01-23 18:23:35 +00:00
d40a4540d6 Fix typo under docs directory (#92762)
This PR fixes typos and URLs (`http -> https`) in `rst` files under the `docs` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92762
Approved by: https://github.com/H-Huang
2023-01-23 18:07:22 +00:00
8f294f785f [FSDP][optim_state_dict] Fix the conditions to check non-parameter associated states (#92744)
If a state is not associated with any parameter, `FSDP.optim_state_dict` should still save it. The current implementation to determine whether a state is associated with a parameter is not completely correct and can cause `use_orig_params=True` to have extra states.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92744
Approved by: https://github.com/awgu
2023-01-23 17:40:50 +00:00
d90d92e733 Don't fail-fast Docker builds (#92816)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92816
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-01-23 17:31:30 +00:00
c0dd9b3b67 Revert "[Executorch][Quantization][BE] Refactor Choose Qparams (#92592)"
This reverts commit 59071ab1e71891d480ab77af0d619bc5e01094c2.

It breaks `quantization.jit.test_ondevice_quantization.TestOnDeviceDynamicPTQFinalize`, which is not run in OSS, but is mandatory for internal CI.
2023-01-23 09:13:02 -08:00
9c6433ce48 Revert "Move ASAN and ONNX to Python 3.9 and 3.8 (#92712)"
This reverts commit b5f614c4cd60b5169a8c6b7f9be59de54c25fe72.

Reverted https://github.com/pytorch/pytorch/pull/92712 on behalf of https://github.com/ezyang due to Docker build didn't succeed on master, rolling back so we can try again
2023-01-23 16:02:46 +00:00
2037746e8d [inductor] Rename aot_inductor_debug to aot_eager_decomp_partition (#92314)
Summary: To make the naming more explicit,
  aot eager + decomposition + min_cut partition

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92314
Approved by: https://github.com/mlazos
2023-01-23 15:56:48 +00:00
63d6ee7d02 [FSDP][Easy] Remove outdated comment (#92739)
After the recent refactoring to unify composable and wrapper FSDP, we pass `fully_sharded_module`, not `root_module`, for now. This PR removes the comment explaining why we previously passed in `root_module`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92739
Approved by: https://github.com/mrshenli
2023-01-23 15:52:49 +00:00
b88340ac72 [PT-D][Lint] Include nested directories to ufmt (#92779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92779
Approved by: https://github.com/mrshenli, https://github.com/Skylion007
2023-01-23 15:52:36 +00:00
afe6ea884f Revert "[BE][CI] rename .jenkins to .ci, add symlink (#92621)"
This reverts commit 8972a9fe6aa8be8f8035c83094ed371973bfbe73.

Reverted https://github.com/pytorch/pytorch/pull/92621 on behalf of https://github.com/atalman due to breaks shipit
2023-01-23 15:04:58 +00:00
5d66a418de Swap file size on BE platform (#92810)
Fixes #92808

This PR fixes SIGSEGV on a big-endian machine when reading pickle data.

The root cause is that `size`, which is read from a file, is not converted from little-endian to big-endian before it is used in a method. The fix is to byte-swap `size` on big-endian machines instead of `nbytes`.
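
A minimal Python sketch of the failure mode (the actual fix lives in the C++ unpickling code; values are illustrative):

```python
import struct
import sys

# A size field stored little-endian on disk must be read as little-endian
# regardless of the host byte order; a native-order read on a big-endian
# host yields a huge bogus value and leads to out-of-bounds access.
raw = b"\x10\x00\x00\x00\x00\x00\x00\x00"   # the value 16, little-endian
size_correct = struct.unpack("<Q", raw)[0]  # explicit little-endian: 16
size_native = struct.unpack("=Q", raw)[0]   # wrong on big-endian hosts
print(sys.byteorder, size_correct, size_native)
```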

I confirmed that, with this fix on the master branch, the program in the issue works without SIGSEGV and the test passes.

```
$ python test/test_autograd.py TestAutograd.test_pickle
.
----------------------------------------------------------------------
Ran 1 test in 0.010s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92810
Approved by: https://github.com/malfet
2023-01-23 15:02:38 +00:00
4a3fb7bcbc Make CI_SKIPS into a consolidated dict (#92769)
This makes it easier to add more configurations without causing a
thicket of if statements selecting the correct variable.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92769
Approved by: https://github.com/voznesenskym, https://github.com/desertfire
2023-01-23 14:57:18 +00:00
3cfd2fa1c7 Make --inductor imply --backend inductor (#92764)
This is to make some downstream code more uniform (can always ask args.backend for backend)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92764
Approved by: https://github.com/voznesenskym, https://github.com/desertfire
2023-01-23 14:57:18 +00:00
7ddcf4e0c3 Revert "[functorch] vmap: bitwise operators (#91971)"
This reverts commit e54f7b3edde356c97c99706942f4b32a5a5ba475.

Reverted https://github.com/pytorch/pytorch/pull/91971 on behalf of https://github.com/malfet due to Broke functorch bitwise, see e54f7b3edd
2023-01-23 14:52:16 +00:00
fa5be78de1 Cleanup get-workflow-job-id action (#92193)
To be landed a few days later than the rest of the changes.

As the workflow can never fail now, there is no need to retry it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92193
Approved by: https://github.com/kit1980, https://github.com/huydhn
2023-01-23 14:47:04 +00:00
b5f614c4cd Move ASAN and ONNX to Python 3.9 and 3.8 (#92712)
As 3.7 is getting deprecated
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92712
Approved by: https://github.com/weiwangmeta, https://github.com/kit1980, https://github.com/seemethere
2023-01-23 14:46:02 +00:00
8f3600b966 [RELAND] Add metadata coverage for unsafe_split and unsafe_split_with_sizes (#92802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92802
Approved by: https://github.com/soumith
2023-01-23 10:57:10 +00:00
53ef803705 Make torch.cond work with retracing (#92646)
We simplify the handling of branch submodules by only working with flattened input/output so that there is no need for adjusting in_spec and out_spec in the second round of tracing.
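
A minimal sketch of the flattened-input style this enables; the `torch.cond` entry point and exact signature shown here are assumptions based on the current public API, not necessarily what this PR exercised:

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

x = torch.randn(4)
# Branches take and return flat tensors, so no in_spec/out_spec adjustment is needed.
out = torch.cond(x.sum() > 0, true_fn, false_fn, (x,))
```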
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92646
Approved by: https://github.com/zhxchen17, https://github.com/voznesenskym
2023-01-23 09:36:10 +00:00
e54f7b3edd [functorch] vmap: bitwise operators (#91971)
Fixes https://github.com/pytorch/functorch/issues/1069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91971
Approved by: https://github.com/kshitij12345, https://github.com/Chillee
2023-01-23 09:03:13 +00:00
53bfba0d72 [inductor] run CPU and CUDA tests with dynamic shapes (#92667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92667
Approved by: https://github.com/ezyang
2023-01-23 08:54:31 +00:00
30876229a7 [mta] Backward of unary foreach functions (#89591)
as per title, this PR defines backward of those.

This doesn't implement forward-mode automatic differentiation as [the current codegen](a747326423/tools/autograd/gen_variable_type.py (L1513)) doesn't seem to handle `ArrayRef<Tensor>`.
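
A minimal sketch of what this enables, using `_foreach_exp` as an illustrative unary op:

```python
import torch

xs = [torch.randn(3, requires_grad=True) for _ in range(2)]
ys = torch._foreach_exp(xs)              # unary foreach op
sum(y.sum() for y in ys).backward()      # gradients now flow back to each input
print([x.grad is not None for x in xs])  # [True, True]
```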

Rel:
- https://github.com/pytorch/pytorch/issues/53796
- https://github.com/pytorch/pytorch/issues/58833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89591
Approved by: https://github.com/albanD
2023-01-23 08:28:06 +00:00
32b2d8009a check if multi_tensor_apply_kernel was called (#92077)
Replacing all the hard-coded counts of CUDA kernel launches with a check that `multi_tensor_apply_kernel` was called, keeping the dependency on the Kineto profiler there.

Rel: https://github.com/pytorch/pytorch/pull/91844#issuecomment-1379844523

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92077
Approved by: https://github.com/ngimel
2023-01-23 06:46:36 +00:00
b985c2ef4a [PT-D] Enable init ops for DTensor (#92651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92651
Approved by: https://github.com/wanchaol
2023-01-23 04:38:11 +00:00
20bf77f9bd Fixed virtualized import and typing rule (#92774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92774
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2023-01-22 22:19:40 +00:00
387d769156 [BE]: Replace string compares with more efficient cpp comparisons (#92765)
Replace C++ string comparisons with more efficient equality operators. The equality operators are not just more readable; they also allow short-circuiting for faster string equality checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92765
Approved by: https://github.com/ezyang
2023-01-22 21:40:19 +00:00
582485bf0f [BE] Use data() method when possible as it's safer and more readable (#92755)
Apply clang-tidy readability-data-pointer fixits. This essentially uses the data() method when possible instead of the less readable `&vec[0]` to get the address of the underlying backing storage. Not only is this more readable, it is also safer, as it lets you retrieve the pointer even when the std::vector or std::string is empty, whereas `&vec[0]` on an empty container is undefined behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92755
Approved by: https://github.com/ezyang
2023-01-22 20:05:41 +00:00
b847ac227f Fix typo in buckbuild.bzl (#92751)
accomodate -> accommodate

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92751
Approved by: https://github.com/Skylion007
2023-01-22 17:35:38 +00:00
c52567ec18 Switch CI exclusions to use exact match. (#92761)
Since the CI exclusions are hard-coded in our script, we might as well require them to match exactly. This solved some head scratching where I was like, "this model is not obviously excluded, why is it not showing up in CI."

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92761
Approved by: https://github.com/jansel
2023-01-22 17:10:20 +00:00
e57a694d77 Add some missing moves to torch jit passes (#92317)
Add some missing moves in torch/jit/passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92317
Approved by: https://github.com/ezyang
2023-01-22 16:33:08 +00:00
cfaa1bace3 A bunch of fixes for Inductor + dynamic shapes enablement (#92609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92609
Approved by: https://github.com/ezyang
2023-01-22 15:22:08 +00:00
2f6a975f25 Remove cffi dependency as it doesn't look like we're using it (#92738)
Maybe this will go horribly wrong in CI but works fine without it locally!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92738
Approved by: https://github.com/kit1980, https://github.com/seemethere
2023-01-22 15:03:52 +00:00
0d9de46d9c Revert "Add meta kernel coverage for aten.unsafe_split, aten.unsafe_chunk (#92608)"
This reverts commit 36e1f7bc2b1e399808173dacb9aa1ea8b89fbbbf.

Reverted https://github.com/pytorch/pytorch/pull/92608 on behalf of https://github.com/ezyang due to test_aot_autograd_symbolic_exhaustive_unsafe_split_cpu_float32 (main.TestEagerFusionOpInfoCPU) is now xpass
2023-01-22 13:57:31 +00:00
36e1f7bc2b Add meta kernel coverage for aten.unsafe_split, aten.unsafe_chunk (#92608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92608
Approved by: https://github.com/ngimel
2023-01-22 07:12:29 +00:00
6016e4c707 [quant][fx][refactor] Rename modules to named_modules (#92575)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92575
Approved by: https://github.com/jcaip
2023-01-22 04:53:03 +00:00
ed07070a11 Restore lint after PR 92637 (#92759)
https://github.com/pytorch/pytorch/pull/92637 broke lint, can't easily revert because of merge conflicts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92759
Approved by: https://github.com/ezyang
2023-01-22 04:03:04 +00:00
6bc62a6392 Revert "[inductor] run CPU and CUDA tests with dynamic shapes (#92667)"
This reverts commit 425e506ffe41fc9fd16a18175c992f9d01eef08b.

Reverted https://github.com/pytorch/pytorch/pull/92667 on behalf of https://github.com/kit1980 due to test_topk_dynamic_shapes_cpu failing after this PR
2023-01-22 03:43:57 +00:00
93e71cc2f5 Add helpers for running tests and then putting them in a CSV (#92642)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92642
Approved by: https://github.com/albanD
2023-01-22 02:00:39 +00:00
756acd3fa1 Guard solve behind mod for symbolic shapes (#92597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92597
Approved by: https://github.com/ezyang
2023-01-22 00:29:56 +00:00
363ca57d02 Remove is_aot_autograd_safe_to_run (#91927)
This should be alright to remove now, because we:

1) Support LSTM
2) AOT_Autograd can cover its own mutation detection

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91927
Approved by: https://github.com/Chillee, https://github.com/bdhirsh
2023-01-21 23:54:48 +00:00
fb776a2df1 Fix mistaken script merge (by me) (#92756)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92756
Approved by: https://github.com/Chillee
2023-01-21 22:19:02 +00:00
425e506ffe [inductor] run CPU and CUDA tests with dynamic shapes (#92667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92667
Approved by: https://github.com/ezyang
2023-01-21 22:03:41 +00:00
5c4f0fd72c Change convolution to use symbolic shapes for propagation (#92397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92397
Approved by: https://github.com/ezyang
2023-01-21 21:54:24 +00:00
97342ae04b Fix python tensor hooks behavior on inplace (#92734)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92734
Approved by: https://github.com/albanD
2023-01-21 21:32:37 +00:00
de69cedf98 Run all of the timm models shards in the periodic (#92743)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92743
Approved by: https://github.com/kit1980
2023-01-21 18:39:17 +00:00
bea0b5ba73 [BE] Delete unused docker configs (#92711)
CUDA-10.2 is long gone and CUDA-11.3+clang build is replaced by cuda-11.6+clang10 jammy build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92711
Approved by: https://github.com/weiwangmeta
2023-01-21 16:42:28 +00:00
020c0d5895 Add debugability comments to DDPOptimizer (#89802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89802
Approved by: https://github.com/davidberard98
2023-01-21 15:07:28 +00:00
5778c04a15 Add --timing flag, phase timing to @dynamo_timed (#92637)
Ex output:
```
 TIMING:
 entire_frame_compile:8.574629999999999
 backend_compile:5.26806
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92637
Approved by: https://github.com/ezyang
2023-01-21 10:52:13 +00:00
27bf879b8c Forward fix: restore sebotnet33ts_256 aot_eager skip (#92741)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92741
Approved by: https://github.com/kit1980
2023-01-21 08:10:23 +00:00
3cc1031322 Mark XLA Linux jobs as unstable temporarily (#92634)
To be reverted once the issue is mitigated https://hud.pytorch.org/failure/%5B%20%20FAILED%20%20%5D%20AtenXlaTensorTest.TestFrobeniusNormInDims

Caused by https://github.com/pytorch/pytorch/pull/81763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92634
Approved by: https://github.com/ZainRizvi
2023-01-21 06:31:19 +00:00
cyy
e4d81a9ec9 fix various pointer issues (#90651)
Fix some issues found by static analyser

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90651
Approved by: https://github.com/Skylion007
2023-01-21 06:26:41 +00:00
0ab4ab9f8d [Dynamo] Fix calling UserDefinedObject.func should pass self object (#92050)
Fixes #90834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92050
Approved by: https://github.com/jansel
2023-01-21 05:47:01 +00:00
0d870b50d3 [optim][nadam] group tensors in foreach, make it default (#92715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92715
Approved by: https://github.com/albanD
2023-01-21 05:43:37 +00:00
9ccf9362c2 [optim][rprop] default to foreach when CUDA + differentiable=False (#92728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92728
Approved by: https://github.com/albanD
2023-01-21 05:31:22 +00:00
c628654724 [optim][rmsprop] default to foreach when CUDA + differentiable=False (#92727)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92727
Approved by: https://github.com/albanD
2023-01-21 05:31:22 +00:00
7277247a8c [optim][radam] default to foreach when CUDA + differentiable=False (#92726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92726
Approved by: https://github.com/albanD
2023-01-21 05:31:22 +00:00
9f356568ab [optim][asgd] default to foreach when CUDA + differentiable=False (#92724)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92724
Approved by: https://github.com/albanD
2023-01-21 05:31:22 +00:00
30bda6b12b [optim][adamax] default to foreach when CUDA + differentiable=False (#92723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92723
Approved by: https://github.com/albanD
2023-01-21 05:31:22 +00:00
9b4a778420 [optim][adagrad] default to foreach when CUDA + differentiable=False (#92716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92716
Approved by: https://github.com/albanD
2023-01-21 05:31:22 +00:00
6f1727b288 Print aot graphs if user specifies aot graph env vars (#92720)
When AOT logging was integrated with the TorchInductor trace, the ability to print graphs to the console when the user specified any of the env vars was removed (in favor of using TORCH_COMPILE_DEBUG). This PR restores that behavior by checking whether the user set any of the AOT debug variables *before* setting up the remainder of the logging, and adding a stream to stdout if any of those env vars are set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92720
Approved by: https://github.com/Chillee
2023-01-21 04:46:35 +00:00
c0fe41f983 Use SymBool for is_contiguous computation (#92229)
This changes TensorImpl to store SymBool instead of bool. However, it doesn't actually compute these quantities symbolically (outside of some top-level disjunctions). The purpose of this PR is to make it easier to diagnose performance problems in the next PR, as after this change we can switch to guardless implementations without modifying TensorImpl.h.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92229
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-01-21 04:01:00 +00:00
011df6630c [vision hash update] update the pinned vision hash (#92732)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92732
Approved by: https://github.com/pytorchbot
2023-01-21 03:42:12 +00:00
d2728bb6a7 [functorch] add is_any_true (#92686)
Adds `is_any_true` similar to `is_all_true` (https://github.com/pytorch/pytorch/pull/89097/files)

This would unblock https://github.com/pytorch/functorch/issues/1049
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92686
Approved by: https://github.com/Chillee
2023-01-21 03:36:05 +00:00
e6a8267cf5 [pt2.0/inductor] Fix race in cache dir across ranks on the same host (#92664)
Summary:
It looks like we have a race in the cache directory for triton codegen when we have multiple processes on the same host:
1. Rank A and B cannot find the code in cache (/tmp/uid/triton/cache) and start compilation separately
2. Most of the time the codegen is the same, but rarely it may produce different LLIR and a different shared-memory size (in our case 544 and 2560; both are valid for the LLIR/PTX generated). See repro D42584580
3. They both write the compiled .so and metadata into the local cache folder, with the same directory name (same hash, without considering device id). There will be a race here even if they grab the file lock, because the lock covers each file but not the entire transaction.
4. We then load the .so and metadata back from the files. What can happen is that we load the .so from rank A and the shared-memory size from rank B, and they mismatch.
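
A minimal sketch of the per-process cache-directory idea (names are illustrative, not the actual inductor helpers):

```python
import getpass
import os

# Keying the cache path on the pid means concurrent ranks on the same host
# never race on the same cache entry.
def cache_dir() -> str:
    return os.path.join(
        "/tmp", getpass.getuser(), str(os.getpid()), "triton", "cache"
    )

os.makedirs(cache_dir(), exist_ok=True)
```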

Test Plan:
Run the faulty program to double check
```
[trainer5]: cache dir: /tmp/root/4951/triton/cache/198ef4405d2e525acd20d5c2d01ad099
[trainer1]: cache dir: /tmp/root/4947/triton/cache/198ef4405d2e525acd20d5c2d01ad099
```

Differential Revision: D42619405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92664
Approved by: https://github.com/bertmaher, https://github.com/ngimel, https://github.com/jansel
2023-01-21 03:22:12 +00:00
8972a9fe6a [BE][CI] rename .jenkins to .ci, add symlink (#92621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92621
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2023-01-21 02:40:18 +00:00
09eb4c2a70 Revert "Update Module.__setattr__ to respect property setters (#92044)"
This reverts commit 0c8f4b58934cbfe4a52d261c914ff8b2632c4f5c.

Reverted https://github.com/pytorch/pytorch/pull/92044 on behalf of https://github.com/saitcakmak due to Caused regressions in a Meta internal model
2023-01-21 02:39:21 +00:00
cyy
85851b1e8f remove useless clang-tidy suppression (#92287)
- remove NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init)
- remove NOLINTNEXTLINE(performance-move-const-arg)
- remove NOLINTNEXTLINE(performance-no-automatic-move)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92287
Approved by: https://github.com/albanD
2023-01-21 02:33:24 +00:00
5489b32337 Add periodic job to test aot_eager on benchmarks suite. (#92695)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92695
Approved by: https://github.com/desertfire, https://github.com/albanD
2023-01-21 02:29:22 +00:00
9ad0aca6e5 Update aot_eager CI failures (#92696)
Based on https://hud.pytorch.org/pr/92689

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92696
Approved by: https://github.com/desertfire
2023-01-21 02:29:22 +00:00
1bf512017e Refactor test_inductor_benchmark into test_single_dynamo_benchmark helper (#92665)
I need this because I'm going to add a few more configurations
(not enabled by default, but to be run on periodic) and having this
better factored will make it easier.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92665
Approved by: https://github.com/Chillee, https://github.com/desertfire
2023-01-21 02:29:22 +00:00
85a1f0223a Add a warning about performance cost of set_default_device (#92703)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92703
Approved by: https://github.com/albanD
2023-01-21 02:23:13 +00:00
5c6f5439b7 Implement SymBool (#92149)
We have known for a while that we should in principle support SymBool as a separate concept from SymInt and SymFloat ( in particular, every distinct numeric type should get its own API). However, recent work with unbacked SymInts in, e.g., https://github.com/pytorch/pytorch/pull/90985 have made this a priority to implement. The essential problem is that our logic for computing the contiguity of tensors performs branches on the passed in input sizes, and this causes us to require guards when constructing tensors from unbacked SymInts. Morally, this should not be a big deal because, we only really care about the regular (non-channels-last) contiguity of the tensor, which should be guaranteed since most people aren't calling `empty_strided` on the tensor, however, because we store a bool (not a SymBool, prior to this PR it doesn't exist) on TensorImpl, we are forced to *immediately* compute these values, even if the value ends up not being used at all. In particular, even when a user allocates a contiguous tensor, we still must compute channels-last contiguity (as some contiguous tensors are also channels-last contiguous, but others are not.)

This PR implements SymBool, and makes TensorImpl use SymBool to store the contiguity information in ExtraMeta. There are a number of knock on effects, which I now discuss below.

* I introduce a new C++ type SymBool, analogous to SymInt and SymFloat. This type supports logical and, logical or and logical negation. I support the bitwise operations on this class (but not the conventional logic operators) to make it clear that logical operations on SymBool are NOT short-circuiting. I also, for now, do NOT support implicit conversion of SymBool to bool (creating a guard in this case). This does matter too much in practice, as in this PR I did not modify the equality operations (e.g., `==` on SymInt) to return SymBool, so all preexisting implicit guards did not need to be changed. I also introduced symbolic comparison functions `sym_eq`, etc. on SymInt to make it possible to create SymBool. The current implementation of comparison functions makes it unfortunately easy to accidentally introduce guards when you do not mean to (as both `s0 == s1` and `s0.sym_eq(s1)` are valid spellings of equality operation); in the short term, I intend to prevent excess guarding in this situation by unit testing; in the long term making the equality operators return SymBool is probably the correct fix.
* ~~I modify TensorImpl to store SymBool for the `is_contiguous` fields and friends on `ExtraMeta`. In practice, this essentially meant reverting most of the changes from https://github.com/pytorch/pytorch/pull/85936 . In particular, the fields on ExtraMeta are no longer strongly typed; at the time I was particularly concerned about the giant lambda I was using as the setter getting a desynchronized argument order, but now that I have individual setters for each field the only "big list" of boolean arguments is in the constructor of ExtraMeta, which seems like an acceptable risk. The semantics of TensorImpl are now that we guard only when you actually attempt to access the contiguity of the tensor via, e.g., `is_contiguous`. By in large, the contiguity calculation in the implementations now needs to be duplicated (as the boolean version can short circuit, but the SymBool version cannot); you should carefully review the duplicate new implementations. I typically use the `identity` template to disambiguate which version of the function I need, and rely on overloading to allow for implementation sharing. The changes to the `compute_` functions are particularly interesting; for most of the functions, I preserved their original non-symbolic implementation, and then introduce a new symbolic implementation that is branch-less (making use of our new SymBool operations). However, `compute_non_overlapping_and_dense` is special, see next bullet.~~ This appears to cause performance problems, so I am leaving this to an update PR.
* (Update: the Python side pieces for this are still in this PR, but they are not wired up until later PRs.) While the contiguity calculations are relatively easy to write in a branch-free way, `compute_non_overlapping_and_dense` is not: it involves a sort on the strides. While in principle we can still make it go through by using a data oblivious sorting network, this seems like too much complication for a field that is likely never used (because typically, it will be obvious that a tensor is non overlapping and dense, because the tensor is contiguous.) So we take a different approach: instead of trying to trace through the logic computation of non-overlapping and dense, we instead introduce a new opaque operator IsNonOverlappingAndDenseIndicator which represents all of the compute that would have been done here. This function returns an integer 0 if `is_non_overlapping_and_dense` would have returned `False`, and an integer 1 otherwise, for technical reasons (Sympy does not easily allow defining custom functions that return booleans). The function itself only knows how to evaluate itself if all of its arguments are integers; otherwise it is left unevaluated. This means we can always guard on it (as `size_hint` will always be able to evaluate through it), but otherwise its insides are left a black box. We typically do NOT expect this custom function to show up in actual boolean expressions, because we will typically shortcut it due to the tensor being contiguous. It's possible we should apply this treatment to all of the other `compute_` operations, more investigation necessary. As a technical note, because this operator takes a pair of a list of SymInts, we need to support converting `ArrayRef<SymNode>` to Python, and I also unpack the pair of lists into a single list because I don't know if Sympy operations can actually validly take lists of Sympy expressions as inputs. See for example `_make_node_sizes_strides`
* On the Python side, we also introduce a SymBool class, and update SymNode to track bool as a valid pytype. There is some subtlety here: bool is a subclass of int, so one has to be careful about `isinstance` checks (in fact, in most cases I replaced `isinstance(x, int)` with `type(x) is int` for expressly this reason.) Additionally, unlike, C++, I do NOT define bitwise inverse on SymBool, because it does not do the correct thing when run on booleans, e.g., `~True` is `-2`. (For that matter, they don't do the right thing in C++ either, but at least in principle the compiler can warn you about it with `-Wbool-operation`, and so the rule is simple in C++; only use logical operations if the types are statically known to be SymBool). Alas, logical negation is not overrideable, so we have to introduce `sym_not` which must be used in place of `not` whenever a SymBool can turn up. To avoid confusion with `__not__` which may imply that `operators.__not__` might be acceptable to use (it isn't), our magic method is called `__sym_not__`. The other bitwise operators `&` and `|` do the right thing with booleans and are acceptable to use.
* There is some annoyance working with booleans in Sympy. Unlike int and float, booleans live in their own algebra and they support less operations than regular numbers. In particular, `sympy.expand` does not work on them. To get around this, I introduce `safe_expand` which only calls expand on operations which are known to be expandable.

TODO: this PR appears to greatly regress performance of symbolic reasoning. In particular, `python test/functorch/test_aotdispatch.py -k max_pool2d` performs really poorly with these changes. Need to investigate.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92149
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-01-21 02:21:56 +00:00
34e8eb229d Dispatch the auxiliary frobenius_norm and nuclear_norm to better implementations and deprecate them (#81763)
These functions will be legacy functions. We deprecate them, but we also
take this chance to dispatch to a more efficient and consistent implementation.
Doing so should help write a conversion rule for these so that we can
remove them once and for all.
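
A minimal sketch of the modern replacements for the deprecated helpers:

```python
import torch

A = torch.randn(3, 3)

# torch.linalg.matrix_norm covers both cases the legacy helpers handled.
fro = torch.linalg.matrix_norm(A, ord="fro")  # instead of torch.frobenius_norm(A)
nuc = torch.linalg.matrix_norm(A, ord="nuc")  # instead of torch.nuclear_norm(A)
```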

Differential Revision: [D42354776](https://our.internmc.facebook.com/intern/diff/D42354776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81763
Approved by: https://github.com/ngimel
2023-01-21 01:03:50 +00:00
1af40d5108 [cublas][cublasLt] Fall back to unfused addmm for 2-byte-aligned inputs (#92201)
Fix for this issue surfaced from the discuss forum: https://discuss.pytorch.org/t/cuda-error-cublas-status-not-supported-when-calling-cublasltmatmul-from-torch-nn-functional-linear/170214

Note that PyTorch builds before #71200 should not be affected as there was no `cublasLt` dispatch path. Additionally, the provided repro has the quirk of using a 3D input, which means it will not dispatch to `cublasLt`-backed `addmm` until builds that include #72728. Changing the input to 2D by trivially removing the size `1` dimension will surface the failure on builds after #71200.

Interestingly, the use-case where _all_ inputs are 2-byte aligned is supported (runs without crashing), but the case where some are > 2-byte aligned and some are == 2-byte aligned is not. This behavior suggests that the `cuBlasLt` heuristics are incorrect, as the heuristic function has visibility of the raw pointer values via the descriptors when it is called.

We will follow up with `cuBlasLt` but this fix is needed to prevent unnecessary crashes for now.

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92201
Approved by: https://github.com/ngimel
2023-01-21 00:32:02 +00:00
a74c8df7cd [quant][fx][pt2e][be] Store node_name_to_target_dtype to node.meta["target_dtype_info"] (#92574)
Summary:
This is in preparation for the quantize_pt2e API, where we allow programmability for users to set how they want to quantize their model.

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizePT2E

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92574
Approved by: https://github.com/jcaip
2023-01-21 00:27:15 +00:00
de0375e79d [optim][foreach] Do NOT inplace modify gradients (#92706)
SGD and ASGD already had out-of-place grads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92706
Approved by: https://github.com/ngimel, https://github.com/albanD
2023-01-21 00:12:28 +00:00
2b885e1f6c [optim][NAdam] Fix discrepancy between mt vs st impl (#92699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92699
Approved by: https://github.com/albanD
2023-01-21 00:12:28 +00:00
896b6d8768 fix the formatting of runtime error msg in prims _cat_meta (#92124)
Summary:
Easy fix on formatting. For example,
> BackendCompilerFailed: compile_fx raised RuntimeError: Sizes of tensors must match except in dimension 0. Expected {common_length} but got {length} for tensor number {tensor_idx} in the list
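
The literal placeholders above suggest a missing f-string prefix; a minimal sketch of the symptom (values are illustrative):

```python
common_length, length, tensor_idx = 4, 3, 1

broken = "Expected {common_length} but got {length} for tensor number {tensor_idx} in the list"
fixed = f"Expected {common_length} but got {length} for tensor number {tensor_idx} in the list"
print(broken)  # placeholders printed verbatim, as in the quoted error above
print(fixed)   # Expected 4 but got 3 for tensor number 1 in the list
```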

Reviewed By: Yuzhen11

Differential Revision: D42491648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92124
Approved by: https://github.com/malfet
2023-01-20 23:26:02 +00:00
703265e599 Shard mac to 3 (#91277)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91277
Approved by: https://github.com/huydhn
2023-01-20 22:51:23 +00:00
d6c3468f70 Don't allow recomputing a node that *must* be materialized in the backwards pass (#90896)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90896
Approved by: https://github.com/ezyang
2023-01-20 22:34:41 +00:00
97b7e4cdd5 Fix GroupNorm backward prop on CUDA (#92671)
Fixes regression introduced by https://github.com/pytorch/pytorch/pull/89485

Adds a test to prevent those regressions from happening in the future. In the process, discovered that GroupNormBackwards on CPU does not produce the same results if the input and gradient memory_format differ.

Fixes #92166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92671
Approved by: https://github.com/ngimel, https://github.com/xuzhao9
2023-01-20 22:22:01 +00:00
8c0289a61c [CUDA][CUBLAS][BFloat16] Tentatively disable reduced precision reductions for some matmul tests (#92599)
We've observed some failures in numerical checks on newer compute capabilities stemming from cuBLAS allowing reduced precision reductions.

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92599
Approved by: https://github.com/ngimel
2023-01-20 22:19:11 +00:00
5644059489 [inductor] Lower torch.exp2 and use it for torch.pow(2, x) (#92632)
Before
```python
    tmp0 = 2.0
    tmp2 = tl.libdevice.pow(tmp0, tmp1)
```

After
```python
    tmp1 = tl.libdevice.exp2(tmp0)
```

I've benchmarked on CPU and CUDA with the following examples
```python
@torch._dynamo.optimize()
def exp2(x):
    return torch.pow(2, x)

@torch._dynamo.optimize()
def logaddexp2(a, b):
    m = torch.maximum(a, b)
    return m + torch.log2(1 + torch.pow(2, -torch.abs(a-b)))
```

On CUDA, Triton is able to specialize `pow(2, x)` such that this change makes
no difference, but on CPU I see a surprisingly large speedup.

| device | Function  | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------------|--------------|---------|
| CUDA   | exp2      | 64          | 63           | 1.0     |
|        | logaddexp | 109         | 107          | 1.0     |
| CPU    | exp2      | 220         | 40           | 5.5     |
|        | logaddexp | 282         | 140          | 2.0     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92632
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-01-20 22:06:23 +00:00
5a1344407a Add GHA side support for ciflow/inductor-perf-test-nightly (#92693)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92693
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2023-01-20 22:01:24 +00:00
a3efa9d740 Create autograd Function for aot_autograd backward only when needed (#92688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92688
Approved by: https://github.com/bdhirsh
2023-01-20 21:55:23 +00:00
eee2869ea7 [PT-D][checkpoint] Resolve no such file or directory issue when checkpointing on multi hosts (#92553)
Previously, we only created the directory on rank 0. Therefore, when running on multiple hosts with multiple GPUs, the other hosts would run into "No such file or directory" errors.

This is the fix for it.
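
A hedged sketch of the idea (the real change lives in the distributed checkpoint writer; the helper below is hypothetical):

```python
import os

def ensure_checkpoint_dir(path: str) -> None:
    # Create the target directory with exist_ok=True so every host can ensure
    # it exists, instead of relying on global rank 0 (which only helps ranks
    # that share rank 0's filesystem) having created it.
    os.makedirs(path, exist_ok=True)
```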
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92553
Approved by: https://github.com/kumpera
2023-01-20 21:54:04 +00:00
e4d83d54a6 Foreach gradient clipping (#91846)
Faster gradient clipping using the foreach functions

```
[------------------------ (tensors, scalar) -------------------------]
                                   |  without foreach  |  with foreach |    apex
1 threads: ----------------------------------------------------------------------
      10 tensors of size 4         |         120.5     |       61.1    |     50.3
      100 tensors of size 4        |         946.2     |      239.5    |    136.3
      1000 tensors of size 4       |        9808.5     |     2151.1    |   1006.9
      10000 tensors of size 4      |       96871.2     |    22637.4    |  10119.1
      10 tensors of size 16        |         121.0     |       64.1    |     52.5
      100 tensors of size 16       |         993.4     |      252.6    |    136.7
      1000 tensors of size 16      |        9427.7     |     2151.2    |   1049.5
      10000 tensors of size 16     |       97437.1     |    22203.1    |  10340.0
      10 tensors of size 256       |         118.9     |       62.3    |     51.5
      100 tensors of size 256      |         955.2     |      243.1    |    134.2
      1000 tensors of size 256     |        9374.9     |     2140.7    |   1009.6
      10000 tensors of size 256    |       95302.5     |    21849.4    |  10215.5
      10 tensors of size 65536     |         118.5     |       62.4    |     51.1
      100 tensors of size 65536    |        1740.7     |      243.3    |    225.3
      1000 tensors of size 65536   |       17364.1     |     2228.7    |   2004.5
      10000 tensors of size 65536  |      177510.1     |    25410.4    |  20678.2
```
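
For reference, a hedged usage sketch (assuming the foreach fast path is opted into via a `foreach=True` keyword on `clip_grad_norm_`, per this PR's intent):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)
model(torch.randn(4, 16)).sum().backward()

# foreach=True batches the per-parameter norm and scaling work into
# torch._foreach_* calls instead of a Python loop over parameters.
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, foreach=True)
```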
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91846
Approved by: https://github.com/janeyx99
2023-01-20 21:43:29 +00:00
44b7a0b7ef Clean up argparser help (benchmarks/dynamo/distributed.py) (#92687)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92687
Approved by: https://github.com/davidberard98
2023-01-20 21:23:49 +00:00
9db4323e4c Deprecate capture hooks except distributed use case (#92653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92653
Approved by: https://github.com/albanD
2023-01-20 20:51:46 +00:00
c4501593c3 Delete get_pyobj() entirely (#92638)
Opt for the shorter and more direct node attribute access.

I need to do this because I'm going to publicly document
SymInt and SymFloat but I don't want to doc get_pyobj().

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92638
Approved by: https://github.com/Chillee, https://github.com/albanD, https://github.com/voznesenskym, https://github.com/bdhirsh
2023-01-20 19:06:56 +00:00
5610766044 Mark test monitoring as an optional process (#92658)
This is an optional step that is OK to ignore when PyPI becomes flaky.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92658
Approved by: https://github.com/clee2000
2023-01-20 18:59:56 +00:00
8b3e35ea4a Revert "Run dynamo/test_dynamic_shapes serially (#92215)"
This reverts commit ea1007b89cb86551c80ddfd38db0bb3ade32140b.

Reverted https://github.com/pytorch/pytorch/pull/92215 on behalf of https://github.com/huydhn due to This is not needed anymore as https://github.com/pytorch/pytorch/issues/92196 has been root caused to test ordering
2023-01-20 18:54:13 +00:00
fb3d9f39cc update vmap to accept nones (#91644)
* Fixes https://github.com/pytorch/functorch/issues/1082
* Fixes https://github.com/pytorch/functorch/issues/439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91644
Approved by: https://github.com/kshitij12345, https://github.com/Chillee
2023-01-20 18:25:22 +00:00
2fb328eb46 [Dynamo] Preserve source_fn in node.meta (#92399)
Sample value from the test case `test_export_with_stack_trace`

node.target | node.meta["source_fn"]
-- | --
aten.randn.default | <built-in method randn of type object at 0x7f8683263108>
aten.t.default | < built-in function linear >
aten.mm.default | < built-in function linear >
aten.cos.default | <built-in method cos of type object at 0x7f8683263108>
aten.relu.default | relu
aten.add.Tensor | < built-in function add >

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92399
Approved by: https://github.com/jerryzh168, https://github.com/yanboliang
2023-01-20 18:23:39 +00:00
dd760c98f8 [decomp] Use new squeeze.dims overload in decompositions (#91602)
This removes the now-redundant `_squeeze_multiple` helpers and instead decomposes into a single call to `aten::squeeze.dims`, which also has the effect of reducing the lowered graph size in Inductor.
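
For illustration, a hedged sketch of the overload being targeted (user-facing behavior only, not the decomposition code itself; shapes are arbitrary):

```python
import torch

x = torch.randn(1, 3, 1, 5)

# aten::squeeze.dims removes several size-1 dimensions in one call instead of
# chaining multiple squeeze(dim) calls.
y = x.squeeze(dim=(0, 2))  # shape (3, 5)
```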
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91602
Approved by: https://github.com/ngimel
2023-01-20 18:08:18 +00:00
2af2952c66 logaddexp2: Use log1p and exp2 (#92116)
This replaces `log2(1 + x)` with `log1p(x) * (1 / log(2))`, which improves
precision when `x` is small by avoiding the rounding error introduced when
computing `1 + x`. Note that `x` is always `<= 1` in this formula.

This also replaces `pow(2, x)` with `exp2(x)` which improves performance,
particularly on CPU where the constant value cannot be inlined into Sleef.
With numel=1e7 for example, I see a 1.35x speedup on CPU.
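
A small pure-Python reference of the rewritten formula (not the actual kernel code; it only restates the identities above):

```python
import math

def logaddexp2_ref(a: float, b: float) -> float:
    # logaddexp2(a, b) = max(a, b) + log2(1 + 2**(-|a - b|))
    m = max(a, b)
    x = 2.0 ** (-abs(a - b))  # exp2 fast path; x is always <= 1
    # log1p avoids losing x's low bits when forming 1 + x
    return m + math.log1p(x) * (1.0 / math.log(2.0))
```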

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92116
Approved by: https://github.com/lezcano
2023-01-20 18:04:27 +00:00
67bb5236da lint fix (#92685)
This linter error was introduced in https://github.com/pytorch/pytorch/pull/91821

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92685
Approved by: https://github.com/weiwangmeta, https://github.com/malfet
2023-01-20 17:26:37 +00:00
2891cecd8d Revert "Add meta kernel coverage for aten.unsafe_split, aten.unsafe_chunk (#92608)"
This reverts commit 4386f317b92a400cabc6a25b5849466475eec1a9.

Reverted https://github.com/pytorch/pytorch/pull/92608 on behalf of https://github.com/ZainRizvi due to test_aot_autograd_symbolic_exhaustive_unsafe_split_cpu_float32 (__main__.TestEagerFusionOpInfoCPU) is failing consistently since this PR was merged
2023-01-20 17:17:35 +00:00
215f4fc355 Update android/README.md, how to build pytorch android from source (#92356)
`sh ./scripts/build_pytorch_android.sh` leads to the following error, because the script uses `source`, which plain `sh` does not provide:
```
./scripts/build_pytorch_android.sh: 30: source: not found
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92356
Approved by: https://github.com/soulitzer
2023-01-20 16:39:31 +00:00
b2ca2c8662 [optim][adagrad] group tensors in foreach to maximize perf (#92362)
another one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92362
Approved by: https://github.com/albanD
2023-01-20 16:24:39 +00:00
44132cc4b0 Revert "Add --timing flag, phase timing to @dynamo_timed (#92637)"
This reverts commit 773b5134359ae21957e1f5a37eb2cee620c74029.

Reverted https://github.com/pytorch/pytorch/pull/92637 on behalf of https://github.com/malfet due to Broke lint
2023-01-20 16:23:20 +00:00
5ac22782d1 Optimized vertical flip using memcpy (#89414)
## Description

- Use memcpy for vertical flip
- Added bool type support for horizontal flip
  - channels-last input with horizontal flip also goes through cpu_vflip_memcpy and gets a speed-up (see the sketch below)
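
For context, a hedged sketch of the user-facing calls these benchmarks exercise (the memcpy fast path itself is internal to the CPU kernels):

```python
import torch

x = torch.randint(0, 256, (3, 712, 712), dtype=torch.uint8)

vflip = torch.flip(x, dims=(-2,))  # vertical flip: the memcpy path added here
hflip = torch.flip(x, dims=(-1,))  # horizontal flip
```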

Previous PRs:
- https://github.com/pytorch/pytorch/pull/90013
- https://github.com/pytorch/pytorch/pull/88989

## Results

### Horizontal flip

- AVX2 (channels last input only)
```
[------------------------------------------------------------------------- Horizontal flip -------------------------------------------------------------------------]
                                                                      |  torch (1.14.0a0+giteb3e189) PR  |    Pillow (9.3.0)   |  torch (1.14.0a0+gitb0bd5c4) nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------
      channels=3, size=256, dtype=torch.int64, mf=channels_last       |        204.813 (+-1.018)         |                     |           308.070 (+-1.573)
      channels=3, size=520, dtype=torch.int64, mf=channels_last       |        844.523 (+-2.302)         |                     |           1226.801 (+-5.069)
      channels=3, size=712, dtype=torch.int64, mf=channels_last       |        2246.512 (+-8.935)        |                     |          2689.692 (+-22.654)

      channels=1, size=256, dtype=torch.int32, mf=channels_last       |         21.024 (+-0.083)         |   44.196 (+-0.131)  |            22.564 (+-0.066)
      channels=1, size=520, dtype=torch.int32, mf=channels_last       |         71.806 (+-0.150)         |  166.653 (+-0.789)  |            72.660 (+-0.160)
      channels=1, size=712, dtype=torch.int32, mf=channels_last       |        129.354 (+-0.385)         |  306.998 (+-0.819)  |           130.094 (+-0.274)

      channels=3, size=256, dtype=torch.uint8, mf=channels_last       |        177.250 (+-0.485)         |   44.232 (+-0.465)  |           289.201 (+-2.837)
      channels=3, size=520, dtype=torch.uint8, mf=channels_last       |        699.055 (+-1.940)         |  166.540 (+-0.903)  |           1172.747 (+-3.645)
      channels=3, size=712, dtype=torch.uint8, mf=channels_last       |        1302.968 (+-5.390)        |  307.210 (+-0.852)  |          2149.396 (+-23.570)

      channels=1, size=256, dtype=torch.int16, mf=channels_last       |         11.943 (+-0.079)         |                     |            12.451 (+-0.033)
      channels=1, size=520, dtype=torch.int16, mf=channels_last       |         39.830 (+-0.093)         |                     |            40.583 (+-0.070)
      channels=1, size=712, dtype=torch.int16, mf=channels_last       |         69.001 (+-0.078)         |                     |            69.590 (+-0.162)

      channels=3, size=256, dtype=torch.int8, mf=channels_last        |        177.378 (+-0.507)         |                     |           283.461 (+-2.957)
      channels=3, size=520, dtype=torch.int8, mf=channels_last        |        698.915 (+-1.840)         |                     |          1061.208 (+-10.449)
      channels=3, size=712, dtype=torch.int8, mf=channels_last        |        1299.365 (+-3.919)        |                     |          1957.424 (+-13.149)

      channels=3, size=256, dtype=torch.int8, mf=channels_first       |         17.955 (+-0.077)         |                     |            89.456 (+-0.285)
      channels=3, size=520, dtype=torch.int8, mf=channels_first       |         56.901 (+-0.081)         |                     |           339.802 (+-0.879)
      channels=3, size=712, dtype=torch.int8, mf=channels_first       |        103.629 (+-0.256)         |                     |           627.845 (+-1.185)

      channels=1, size=256, dtype=torch.float32, mf=channels_last     |         21.179 (+-0.077)         |   44.146 (+-0.260)  |            22.957 (+-0.138)
      channels=1, size=520, dtype=torch.float32, mf=channels_last     |         71.685 (+-0.155)         |  166.666 (+-0.730)  |            72.606 (+-0.124)
      channels=1, size=712, dtype=torch.float32, mf=channels_last     |        129.168 (+-0.288)         |  307.094 (+-1.571)  |           130.156 (+-0.453)

      channels=1, size=256, dtype=torch.float16, mf=channels_last     |         33.049 (+-0.089)         |                     |            33.056 (+-0.477)
      channels=1, size=520, dtype=torch.float16, mf=channels_last     |        116.635 (+-0.299)         |                     |           113.433 (+-0.891)
      channels=1, size=712, dtype=torch.float16, mf=channels_last     |        212.134 (+-0.413)         |                     |           204.394 (+-0.822)

      channels=3, size=256, dtype=torch.float64, mf=channels_last     |        207.214 (+-0.586)         |                     |           302.370 (+-0.670)
      channels=3, size=520, dtype=torch.float64, mf=channels_last     |        846.553 (+-2.301)         |                     |           1223.851 (+-5.280)
      channels=3, size=712, dtype=torch.float64, mf=channels_last     |        2251.687 (+-6.513)        |                     |          2711.557 (+-14.011)

      channels=1, size=256, dtype=torch.bfloat16, mf=channels_last    |         33.237 (+-0.072)         |                     |            33.101 (+-0.070)
      channels=1, size=520, dtype=torch.bfloat16, mf=channels_last    |        113.605 (+-0.337)         |                     |           117.067 (+-0.547)
      channels=1, size=712, dtype=torch.bfloat16, mf=channels_last    |        204.632 (+-0.487)         |                     |           212.590 (+-0.848)

      channels=1, size=256, dtype=torch.bool, mf=channels_last        |         7.950 (+-0.030)          |                     |            37.757 (+-0.080)
      channels=1, size=520, dtype=torch.bool, mf=channels_last        |         23.799 (+-0.080)         |                     |           136.571 (+-0.441)
      channels=1, size=712, dtype=torch.bool, mf=channels_last        |         37.970 (+-0.075)         |                     |           246.894 (+-0.926)

      channels=1, size=256, dtype=torch.bool, mf=channels_first       |         8.009 (+-0.077)          |                     |            37.800 (+-0.100)
      channels=1, size=520, dtype=torch.bool, mf=channels_first       |         23.861 (+-0.099)         |                     |           136.553 (+-0.519)
      channels=1, size=712, dtype=torch.bool, mf=channels_first       |         38.211 (+-0.104)         |                     |           246.939 (+-0.692)

Times are in microseconds (us).
```
[Source](https://gist.github.com/vfdev-5/c2ca615b522aeb1c4636dc8d948fec74#file-20221209-100405-pr_vs_nightly-md)

- AVX512 (channels last input only)
```
[---------------------------------------------------------------------------- Horizontal flip ----------------------------------------------------------------------------]
                                                                      |  torch (1.14.0a0+giteb3e189) PR  |    Pillow (9.3.0)    |  torch (1.14.0.dev20221208+cu116) nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------
      channels=3, size=256, dtype=torch.int64, mf=channels_last       |        194.708 (+-9.566)         |                      |             372.067 (+-12.430)
      channels=3, size=520, dtype=torch.int64, mf=channels_last       |        765.151 (+-10.098)        |                      |            1524.231 (+-111.283)
      channels=3, size=712, dtype=torch.int64, mf=channels_last       |       1587.229 (+-88.117)        |                      |            2950.081 (+-92.322)

      channels=1, size=256, dtype=torch.int32, mf=channels_last       |         13.328 (+-0.375)         |   49.693 (+-1.193)   |              10.323 (+-0.333)
      channels=1, size=520, dtype=torch.int32, mf=channels_last       |         90.580 (+-0.812)         |  191.936 (+-4.369)   |              92.269 (+-0.980)
      channels=1, size=712, dtype=torch.int32, mf=channels_last       |        163.821 (+-3.174)         |  352.053 (+-10.909)  |             165.661 (+-4.436)

      channels=3, size=256, dtype=torch.uint8, mf=channels_last       |        206.862 (+-4.417)         |   49.336 (+-1.492)   |             287.373 (+-7.266)
      channels=3, size=520, dtype=torch.uint8, mf=channels_last       |        829.736 (+-15.857)        |  191.489 (+-5.645)   |            1166.126 (+-45.667)
      channels=3, size=712, dtype=torch.uint8, mf=channels_last       |       1540.953 (+-28.269)        |  352.171 (+-8.784)   |            2171.570 (+-82.740)

      channels=1, size=256, dtype=torch.int16, mf=channels_last       |         7.856 (+-0.131)          |                      |              7.943 (+-0.148)
      channels=1, size=520, dtype=torch.int16, mf=channels_last       |         34.750 (+-1.195)         |                      |              36.309 (+-0.716)
      channels=1, size=712, dtype=torch.int16, mf=channels_last       |         85.858 (+-0.729)         |                      |              87.306 (+-0.981)

      channels=3, size=256, dtype=torch.int8, mf=channels_last        |        206.896 (+-5.716)         |                      |             262.551 (+-6.598)
      channels=3, size=520, dtype=torch.int8, mf=channels_last        |        828.212 (+-13.441)        |                      |            1077.916 (+-28.810)
      channels=3, size=712, dtype=torch.int8, mf=channels_last        |       1542.748 (+-31.379)        |                      |            2003.661 (+-71.614)

      channels=3, size=256, dtype=torch.int8, mf=channels_first       |         11.038 (+-0.271)         |                      |             126.867 (+-5.590)
      channels=3, size=520, dtype=torch.int8, mf=channels_first       |         90.190 (+-1.185)         |                      |             501.446 (+-13.498)
      channels=3, size=712, dtype=torch.int8, mf=channels_first       |        165.797 (+-3.016)         |                      |             921.131 (+-20.500)

      channels=1, size=256, dtype=torch.float32, mf=channels_last     |         13.516 (+-0.578)         |   49.678 (+-1.966)   |              10.360 (+-0.256)
      channels=1, size=520, dtype=torch.float32, mf=channels_last     |         91.195 (+-0.830)         |  191.778 (+-4.742)   |              91.117 (+-0.855)
      channels=1, size=712, dtype=torch.float32, mf=channels_last     |        168.551 (+-3.352)         |  351.585 (+-8.230)   |             164.199 (+-3.725)

      channels=1, size=256, dtype=torch.float16, mf=channels_last     |         35.832 (+-0.840)         |                      |              35.087 (+-0.972)
      channels=1, size=520, dtype=torch.float16, mf=channels_last     |        133.624 (+-5.293)         |                      |             131.423 (+-6.002)
      channels=1, size=712, dtype=torch.float16, mf=channels_last     |        240.702 (+-5.213)         |                      |             236.876 (+-7.867)

      channels=3, size=256, dtype=torch.float64, mf=channels_last     |        192.351 (+-6.740)         |                      |             313.999 (+-12.141)
      channels=3, size=520, dtype=torch.float64, mf=channels_last     |        766.553 (+-16.669)        |                      |            1270.797 (+-49.828)
      channels=3, size=712, dtype=torch.float64, mf=channels_last     |       1501.700 (+-69.499)        |                      |            2427.303 (+-126.694)

      channels=1, size=256, dtype=torch.bfloat16, mf=channels_last    |         35.386 (+-0.801)         |                      |              34.539 (+-0.844)
      channels=1, size=520, dtype=torch.bfloat16, mf=channels_last    |        132.369 (+-4.107)         |                      |             130.926 (+-3.597)
      channels=1, size=712, dtype=torch.bfloat16, mf=channels_last    |        237.722 (+-6.680)         |                      |             237.072 (+-5.027)

      channels=1, size=256, dtype=torch.bool, mf=channels_last        |         6.796 (+-0.132)          |                      |              44.727 (+-0.905)
      channels=1, size=520, dtype=torch.bool, mf=channels_last        |         24.827 (+-0.669)         |                      |             166.758 (+-5.141)
      channels=1, size=712, dtype=torch.bool, mf=channels_last        |         42.392 (+-0.980)         |                      |             310.830 (+-6.130)

      channels=1, size=256, dtype=torch.bool, mf=channels_first       |         8.114 (+-0.141)          |                      |              44.776 (+-0.707)
      channels=1, size=520, dtype=torch.bool, mf=channels_first       |         24.787 (+-0.787)         |                      |             167.766 (+-5.004)
      channels=1, size=712, dtype=torch.bool, mf=channels_first       |         42.545 (+-0.636)         |                      |             313.715 (+-7.603)

Times are in microseconds (us).
```
[Source](https://gist.github.com/vfdev-5/c2ca615b522aeb1c4636dc8d948fec74#file-20221209-105633-pr_vs_nightly-avx512-md)

### Vertical flip

- AVX2 (all tested cases show a speed-up or the same perf)
```
[-------------------------------------------------------------------------- Vertical flip --------------------------------------------------------------------------]
                                                                      |  torch (1.14.0a0+giteb3e189) PR  |    Pillow (9.3.0)   |  torch (1.14.0a0+gitb0bd5c4) nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------
      channels=3, size=256, dtype=torch.int64, mf=channels_last       |         93.125 (+-3.022)         |                     |           101.064 (+-0.436)
      channels=3, size=520, dtype=torch.int64, mf=channels_last       |        412.942 (+-57.066)        |                     |           461.463 (+-2.098)
      channels=3, size=712, dtype=torch.int64, mf=channels_last       |        1533.265 (+-4.071)        |                     |          1829.713 (+-14.311)

      channels=3, size=256, dtype=torch.int64, mf=channels_first      |        101.134 (+-0.924)         |                     |           102.858 (+-0.319)
      channels=3, size=520, dtype=torch.int64, mf=channels_first      |        421.679 (+-1.101)         |                     |           477.413 (+-1.809)
      channels=3, size=712, dtype=torch.int64, mf=channels_first      |        1550.418 (+-3.647)        |                     |           1877.143 (+-6.622)

      channels=1, size=256, dtype=torch.int32, mf=channels_last       |         20.961 (+-0.063)         |   19.515 (+-0.302)  |            21.980 (+-0.070)
      channels=1, size=520, dtype=torch.int32, mf=channels_last       |         71.199 (+-0.173)         |   70.199 (+-0.332)  |            95.262 (+-0.109)
      channels=1, size=712, dtype=torch.int32, mf=channels_last       |        128.532 (+-0.318)         |  127.325 (+-0.328)  |           167.190 (+-0.370)

      channels=1, size=256, dtype=torch.int32, mf=channels_first      |         21.206 (+-0.059)         |   19.471 (+-0.128)  |            21.469 (+-0.064)
      channels=1, size=520, dtype=torch.int32, mf=channels_first      |         71.284 (+-0.163)         |   70.124 (+-0.388)  |            94.988 (+-0.239)
      channels=1, size=712, dtype=torch.int32, mf=channels_first      |        129.017 (+-0.286)         |  128.088 (+-0.461)  |           167.115 (+-1.075)

      channels=3, size=256, dtype=torch.uint8, mf=channels_last       |         16.909 (+-0.057)         |   19.570 (+-0.353)  |            17.981 (+-0.072)
      channels=3, size=520, dtype=torch.uint8, mf=channels_last       |         55.163 (+-0.138)         |   70.218 (+-0.275)  |           107.938 (+-0.620)
      channels=3, size=712, dtype=torch.uint8, mf=channels_last       |         98.518 (+-0.121)         |  127.737 (+-0.486)  |           170.965 (+-0.436)

      channels=3, size=256, dtype=torch.uint8, mf=channels_first      |         18.150 (+-0.084)         |   19.758 (+-0.221)  |            18.122 (+-0.088)
      channels=3, size=520, dtype=torch.uint8, mf=channels_first      |         56.693 (+-0.200)         |   70.278 (+-0.386)  |            89.018 (+-0.206)
      channels=3, size=712, dtype=torch.uint8, mf=channels_first      |        100.409 (+-0.235)         |  127.772 (+-0.457)  |           168.072 (+-0.436)

      channels=1, size=256, dtype=torch.int16, mf=channels_last       |         12.817 (+-0.041)         |                     |            12.818 (+-0.049)
      channels=1, size=520, dtype=torch.int16, mf=channels_last       |         38.359 (+-0.081)         |                     |            63.378 (+-0.165)
      channels=1, size=712, dtype=torch.int16, mf=channels_last       |         68.246 (+-0.090)         |                     |           116.637 (+-0.583)

      channels=1, size=256, dtype=torch.int16, mf=channels_first      |         12.899 (+-0.054)         |                     |            12.649 (+-0.060)
      channels=1, size=520, dtype=torch.int16, mf=channels_first      |         38.404 (+-0.069)         |                     |            63.448 (+-0.108)
      channels=1, size=712, dtype=torch.int16, mf=channels_first      |         68.378 (+-0.104)         |                     |           116.415 (+-0.332)

      channels=3, size=256, dtype=torch.int8, mf=channels_last        |         17.071 (+-0.044)         |                     |            17.792 (+-0.050)
      channels=3, size=520, dtype=torch.int8, mf=channels_last        |         55.163 (+-0.100)         |                     |           108.539 (+-0.466)
      channels=3, size=712, dtype=torch.int8, mf=channels_last        |         98.537 (+-0.091)         |                     |           171.675 (+-0.553)

      channels=3, size=256, dtype=torch.int8, mf=channels_first       |         17.837 (+-0.071)         |                     |            18.355 (+-0.067)
      channels=3, size=520, dtype=torch.int8, mf=channels_first       |         56.051 (+-0.087)         |                     |            88.261 (+-0.129)
      channels=3, size=712, dtype=torch.int8, mf=channels_first       |        100.603 (+-0.245)         |                     |           169.067 (+-0.430)

      channels=1, size=256, dtype=torch.float32, mf=channels_last     |         21.204 (+-0.063)         |   19.607 (+-0.140)  |            22.202 (+-0.094)
      channels=1, size=520, dtype=torch.float32, mf=channels_last     |         71.356 (+-0.211)         |   69.844 (+-0.343)  |            94.614 (+-0.167)
      channels=1, size=712, dtype=torch.float32, mf=channels_last     |        129.087 (+-0.290)         |  127.065 (+-0.319)  |           166.513 (+-0.444)

      channels=1, size=256, dtype=torch.float32, mf=channels_first    |         21.196 (+-0.065)         |   19.156 (+-0.132)  |            21.516 (+-0.073)
      channels=1, size=520, dtype=torch.float32, mf=channels_first    |         71.422 (+-0.180)         |   70.296 (+-0.136)  |            94.913 (+-0.095)
      channels=1, size=712, dtype=torch.float32, mf=channels_first    |        129.045 (+-0.312)         |  128.023 (+-0.585)  |           166.089 (+-0.409)

      channels=1, size=256, dtype=torch.float16, mf=channels_last     |         12.770 (+-0.045)         |                     |            34.853 (+-0.089)
      channels=1, size=520, dtype=torch.float16, mf=channels_last     |         38.363 (+-0.064)         |                     |           131.969 (+-0.577)
      channels=1, size=712, dtype=torch.float16, mf=channels_last     |         67.954 (+-0.107)         |                     |           239.507 (+-0.835)

      channels=1, size=256, dtype=torch.float16, mf=channels_first    |         12.855 (+-0.067)         |                     |            35.124 (+-0.109)
      channels=1, size=520, dtype=torch.float16, mf=channels_first    |         38.725 (+-0.079)         |                     |           131.708 (+-0.586)
      channels=1, size=712, dtype=torch.float16, mf=channels_first    |         68.931 (+-0.086)         |                     |           239.022 (+-0.914)

      channels=3, size=256, dtype=torch.float64, mf=channels_last     |         90.277 (+-0.083)         |                     |           101.512 (+-0.285)
      channels=3, size=520, dtype=torch.float64, mf=channels_last     |        421.277 (+-1.030)         |                     |           471.913 (+-3.654)
      channels=3, size=712, dtype=torch.float64, mf=channels_last     |        1534.394 (+-7.572)        |                     |          1833.262 (+-12.185)

      channels=3, size=256, dtype=torch.float64, mf=channels_first    |        100.809 (+-0.328)         |                     |           103.166 (+-0.335)
      channels=3, size=520, dtype=torch.float64, mf=channels_first    |        425.535 (+-0.926)         |                     |           482.606 (+-1.450)
      channels=3, size=712, dtype=torch.float64, mf=channels_first    |        1550.832 (+-3.547)        |                     |           1859.098 (+-6.517)

      channels=1, size=256, dtype=torch.bfloat16, mf=channels_last    |         12.954 (+-0.051)         |                     |            12.744 (+-0.046)
      channels=1, size=520, dtype=torch.bfloat16, mf=channels_last    |         41.180 (+-0.064)         |                     |            63.362 (+-0.139)
      channels=1, size=712, dtype=torch.bfloat16, mf=channels_last    |         68.136 (+-0.142)         |                     |           117.009 (+-0.292)

      channels=1, size=256, dtype=torch.bfloat16, mf=channels_first   |         13.049 (+-0.052)         |                     |            12.792 (+-0.076)
      channels=1, size=520, dtype=torch.bfloat16, mf=channels_first   |         38.488 (+-0.092)         |                     |            63.451 (+-0.096)
      channels=1, size=712, dtype=torch.bfloat16, mf=channels_first   |         68.103 (+-0.091)         |                     |           116.693 (+-0.290)

      channels=1, size=256, dtype=torch.bool, mf=channels_last        |         7.572 (+-0.029)          |                     |            8.017 (+-0.071)
      channels=1, size=520, dtype=torch.bool, mf=channels_last        |         22.121 (+-0.061)         |                     |            23.614 (+-0.074)
      channels=1, size=712, dtype=torch.bool, mf=channels_last        |         36.896 (+-0.094)         |                     |            39.460 (+-0.084)

      channels=1, size=256, dtype=torch.bool, mf=channels_first       |         7.671 (+-0.028)          |                     |            8.034 (+-0.058)
      channels=1, size=520, dtype=torch.bool, mf=channels_first       |         21.989 (+-0.053)         |                     |            23.645 (+-0.063)
      channels=1, size=712, dtype=torch.bool, mf=channels_first       |         37.252 (+-0.072)         |                     |            39.477 (+-0.100)

      channels=1, size=256, dtype=torch.complex64, mf=channels_last   |         37.129 (+-0.052)         |                     |            37.801 (+-0.101)
      channels=1, size=520, dtype=torch.complex64, mf=channels_last   |        122.646 (+-0.230)         |                     |           139.074 (+-0.467)
      channels=1, size=712, dtype=torch.complex64, mf=channels_last   |        228.946 (+-0.736)         |                     |           257.589 (+-0.545)

      channels=1, size=256, dtype=torch.complex64, mf=channels_first  |         37.088 (+-0.070)         |                     |            37.894 (+-0.078)
      channels=1, size=520, dtype=torch.complex64, mf=channels_first  |        122.695 (+-0.268)         |                     |           138.933 (+-0.336)
      channels=1, size=712, dtype=torch.complex64, mf=channels_first  |        234.655 (+-0.454)         |                     |           255.787 (+-0.530)

Times are in microseconds (us).
```
[Source](https://gist.github.com/vfdev-5/c2ca615b522aeb1c4636dc8d948fec74#file-20221209-100440-pr_vs_nightly-md)

- AVX512 (all tested cases show a speed-up or the same perf)

```
[---------------------------------------------------------------------------- Vertical flip -----------------------------------------------------------------------------]
                                                                      |  torch (1.14.0a0+giteb3e189) PR  |    Pillow (9.3.0)   |  torch (1.14.0.dev20221208+cu116) nightly
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------
      channels=3, size=256, dtype=torch.int64, mf=channels_last       |        122.544 (+-1.962)         |                     |             129.161 (+-1.809)
      channels=3, size=520, dtype=torch.int64, mf=channels_last       |        508.274 (+-4.790)         |                     |             533.872 (+-7.457)
      channels=3, size=712, dtype=torch.int64, mf=channels_last       |        951.176 (+-29.534)        |                     |            1073.603 (+-44.676)

      channels=3, size=256, dtype=torch.int64, mf=channels_first      |        127.872 (+-2.700)         |                     |             127.326 (+-2.666)
      channels=3, size=520, dtype=torch.int64, mf=channels_first      |        518.019 (+-4.157)         |                     |             538.094 (+-6.600)
      channels=3, size=712, dtype=torch.int64, mf=channels_first      |       1002.176 (+-42.545)        |                     |            1033.989 (+-42.137)

      channels=1, size=256, dtype=torch.int32, mf=channels_last       |         10.025 (+-0.135)         |   10.054 (+-0.369)  |              10.155 (+-0.285)
      channels=1, size=520, dtype=torch.int32, mf=channels_last       |         89.867 (+-0.994)         |   88.712 (+-0.622)  |             103.029 (+-2.254)
      channels=1, size=712, dtype=torch.int32, mf=channels_last       |        161.787 (+-2.080)         |  161.370 (+-1.801)  |             182.608 (+-7.031)

      channels=1, size=256, dtype=torch.int32, mf=channels_first      |         10.005 (+-0.277)         |   9.965 (+-0.338)   |              10.604 (+-0.334)
      channels=1, size=520, dtype=torch.int32, mf=channels_first      |         89.116 (+-0.996)         |   88.840 (+-0.608)  |             102.103 (+-2.111)
      channels=1, size=712, dtype=torch.int32, mf=channels_first      |        164.328 (+-3.284)         |  161.538 (+-2.739)  |             181.702 (+-3.770)

      channels=3, size=256, dtype=torch.uint8, mf=channels_last       |         8.853 (+-0.148)          |   10.292 (+-0.494)  |              8.961 (+-0.190)
      channels=3, size=520, dtype=torch.uint8, mf=channels_last       |         68.368 (+-1.158)         |   90.068 (+-1.780)  |              81.155 (+-0.945)
      channels=3, size=712, dtype=torch.uint8, mf=channels_last       |        125.458 (+-2.511)         |  163.150 (+-2.532)  |             147.039 (+-4.264)

      channels=3, size=256, dtype=torch.uint8, mf=channels_first      |         10.409 (+-0.435)         |   10.406 (+-0.351)  |              10.263 (+-0.252)
      channels=3, size=520, dtype=torch.uint8, mf=channels_first      |         69.077 (+-1.062)         |   90.057 (+-0.992)  |              79.910 (+-0.884)
      channels=3, size=712, dtype=torch.uint8, mf=channels_first      |        127.286 (+-2.789)         |  162.862 (+-2.953)  |             142.821 (+-2.119)

      channels=1, size=256, dtype=torch.int16, mf=channels_last       |         7.513 (+-0.143)          |                     |              7.364 (+-0.154)
      channels=1, size=520, dtype=torch.int16, mf=channels_last       |         33.140 (+-0.779)         |                     |              42.141 (+-0.820)
      channels=1, size=712, dtype=torch.int16, mf=channels_last       |         86.235 (+-1.187)         |                     |             104.205 (+-2.205)

      channels=1, size=256, dtype=torch.int16, mf=channels_first      |         7.410 (+-0.162)          |                     |              7.075 (+-0.126)
      channels=1, size=520, dtype=torch.int16, mf=channels_first      |         33.656 (+-0.914)         |                     |              40.991 (+-0.893)
      channels=1, size=712, dtype=torch.int16, mf=channels_first      |         86.087 (+-1.191)         |                     |             105.419 (+-1.801)

      channels=3, size=256, dtype=torch.int8, mf=channels_last        |         8.802 (+-0.196)          |                     |              8.627 (+-0.202)
      channels=3, size=520, dtype=torch.int8, mf=channels_last        |         66.348 (+-0.775)         |                     |              80.631 (+-1.832)
      channels=3, size=712, dtype=torch.int8, mf=channels_last        |        126.275 (+-2.318)         |                     |             144.597 (+-4.242)

      channels=3, size=256, dtype=torch.int8, mf=channels_first       |         10.255 (+-0.383)         |                     |              10.101 (+-0.335)
      channels=3, size=520, dtype=torch.int8, mf=channels_first       |         68.124 (+-0.849)         |                     |              79.286 (+-0.748)
      channels=3, size=712, dtype=torch.int8, mf=channels_first       |        127.118 (+-2.225)         |                     |             142.029 (+-2.507)

      channels=1, size=256, dtype=torch.float32, mf=channels_last     |         9.850 (+-0.453)          |   9.299 (+-0.253)   |              10.030 (+-0.234)
      channels=1, size=520, dtype=torch.float32, mf=channels_last     |         91.506 (+-1.319)         |   90.265 (+-0.824)  |             107.570 (+-2.093)
      channels=1, size=712, dtype=torch.float32, mf=channels_last     |        167.820 (+-3.883)         |  162.871 (+-2.397)  |             180.046 (+-8.952)

      channels=1, size=256, dtype=torch.float32, mf=channels_first    |         10.118 (+-0.359)         |   10.433 (+-0.479)  |              10.204 (+-0.344)
      channels=1, size=520, dtype=torch.float32, mf=channels_first    |         90.862 (+-1.486)         |   90.138 (+-0.969)  |             107.011 (+-1.801)
      channels=1, size=712, dtype=torch.float32, mf=channels_first    |        163.931 (+-3.653)         |  163.155 (+-2.673)  |             186.707 (+-2.248)

      channels=1, size=256, dtype=torch.float16, mf=channels_last     |         7.304 (+-0.134)          |                     |              24.141 (+-0.444)
      channels=1, size=520, dtype=torch.float16, mf=channels_last     |         35.186 (+-0.656)         |                     |             101.523 (+-1.465)
      channels=1, size=712, dtype=torch.float16, mf=channels_last     |         85.707 (+-0.841)         |                     |             192.640 (+-4.942)

      channels=1, size=256, dtype=torch.float16, mf=channels_first    |         7.286 (+-0.142)          |                     |              24.155 (+-0.555)
      channels=1, size=520, dtype=torch.float16, mf=channels_first    |         33.819 (+-1.009)         |                     |             101.620 (+-3.034)
      channels=1, size=712, dtype=torch.float16, mf=channels_first    |         84.811 (+-0.993)         |                     |             192.286 (+-4.707)

      channels=3, size=256, dtype=torch.float64, mf=channels_last     |        126.273 (+-2.519)         |                     |             128.831 (+-1.975)
      channels=3, size=520, dtype=torch.float64, mf=channels_last     |        551.861 (+-4.159)         |                     |             517.343 (+-4.501)
      channels=3, size=712, dtype=torch.float64, mf=channels_last     |       1102.465 (+-66.427)        |                     |            1224.532 (+-55.656)

      channels=3, size=256, dtype=torch.float64, mf=channels_first    |        129.965 (+-2.083)         |                     |             130.709 (+-2.261)
      channels=3, size=520, dtype=torch.float64, mf=channels_first    |        526.332 (+-5.354)         |                     |             515.399 (+-4.320)
      channels=3, size=712, dtype=torch.float64, mf=channels_first    |       1169.215 (+-78.889)        |                     |            1102.536 (+-51.178)

      channels=1, size=256, dtype=torch.bfloat16, mf=channels_last    |         7.478 (+-0.147)          |                     |              7.154 (+-0.162)
      channels=1, size=520, dtype=torch.bfloat16, mf=channels_last    |         33.836 (+-1.022)         |                     |              38.854 (+-0.648)
      channels=1, size=712, dtype=torch.bfloat16, mf=channels_last    |         85.483 (+-0.582)         |                     |              99.190 (+-2.202)

      channels=1, size=256, dtype=torch.bfloat16, mf=channels_first   |         7.416 (+-0.125)          |                     |              7.169 (+-0.121)
      channels=1, size=520, dtype=torch.bfloat16, mf=channels_first   |         34.958 (+-0.717)         |                     |              40.136 (+-0.784)
      channels=1, size=712, dtype=torch.bfloat16, mf=channels_first   |         85.505 (+-1.207)         |                     |              99.793 (+-2.065)

      channels=1, size=256, dtype=torch.bool, mf=channels_last        |         5.856 (+-0.178)          |                     |              5.824 (+-0.118)
      channels=1, size=520, dtype=torch.bool, mf=channels_last        |         12.030 (+-0.330)         |                     |              14.478 (+-0.554)
      channels=1, size=712, dtype=torch.bool, mf=channels_last        |         30.116 (+-0.639)         |                     |              31.163 (+-0.873)

      channels=1, size=256, dtype=torch.bool, mf=channels_first       |         5.804 (+-0.113)          |                     |              5.825 (+-0.102)
      channels=1, size=520, dtype=torch.bool, mf=channels_first       |         12.043 (+-0.363)         |                     |              14.240 (+-0.341)
      channels=1, size=712, dtype=torch.bool, mf=channels_first       |         30.001 (+-1.001)         |                     |              33.199 (+-0.430)

      channels=1, size=256, dtype=torch.complex64, mf=channels_last   |         29.941 (+-0.861)         |                     |              28.229 (+-0.904)
      channels=1, size=520, dtype=torch.complex64, mf=channels_last   |        173.244 (+-2.577)         |                     |             173.173 (+-2.260)
      channels=1, size=712, dtype=torch.complex64, mf=channels_last   |        323.548 (+-3.338)         |                     |             318.318 (+-2.764)

      channels=1, size=256, dtype=torch.complex64, mf=channels_first  |         29.001 (+-1.029)         |                     |              28.565 (+-2.074)
      channels=1, size=520, dtype=torch.complex64, mf=channels_first  |        173.078 (+-1.993)         |                     |             170.664 (+-1.722)
      channels=1, size=712, dtype=torch.complex64, mf=channels_first  |        324.782 (+-3.759)         |                     |             315.745 (+-2.600)

Times are in microseconds (us).

```

[Source](https://gist.github.com/vfdev-5/c2ca615b522aeb1c4636dc8d948fec74#file-20221209-105707-pr_vs_nightly-avx512-md)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89414
Approved by: https://github.com/peterbell10, https://github.com/lezcano, https://github.com/albanD
2023-01-20 16:18:01 +00:00
387357539f Log accuracy failure in more cases (#92645)
Fixes https://github.com/pytorch/torchdynamo/issues/1910

But not durably: it's easy to forget if you add more cases. I'd like
someone else to do that refactor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92645
Approved by: https://github.com/Chillee
2023-01-20 15:23:35 +00:00
64985123e4 Logcumsumexp for complex in CPU and CUDA (#90847)
Another PR towards solving #89205.
What's in this PR:

* The implementation of forward `logcumsumexp` for complex numbers in CPU & CUDA
* The tests on forward call of `logcumsumexp` for complex numbers
* The implementation of backward `logcumsumexp` for complex numbers

What's missing:

* The test on the backward gradient of `logcumsumexp` (it complains `RuntimeError: logcumsumexp does not support automatic differentiation for outputs with complex dtype.`, and I don't know how to solve the error or where to put the test for the backward computation). If possible, I'd like this to be done in this PR.

It's really tricky to handle the edge cases here (i.e. the ones involving `inf`), but I've tried my best to put in comments explaining the reasoning behind my decisions in this PR.
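
A hedged usage sketch of the forward call this adds (values are arbitrary):

```python
import torch

z = torch.tensor([0.5 + 1.0j, -2.0 + 0.3j, 1.5 - 0.7j], dtype=torch.complex64)

# With this PR, logcumsumexp accepts complex inputs on CPU and CUDA.
out = torch.logcumsumexp(z, dim=0)

# Naive reference (numerically unstable for large magnitudes, which is
# exactly what the edge-case handling mentioned above is about).
ref = torch.log(torch.cumsum(torch.exp(z), dim=0))
```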

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90847
Approved by: https://github.com/albanD
2023-01-20 15:10:50 +00:00
4386f317b9 Add meta kernel coverage for aten.unsafe_split, aten.unsafe_chunk (#92608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92608
Approved by: https://github.com/ngimel
2023-01-20 12:39:56 +00:00
274958ef43 [vmap] unsafe_split : batching rule and OpInfo (#92291)
Ref: https://github.com/pytorch/functorch/issues/1089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92291
Approved by: https://github.com/Chillee
2023-01-20 10:31:56 +00:00
f6acd95ae5 Fix performance smoke test script bug (#92660)
Fixes the file not found issue in https://github.com/pytorch/pytorch/actions/runs/3963775704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92660
Approved by: https://github.com/desertfire, https://github.com/huydhn
2023-01-20 06:46:13 +00:00
2a3954372a [Dynamo] Make torch.autograd.Function.forward support graph break and no re-compilation (#91295)
Fixes #91101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91295
Approved by: https://github.com/jansel, https://github.com/mlazos
2023-01-20 06:25:09 +00:00
119d5e425c [Inductor] decompose expm1 for CPP vec (#92289)
For the micro-benchmark op `aten.elu.default` in TIMM, performance is not good even with vectorization. `Elu` uses `expm1` as a sub-op. It turns out that Inductor invokes the Sleef `expm1` function while ATen decomposes it as `exp - 1`, and the former performs worse than the latter. This PR decomposes `expm1` for CPP vectorization to recover the performance.
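
A hedged sketch of the decomposition itself (illustrative only; the actual change registers this inside Inductor's CPP vectorization, and the function name here is hypothetical):

```python
import torch

def expm1_decomp(x: torch.Tensor) -> torch.Tensor:
    # expm1(x) -> exp(x) - 1, so the vectorized CPP backend uses its faster
    # exp path instead of the Sleef expm1 call.
    return torch.exp(x) - 1
```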

Performance data for eager vs. Inductor:

suite | improved_ratio_speedup | speedup_old | RSD(3) | speedup_new | RSD(3)
-- | -- | -- | -- | -- | --
timm | 114.38% | 0.803447768 | 8.39% | 1.722458 | 27.74%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92289
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-20 05:29:32 +00:00
38a4cb765b Torch package support in dynamo (#91821)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91821
Approved by: https://github.com/suo, https://github.com/malfet
2023-01-20 05:03:34 +00:00
773b513435 Add --timing flag, phase timing to @dynamo_timed (#92637)
Ex output:
```
 TIMING:
 entire_frame_compile:8.574629999999999
 backend_compile:5.26806
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92637
Approved by: https://github.com/ezyang
2023-01-20 05:01:21 +00:00
663bf4ba15 [vision hash update] update the pinned vision hash (#92270)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92270
Approved by: https://github.com/pytorchbot, https://github.com/malfet
2023-01-20 04:08:45 +00:00
1464db08b4 [quant][pt2e] Support setting qconfig by module_type (#92355)
Summary:
This PR supports the following feature for QConfigMapping:
```python
qconfig_mapping = QConfigMapping().set_object_type(torch.nn.Conv2d, qconfig)
backend_config = get_qnnpack_pt2e_backend_config()
m = prepare_pt2e(m, qconfig_mapping, example_inputs, backend_config)
```
which means users want all calls to `torch.nn.Conv2d` to be quantized with `qconfig`. Note that this is only verified for the case where the module is broken down into a single aten op right now, e.g. torch.nn.Conv2d becomes the torch.ops.aten.convolution op when traced through. We will need to support more complicated modules that are broken down into multiple operators later, e.g. MaxPool.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_qconfig_module_type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92355
Approved by: https://github.com/jcaip
2023-01-20 03:18:21 +00:00
620846c8b4 Remove reference in dynamo benchmark makefile to triton master branch (#92663)
Triton changed the name of the master branch to main. Dynamo dashboard will likely break without this fix.

Tested on a new conda environment locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92663
Approved by: https://github.com/yanboliang
2023-01-20 03:09:53 +00:00
e9bc82f54b Vectorize torch.exp2 on CPU and add complex support (#92115)
I see an 11x speedup in `exp2` on CPU from this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92115
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-01-20 02:48:04 +00:00
52e8af57a6 [3/N] Update ema_teacher_arch in the backward call (#92080)
Summary: adding support for updating ema_teacher_arch in C2 backend

Test Plan:
baseline
f397096610

EMA run
f397096864

Differential Revision: D41124891

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92080
Approved by: https://github.com/kit1980
2023-01-20 02:29:42 +00:00
f659452009 [FSDP][1/N] Split fully_shard unit tests (#92296)
This PR splits `test_fully_shard.py` into `fully_shard/test_fully_shard<...>.py`. This should help improve readability and avoid some future rebase conflicts.

The only other real change is resolving a `TODO` for using `run_subtests` in the model checkpointing unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92296
Approved by: https://github.com/mrshenli
2023-01-20 02:02:59 +00:00
59071ab1e7 [Executorch][Quantization][BE] Refactor Choose Qparams (#92592)
Summary: Should hopefully be a little faster. Definitely cleaner to not create an observer inside the op

Test Plan: ci

Differential Revision: D42154677

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92592
Approved by: https://github.com/jerryzh168
2023-01-20 01:36:47 +00:00
cf5495ac3a Add perf check for inductor smoke test (#92358)
Background: performance smoke test job has been setup for inductor https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-smoke-test.yml
I have used this job to identify that https://github.com/pytorch/pytorch/pull/91254 regressed performance from 1.194x to 1.156x. However, that was done by manual inspection.

To automatically flag similar regressions, we will add a reference value (which needs to be actively maintained) so that any speedup falling below the reference is treated as a regression.

Behind the scenes, two A100 instances from GCP run the perf check jobs for every push to upstream. So far these two instances give up to 1.204x and 1.197x, so we treat any output below 1.185x as suspicious.
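
A hedged sketch of what such a check can look like (the actual script, and how it obtains the measured speedup, may differ; the 1.185x reference is the value quoted above):

```python
EXPECTED_SPEEDUP = 1.185  # reference value; needs to be actively maintained

def check_perf(measured_speedup: float) -> None:
    # Fail the smoke test if the measured geomean speedup drops below the
    # manually maintained reference value.
    if measured_speedup < EXPECTED_SPEEDUP:
        raise RuntimeError(
            f"Perf smoke test regression: measured {measured_speedup:.3f}x, "
            f"expected at least {EXPECTED_SPEEDUP:.3f}x"
        )
```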

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92358
Approved by: https://github.com/ngimel, https://github.com/desertfire
2023-01-20 01:02:29 +00:00
493a6ced74 [fx] Throw error when symbolically tracing control flow ops (#92313)
Throws a better error when symbolically tracing control flow ops. Right now it throws an error when creating the function arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92313
Approved by: https://github.com/zhxchen17
2023-01-20 00:38:21 +00:00
4110900b22 let inductor generate broadcast when loading a single value (#92595)
For better perf with MLIR Triton.
Changes
```python
tmp32 = tl.load(seed3 + (0 + tl.zeros([XBLOCK, RBLOCK], tl.int32)), None)
```
to
```python
tmp32_load = tl.load(seed3+(0)); tmp32 = tl.broadcast_to(tmp32_load, [XBLOCK, RBLOCK])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92595
Approved by: https://github.com/Chillee
2023-01-20 00:05:01 +00:00
f0e3c4929b only copy meta if available (#92623)
Test Plan:
```
buck2 test mode/opt //torchmultimodal/tests:tests -- --exact 'torchmultimodal/tests:tests - test_albef.py::test_albef_image_embeddings_momentum'
```
now passes

Reviewed By: malfet

Differential Revision: D42608385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92623
Approved by: https://github.com/tugsbayasgalan
2023-01-19 23:39:53 +00:00
60bf851931 Revert "Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)"
This reverts commit 8383b5c488399f2ae295c7c0f993bdd353dfd75c.

Reverted https://github.com/pytorch/pytorch/pull/88078 on behalf of https://github.com/malfet due to This seems to have broke sm_86 testing, see https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=sm86%20%2F%20test%20(default%2C%203
2023-01-19 23:37:59 +00:00
550983e39d Revert "Move check_label ci to mergebot (#92309)"
This reverts commit 190f7803f5d90d027f331eaf48ef5fa63f14737a.

As it broke revert workflow, see https://github.com/pytorch/pytorch/actions/runs/3963235531/jobs/6790838677
2023-01-19 15:33:10 -08:00
190f7803f5 Move check_label ci to mergebot (#92309)
Fixes #88098

### What Changed
* Moved `check_label.py` logic into `trymerge.py`
* Refactored relevant unittests
* ~~Dropped~~ Refactored `check_label.py` ci job

### Tests
`python .github/scripts/test_trymerge.py`
`python .github/scripts/test_check_labels.py`
`make lint & lintrunner -a`

### Notes to reviewers
This PR replaces the [original PR](https://github.com/pytorch/pytorch/pull/92225) to workaround the sticky EasyCLA failure mark on its first commit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92309
Approved by: https://github.com/ZainRizvi
2023-01-19 22:31:32 +00:00
b33d9e2c87 Point to README.md#from-source instead of duplicate instructions in CONTRIBUTING.md#developing-pytorch (#91850)
**Idea:** [README.md#from-source](https://github.com/pytorch/pytorch/blob/master/README.md#from-source) should be the place that describes how I as a developer build from source.

Currently, `CONTRIBUTING.md` suggests an incomplete set of install instructions that predates those in `README.md`.

This PR tries to simplify and remove a dead end from the developer onboarding funnel by pointing to [README.md#from-source](https://github.com/pytorch/pytorch/blob/master/README.md#from-source).

### Details
Having not touched this codebase for years, I tried to build the repo for local development and run the unit tests. I tried to capitalise on the confusion by documenting it:
1. I go to [README.md#from-source](https://github.com/pytorch/pytorch/blob/master/README.md#from-source)
2. Since it doesn't explain how to run the unit tests, I follow [README.md#releases-and-contributing](https://github.com/pytorch/pytorch/blob/master/README.md#releases-and-contributing) to [CONTRIBUTING.md#developing-pytorch](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#developing-pytorch), which is written as if it's _the_ dev-environment setup instruction:
73e5379fab/CONTRIBUTING.md (L88-L90)
   But this section gives competing and incomplete install instructions that do not work for me. For example, it doesn't mention `ninja` or `pyaml`, which are required for `python setup.py develop`.
3. I go back to the original [README.md#from-source](https://github.com/pytorch/pytorch/blob/master/README.md#from-source) setup instructions, which (mostly) worked.
73e5379fab/README.md (L187)

#### TODO

- [x] verify that it does not break any link to other documentation
[skip ci]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91850
Approved by: https://github.com/ZainRizvi, https://github.com/seemethere
2023-01-19 22:14:28 +00:00
706aa51628 [dynamo] Support control flow map() operator. (#91939)
Fixes #ISSUE_NUMBER

We want to add support for control flow map() at dynamo level to unblock some internal model which will have to use map() operator in captured graph. Basically I replicate the pattern for implementing cond() op from https://github.com/pytorch/pytorch/pull/90286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91939
Approved by: https://github.com/ezyang
2023-01-19 22:03:01 +00:00
647b8f8e3e Add TORCH_CHECK_TENSOR_ALL (#89097)
`TORCH_CHECK_TENSOR_ALL(cond, ...)` is a wrapper around `TORCH_CHECK` which allows the condition argument to be a tensor, batched or unbatched. `cond` can be a boolean tensor of any size. If any element is False, or if `cond.numel() == 0`, then `TORCH_CHECK_TENSOR_ALL` raises an error.

Part of #72948
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89097
Approved by: https://github.com/zou3519
2023-01-19 21:04:09 +00:00
25e530083e [ci] Run test_decomp parallel (#92566)
Run test_decomp in parallel with itself, since it now takes 2+ hours on some architectures: https://docs.google.com/spreadsheets/d/1o0W4WjOYIyPSzBSl3lelvKcQyLOiv8pMijiGUDoPuBU/edit#gid=0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92566
Approved by: https://github.com/huydhn
2023-01-19 20:47:27 +00:00
0998ec1e27 Revert 61cdae0ce58bcbe048b143356fd9ded821225657 to fix CI (#92631)
61cdae0ce5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92631
Approved by: https://github.com/malfet
2023-01-19 19:57:05 +00:00
a20c678c72 Rename Makefile_dashboard to Makefile (#92584)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92584
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2023-01-19 16:28:37 +00:00
90024436e7 Do not specialize int/float with dynamic=True (#92570)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92570
Approved by: https://github.com/bdhirsh
2023-01-19 16:27:45 +00:00
0bc875ac1d [dtensor] disable gpu tests in op db first (#92611)
There seems to be some issue with the CUDA tests where our CI isn't
capturing those failures (probably because the CI environment lacks
4 GPUs). Disabling them first and debugging later.

see https://github.com/pytorch/pytorch/issues/92343
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92611
Approved by: https://github.com/XilunWu
2023-01-19 16:20:00 +00:00
a2b8e891f6 Fix/modernize dynamo docs (#92572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92572
Approved by: https://github.com/ezyang
2023-01-19 16:15:31 +00:00
ce43fc586f Register sccache epilogue before starting sccache (#92587)
Fixing the flaky XLA test job where sccache fails to start with a timeout error, for example:

* https://github.com/pytorch/pytorch/actions/runs/3953719143/jobs/6770489428
* https://github.com/pytorch/pytorch/actions/runs/3952860712/jobs/6769339620
* https://github.com/pytorch/pytorch/actions/runs/3946315315/jobs/6754126326

XLA test job actually builds XLA as part of the test ~~, so it needs sccache~~

* Register sccache epilogue before starting sccache, so that any errors when starting sccache can be printed
* Add `-e SKIP_SCCACHE_INITIALIZATION=1` to `_linux_test` workflow, this is the same flag used in `_linux_build` workflow. Quoted the reason from the build script:

> sccache --start-server seems to hang forever on self hosted runners for GHA so let's just go ahead and skip the --start-server altogether since it seems as though sccache still gets used even when the sscache server isn't started explicitly

* Also fix the code alignment in `.jenkins/pytorch/common-build.sh`
* We don't even use sccache in XLA test job, but there is an S3 cache used by bazel there (`XLA_CLANG_CACHE_S3_BUCKET_NAME=ossci-compiler-clang-cache-circleci-xla`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92587
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2023-01-19 16:14:31 +00:00
44e52ea514 Reenable mobilevit_s in CI, seems to pass (#92585)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92585
Approved by: https://github.com/Chillee
2023-01-19 15:24:45 +00:00
b6cfd62285 vmap support for torch.linalg.vander (#91749)
Adds vmap support for torch.linalg.vander in a similar manner to how view_as_complex is implemented.

#91700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91749
Approved by: https://github.com/lezcano
2023-01-19 14:49:54 +00:00
3ba5eae72a [optim][radam] fix eps discrepancy for foreach (#92551)
Will likely race with https://github.com/pytorch/pytorch/pull/92365

eps was not being used at all in the mta/foreach impl. There was also a discrepancy between the docs and the implementation: the implementation was doing sqrt(x) + eps while the docs said sqrt(x + eps).
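
As a tiny numerical illustration of why the two formulas differ (illustrative values only, not taken from the optimizer):
```python
import math

# With a very small second-moment value, sqrt(x) + eps and sqrt(x + eps)
# produce very different denominators.
x, eps = 1e-16, 1e-8
print(math.sqrt(x) + eps)  # 2e-08
print(math.sqrt(x + eps))  # ~1e-04
```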

I've fixed the docs + extended the current multi_tensor test case to capture this issue.

![image](https://user-images.githubusercontent.com/31798555/213300617-61cbb763-da2d-48e0-b3b6-0190594dd049.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92551
Approved by: https://github.com/albanD
2023-01-19 14:38:59 +00:00
97f34e367d Run CI in a new environment (#92378)
Needed to be able to install newer Python versions (Python 3.11 in this case), which do not have numerous packages that the default environment must have.

In addition, fix weird incursion of `conda-forge` by torch-deploy test.

Reincarnation of an old https://github.com/pytorch/pytorch/pull/66530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92378
Approved by: https://github.com/kit1980
2023-01-19 14:24:30 +00:00
ccbdf49582 [MPS] Fix index_select scalar input with multiple indices (#91064)
Support operations like this:

```
device="mps"
arr = torch.tensor(10, device=device)
indices = torch.tensor([0, 0], device=device)  # multiple indices
torch.index_select(arr, 0, indices)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91064
Approved by: https://github.com/kulinseth
2023-01-19 14:08:02 +00:00
827e22ec2d Revert "[vmap] unsafe_split : batching rule and OpInfo (#92291)"
This reverts commit 0510ae59b3168eb22422ee88b64419aeb0682782.

Reverted https://github.com/pytorch/pytorch/pull/92291 on behalf of https://github.com/kshitij12345 due to Broke trunk
2023-01-19 13:49:43 +00:00
a9f4462847 [primTorch] Remove prims.to_dtype (#92380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92380
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-01-19 12:07:47 +00:00
1906eaf22f [BE] Get rid of future (#92596)
PyTorch has been Python-3.X+ for ages, so it's a shame to still rely on `future.utils` even in a deprecated Caffe2 codebase

For the reference:
https://peps.python.org/pep-0469/#migrating-directly-to-python-3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92596
Approved by: https://github.com/kit1980, https://github.com/orionr
2023-01-19 08:46:50 +00:00
1bc60c6b31 [reland] Improve hooks ordering behavior (#92559)
This reverts commit e525f433e15de1f16966901604a8c4c662828a8a.

Original PR:  #85849
Fixes #ISSUE_NUMBER

In addition to reverting the revert, this PR:
- defines the virtual destructor of FunctionPreHook in the header. Why? Presumably the internal build imports the header from somewhere, but does not have function_hooks.cpp (where the virtual destructor was previously defined) in the same compilation unit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92559
Approved by: https://github.com/albanD
2023-01-19 08:17:32 +00:00
0510ae59b3 [vmap] unsafe_split : batching rule and OpInfo (#92291)
Ref: https://github.com/pytorch/functorch/issues/1089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92291
Approved by: https://github.com/Chillee
2023-01-19 06:34:45 +00:00
0a404fdd82 Follow up comments of PR #91531 (#92359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92359
Approved by: https://github.com/awgu
2023-01-19 06:12:01 +00:00
2066523508 Fix ShardedTensorMetadata.tensor_properties for Python 3.11 (#91795)
The `tensor_properties` field of the `ShardedTensorMetadata` dataclass is a reference to a `TensorProperties` object. However, the field is set to `field(default=TensorProperties())` instead of `field(default_factory=TensorProperties)`. This causes an error when using Python 3.11 or later:

```python
ValueError: mutable default <class 'torch.distributed._shard.sharded_tensor.metadata.TensorProperties'> for field tensor_properties is not allowed: use default_factory
```

This change in dataclass behavior was introduced in [bpo-44674: Use unhashability as a proxy for mutability for default dataclass __init__ arguments](https://github.com/python/cpython/pull/29867).

The current use of `default` instead of `default_factory` also means that all `ShardedTensorMetadata` objects created without specifying `tensor_properties` will share the same `TensorProperties` object.
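
A minimal sketch of the `default_factory` pattern described above (simplified stand-in classes, not the actual torch.distributed definitions):
```python
from dataclasses import dataclass, field

@dataclass
class TensorProperties:  # simplified stand-in
    requires_grad: bool = False

@dataclass
class ShardedTensorMetadata:  # simplified stand-in
    # default_factory creates a fresh TensorProperties per instance and is
    # accepted by Python 3.11's stricter mutable-default check.
    tensor_properties: TensorProperties = field(default_factory=TensorProperties)
```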

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91795
Approved by: https://github.com/fduwjj
2023-01-19 04:21:05 +00:00
06d54b4061 [threaded_pg] fix the comments of MultiThreadTestCase (#92373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92373
Approved by: https://github.com/wz337
2023-01-19 03:42:54 +00:00
997de44100 [dtensor] delete lagging op db and update op db tests (#92290)
We are now in pytorch core so don't need lagging op db anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92290
Approved by: https://github.com/XilunWu
2023-01-19 03:42:54 +00:00
8383b5c488 Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)
As per title.

Additionally we also introduce support for:
- Rectangular block sizes which are powers of 2 and at least 16 (triton's `dot` limitation).
- Batch support with broadcasting for either of the arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88078
Approved by: https://github.com/cpuhrsch
2023-01-19 03:14:54 +00:00
4f4b62e4a2 some fixes to get symbolic shapes working through inductor (#92320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92320
Approved by: https://github.com/ezyang
2023-01-19 03:09:02 +00:00
cac217c80a Fix key error formatting and move exc code to exc.py (#92593)
Fixes https://github.com/pytorch/torchdynamo/issues/1953 and moves exception formatting code from convert_frame.py to exc.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92593
Approved by: https://github.com/ezyang
2023-01-19 02:54:00 +00:00
ba6820574c Make run_dynamic_ci_skips_only.sh more generic (#92581)
Since the dynamic aot_eager CI skip list is very short now,
I find that I need to run this script with other flags.
Make it easier to change the flags.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92581
Approved by: https://github.com/bdhirsh
2023-01-19 02:24:13 +00:00
2a7a859d00 [CI] move parallelnative to periodic (experimental) (#92567)
This PR is more of an RFC asking whether we intend to maintain parallelnative in the long term or to allow it to become community-supported.

If we want to maintain parallelnative, then let's close this PR.
If we do not, then we should remove it from trunk workflows into periodic (or just remove entirely).

Why shouldn't we just allow it to continue on CI regardless?
It adds friction to development! If we do support it, I think the friction is good--it prevents users from breaking what we support! But if not, then it is just another job users have to wait for before landing or another vector for flakiness to arise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92567
Approved by: https://github.com/malfet
2023-01-19 01:46:48 +00:00
28cb3141e8 Remove temporary export skip hack (#92160)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92160
Approved by: https://github.com/SherlockNoMad, https://github.com/ezyang
2023-01-19 01:19:52 +00:00
ef2586422c fix promote_constants with ExpandView (#92403)
Fixes #92324
OpInfo, even with all samples, doesn't have this input ;-)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92403
Approved by: https://github.com/desertfire, https://github.com/eellison
2023-01-19 01:02:14 +00:00
bdbd3ed312 When nopython=True, Dynamo can't allow graph breaks. (#90970)
I count the number of sub-graphs (for tiny-GPT2 in huggingface) by
```
    class GraphCaptureCompiler:
        def __init__(self):
            self.captured_graphs = []
        def compile(self, gm, example_inputs):
            self.captured_graphs.append(gm)
            return gm
    compiler = GraphCaptureCompiler()
    torch._dynamo.optimize(compiler, nopython=True)(Wrapper(fn))(*args)
```

Although `len(compiler.captured_graphs)` is 2, no error was thrown during the compilation. This observation conflicts with `nopython=True`. After some digging, I found that a check was missing before making a graph break. This PR adds it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90970
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/thiagocrepaldi
2023-01-19 00:59:33 +00:00
eb39d990ce Guard on at::Tensor device index (#91779)
Fixes #91777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91779
Approved by: https://github.com/ngimel
2023-01-19 00:58:04 +00:00
388d79ccda [CI] valgrind 3.16.1->3.20.0 (#92552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92552
Approved by: https://github.com/clee2000, https://github.com/huydhn
2023-01-19 00:42:50 +00:00
bb7790781f Make aot_autograd explicitly error when double backward (#92348)
Mitigates https://github.com/pytorch/pytorch/issues/91469

Changes:
- ~once_differentiable can now be parametrized to print a custom error message~
- instead of once_differentiable, we do the backward inside another custom Function, which makes sure the graph is connected, but also makes sure to error on double backward
- we now explicitly error when doing double backward with torch.compile + aot_autograd instead of being silently incorrect. ~The niceness of the error message can vary depending on whether your grad_outputs are passed, or whether you are doing `.grad()` or `.backward()`.~

Unchanged:
- doing backward inside compiled function is still allowed. It currently causes a graph break and is equivalent to doing backward outside the compiled function. It might be nice to disallow this explicitly as well, but that can be done in a follow up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92348
Approved by: https://github.com/albanD
2023-01-19 00:13:29 +00:00
62eeb7d60f [PTD][Oncall] Sync Reorder structure for compatibility with linux-6.0 and gloo submodule for PT (#92568)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92568
Approved by: https://github.com/kumpera
2023-01-19 00:01:59 +00:00
34353a402e [mergebot] Flatten workflows into jobs, fix bugs (#92097)
* Flatten the workflows into just jobs in order to give more specific links (link to the specific job that failed instead of just `pull`); this should make it easier to implement bypassing certain failures in the future
* Try/catch MandatoryChecksMissingError from find_matching_merge_rule; this should fix the error where merge loops instead of raising a runtime error when a trunk job fails
* Remove usage of the on_green and mandatory_only flags just in case; on_green and force are the only two behaviors we currently use
* Fail if a ghstack PR has a non-ghstack change; tested locally with #92177 but unsure how to write tests because it requires use of repo._run_git
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92097
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2023-01-18 23:38:16 +00:00
8b861544f9 Remove lowering and decompositions of zero_, zero, zeros_like... in favour of their references (#92071)
The generated triton code is identical.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92071
Approved by: https://github.com/ngimel
2023-01-18 23:22:36 +00:00
b5c3b4a36c Fix dynamo.export(aten=True) for condition op (#92361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92361
Approved by: https://github.com/voznesenskym
2023-01-18 23:17:22 +00:00
c5cb46ecdb [optim][asgd] group tensors in foreach to maximize perf (#92364)
faster foreach
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92364
Approved by: https://github.com/albanD
2023-01-18 23:09:55 +00:00
5fdddbbfe8 Fix checking of current mode in PyOperator dispatch (#92357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92357
Approved by: https://github.com/voznesenskym
2023-01-18 23:08:36 +00:00
f8a07ca422 Reland 2nd attempt "Add heirachical module names to torchFX graph.node" (#91721)
Fixes #87659

Reland of PR #87742 and PR #90205

PR #90205 was reverted due to BC issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91721
Approved by: https://github.com/jerryzh168
2023-01-18 23:00:36 +00:00
76cb2d0ede fix incorrect _embedding_bag meta (#92549)
Fixes https://github.com/pytorch/pytorch/issues/92286. See the issue for diagnosis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92549
Approved by: https://github.com/albanD, https://github.com/eellison
2023-01-18 22:50:31 +00:00
5aa3740d63 Change references to pytorch/functorch to the torch.func APIs (#92543)
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92543
Approved by: https://github.com/albanD
2023-01-18 22:50:17 +00:00
fbafcecf8d [optim][radam] group tensors in foreach to maximize perf (#92365)
Also noticed that eps is not being used nor tested at all for the mta impl of RAdam.

Will fix in a followup PR before turning foreach to default!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92365
Approved by: https://github.com/albanD
2023-01-18 22:32:27 +00:00
de459bdfaa [optim][rmsprop] group tensors in foreach to maximize perf (#92369)
Test plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92369
Approved by: https://github.com/albanD
2023-01-18 22:28:52 +00:00
07800c52af [optim][adam] group tensors in foreach to maximize perf (#92349)
same idea as https://github.com/pytorch/pytorch/pull/92338
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92349
Approved by: https://github.com/albanD
2023-01-18 22:05:42 +00:00
e2433e420c [optim][adamax] group tensors in foreach to maximize perf (#92363)
make foreach faster
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92363
Approved by: https://github.com/albanD
2023-01-18 21:32:28 +00:00
92d412d684 [FSDP][optim_state_dict][11/N] Let FSDP support NamedOptimizer/KeyedOptimizer when use_orig_params is False (#92184)
The current design of FSDP only supports NamedOptimizer/KeyedOptimizer when use_orig_params is True; this PR adds the support even if use_orig_params is False. This PR also adds support for user-defined optimizer states -- states that are not associated with any particular parameter.

Differential Revision: [D42497416](https://our.internmc.facebook.com/intern/diff/D42497416/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92184
Approved by: https://github.com/colin2328, https://github.com/rohan-varma
2023-01-18 21:24:30 +00:00
befe3b68de Revert "Clean up C++14 code (#92216)"
This reverts commit dfbdfb276eb5b0492b39036f1c49c196b826587f.

Reverted https://github.com/pytorch/pytorch/pull/92216 on behalf of https://github.com/atalman due to fails internal build
2023-01-18 21:24:23 +00:00
4450424b8e Reduce some ambiguity in Tensor (#92266)
Summary:
A lot of other libraries have their own `xyz::Tensor` data structure. In some rare cases, when they interoperate with torch, there will be a compilation error such as
```
torch/csrc/api/include/torch/data/samplers/random.h(49): error: "Tensor" is ambiguous
```
Making the namespace explicit for some of these `Tensor` references will resolve this.

Test Plan: CI

Differential Revision: D42538675

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92266
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-01-18 21:09:35 +00:00
8770a7ed6f Decompose more inplace ops (#90967)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90967
Approved by: https://github.com/anijain2305
2023-01-18 21:07:47 +00:00
0d65a10a2d [inductor] run CPU tests when CUDA is available (#92220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92220
Approved by: https://github.com/ezyang
2023-01-18 21:05:49 +00:00
dc1c0f78e2 Remove dead TORCHDYNAMO_DYNAMIC_SHAPES print (#92547)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92547
Approved by: https://github.com/albanD
2023-01-18 20:57:52 +00:00
3481ad3365 Make log parser work on inference runs too (#92546)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92546
Approved by: https://github.com/albanD
2023-01-18 20:57:52 +00:00
6420fecdc4 Introduce sym_min and sym_max (#92107)
It turns out our old max/min implementation didn't do anything, because `__max__` and `__min__` are not actually magic methods in Python. So I give 'em the `sym_` treatment, similar to the other non-overrideable builtins.

NB: I would like to use `sym_max` when computing contiguous strides but this appears to make `python test/functorch/test_aotdispatch.py -v -k test_aot_autograd_symbolic_exhaustive_nn_functional_max_pool2d_cpu_float32` run extremely slowly. Needs investigating.
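
A small usage sketch (hedged; assumes the helpers are exposed as `torch.sym_max`/`torch.sym_min` like the other `sym_` builtins):
```python
import torch

# On plain Python ints these behave like the builtins...
assert torch.sym_max(3, 5) == 5
assert torch.sym_min(3, 5) == 3
# ...while on SymInt inputs (under dynamic shapes) they build a symbolic
# max/min expression instead of forcing a concrete value.
```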

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92107
Approved by: https://github.com/albanD, https://github.com/voznesenskym, https://github.com/Skylion007
2023-01-18 20:57:27 +00:00
b26efd0dd2 Run bazel jobs on 4xlarge (#92340)
After the previous fix to limit the CPU and memory used by Bazel, I see one case today where the runner runs out of memory  in a "proper" way with exit code 137 0c8f4b5893.  So, the memory usage must be close to limit of an 2xlarge instance.  It makes sense to preemptively use 4xlarge now (like XLA)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92340
Approved by: https://github.com/clee2000
2023-01-18 20:14:56 +00:00
bb34461f00 [optim][rprop] group tensors in foreach to maximize perf (#92372)
This one had a few more for loops than I was expecting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92372
Approved by: https://github.com/albanD
2023-01-18 20:03:11 +00:00
b92a7afed9 Reclassify some dynamic aot_eager failures as static failures (#92376)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92376
Approved by: https://github.com/Chillee
2023-01-18 19:27:11 +00:00
ae4ec7de1e Fix and update type hints for make_functional.py (#91579)
Changes in details:

- Fix and update some out-of-date type hints in `_functorch/make_functional.py`.
- ~Explicitly use `OrderedDict` for order-sensitive mappings.~

	In `create_names_map()`, `_swap_state()`, and `FunctionalModuleWithBuffers.__init__()`, the unordered `dict` was used. The key order should be preserved for `dict.items()` while it is required to `zip` with a tuple of `params`/`buffers`. Although since Python 3.6, the built-in dictionary is insertion ordered ([PEP 468](https://peps.python.org/pep-0468)). Explicit is better than implicit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91579
Approved by: https://github.com/zou3519
2023-01-18 19:16:32 +00:00
bcd9f189f4 Remove setup-python on Windows CI and use Conda instead (#92183)
It has been bugging me for a while that Windows CI still has the `setup-python` step in its setup. The python setup here is not used by the build and test steps at all, but is there to provide a python3 interpreter for `actions/get-workflow-job-id` and `actions/filter-test-configs`. ~~As these 2 actions are generic and should be smart enough to check for conda setup and use that instead of system python.~~

Having `setup-python` contributes a bit to network flakiness on Windows, where it fails to download things from GitHub. Example failures:

* https://github.com/pytorch/pytorch/actions/runs/3913257969/jobs/6690485582
* https://github.com/pytorch/pytorch/actions/runs/3930859163/jobs/6722743854
* https://github.com/pytorch/pytorch/actions/runs/3918415654/jobs/6699239557
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92183
Approved by: https://github.com/ZainRizvi
2023-01-18 18:07:40 +00:00
65056845d3 Update clang-tidy to 15.0.6 (#92195)
Based on results from https://github.com/pytorch/test-infra/pull/1382

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92195
Approved by: https://github.com/Skylion007
2023-01-18 17:00:13 +00:00
74bc894ede [BE] Delete unused args during docker build (#92396)
Such as `TRAVIS_DL_URL_PREFIX`, `JENKINS_UID`/`JENKINS_GID` and `EC2`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92396
Approved by: https://github.com/huydhn, https://github.com/janeyx99
2023-01-18 15:41:00 +00:00
e525f433e1 Revert "Improve hooks ordering behavior (#85849)"
This reverts commit 049838f2496bd1d29e4e8292714acb0042cc706e.

Reverted https://github.com/pytorch/pytorch/pull/85849 on behalf of https://github.com/albanD due to fails internal build
2023-01-18 15:27:22 +00:00
7f0d321d2e Add missing gc untrack for cpp autograd Nodes (#92351)
Fixes https://github.com/pytorch/pytorch/issues/91161. The assertion after the warning seems to be linked to the fact that we didn't untrack this properly.
In 3.11 they added a warning when this is not called properly before tp_free.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92351
Approved by: https://github.com/ezyang
2023-01-18 15:23:48 +00:00
0070c546b5 [BE][optim] abstract out docstrings, add differentiable docs (#92336)
1. abstract out common doc strings --> I'm sure there are more, but let this be a first step.
2. Add differentiable docs to those who are actually differentiable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92336
Approved by: https://github.com/albanD
2023-01-18 15:09:28 +00:00
0035340488 Allow DDP to handle custom dataclass forward outputs (#92334)
Differential Revision: [D42554973](https://our.internmc.facebook.com/intern/diff/D42554973)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92334
Approved by: https://github.com/zhaojuanmao
2023-01-18 14:51:37 +00:00
5d01277fea Deprecate torch.nn.utils.stateless.functional_call (#92280)
This PR:
- Updates the docs to say it is deprecated
- Raises a UserWarning
- Changes most of the callsites inside PyTorch to use
torch.func.functional_call, minus the test_stateless testing.

The motivation behind this is that we can now align behind a single
functional_call API in PyTorch.
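
A short sketch of the replacement API (hedged; the module and shapes are just an example):
```python
import torch
from torch.func import functional_call

module = torch.nn.Linear(3, 3)
state = {**dict(module.named_parameters()), **dict(module.named_buffers())}
x = torch.randn(2, 3)
# Runs the module's forward with the given parameter/buffer dict, without
# mutating the module itself.
out = functional_call(module, state, (x,))
```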

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92280
Approved by: https://github.com/albanD
2023-01-18 14:26:25 +00:00
a8a44a1aa2 Add deprecation messages for functorch.* function transforms (#92279)
This PR:
- adds deprecation warnings when calling the functorch APIs
- adds documentation saying that those APIs are deprecated

It does this by creating thin wrappers around the original APIs that (1)
raise deprecation warnings and (2) have an additional line in their
documentation that they are deprecated.

NB:
- Python suppresses DeprecationWarning, so we use UserWarning instead.

Test Plan:
- New tests
- the functorch.* APIs are still tested for correctness because that's
what test/functorch/* use (as opposed to directly calling the
torch.func.* APIs)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92279
Approved by: https://github.com/albanD, https://github.com/soulitzer
2023-01-18 14:26:25 +00:00
21d2bd782b stack_module_state should return unrelated parameters (#92278)
`torch.func.stack_module_state` is our replacement for
`functorch.combine_state_for_ensemble`. The most common usage for
combine_state_for_ensemble is to
- create stacked parameters and buffers
- use vmap to run the forward pass
- use regular PyTorch autograd to run the backward pass (e.g.,
Tensor.backward)
- optimize directly over the stacked parameters (this is more performant
than optimizing over the unstacked parameters).

Right now, stack_module_state returns stacked parameters that cannot be
optimized directly (only leaf tensors can have a .grad field); this PR
fixes that by turning the stacked parameters back into leaf tensors.
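
A hedged sketch of the ensembling pattern described above (the models are illustrative):
```python
import torch
from torch.func import stack_module_state

models = [torch.nn.Linear(4, 2) for _ in range(3)]
params, buffers = stack_module_state(models)
# After this change the stacked parameters are leaf tensors, so they can hold
# a .grad field and be handed directly to an optimizer.
assert all(p.is_leaf for p in params.values())
opt = torch.optim.SGD(list(params.values()), lr=0.1)
```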

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92278
Approved by: https://github.com/soulitzer
2023-01-18 14:26:22 +00:00
3aa6cec18c [dynamo] exclude reset_rng_state when measure timing (#92237)
Fixes inductor performance regression on CPU: https://github.com/pytorch/torchdynamo/issues/2027, https://github.com/pytorch/torchdynamo/issues/2028 and https://github.com/pytorch/torchdynamo/issues/2029.
The details are explained here: https://github.com/pytorch/torchdynamo/issues/2028#issuecomment-1381496678.

### Performance

- Model: lennard_jones
- Machine: IceLake (32 cores per socket)
- Configuration: single instance, 32 cores per instance
- jemalloc and iomp enabled

```bash
python benchmarks/dynamo/torchbench.py  --inductor-settings --inductor --performance --float32 -dcpu -n5000  --no-skip --dashboard --only=lennard_jones --quiet
```

Time before regression | Time after regression | Time with this PR
-- | -- | --
0.00020483799744397402 | 0.0002818034990923479 | 0.00020241099991835654

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92237
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-18 13:17:28 +00:00
f0b592dae7 Make masked_fill reference traceable (#90972)
As the comment states, `item()` cannot be used since you can't trace through a
scalar.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90972
Approved by: https://github.com/ngimel
2023-01-18 10:54:42 +00:00
368c737603 [PT-D][5/N] Enable add_param_group for named optimizer (#91928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91928
Approved by: https://github.com/rohan-varma
2023-01-18 10:53:31 +00:00
61a7618f3c [Quant][Eager] Copy MHA's batch_first attribute in prepare() (#91680)
**Summary**
Fixes #91571
MHA's batch_first attribute is not copied after `torch.quantization.prepare()`. Now we copy MHA's batch_first attribute in torch/ao/nn/quantizable/modules/activation.py: `MultiheadAttention.from_float()`.

**Test plan**
python test/test_quantization.py -k test_mha_batch_first_attr_is_copied_in_prepare

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91680
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-01-18 10:49:05 +00:00
206f4e47bb Replace exp(x) - 1 with expm1(x) (#92154)
This offers improved precision near zero where `exp(x)` is `1 + O(x)` and doing
`(1 + O(x)) - 1` will truncate anything below the float epsilon to zero.
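
A quick illustration of the precision difference (toy value, for illustration only):
```python
import torch

x = torch.tensor(1e-10, dtype=torch.float32)
print(torch.exp(x) - 1)  # tensor(0.) -- the tiny value is lost below eps
print(torch.expm1(x))    # tensor(1.0000e-10)
```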

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92154
Approved by: https://github.com/lezcano
2023-01-18 10:43:57 +00:00
4058dedf21 Replace log(1 + x) with log1p(x) (#92114)
`log1p` offers better precision near zero since `(1 + x) - 1` truncates any
values less than the float epsilon to zero. For `soft_margin_loss` this also
requires one fewer kernel invocation which for numel=1e7 gives me a 1.2x speedup
on CUDA and a 1.1x speedup on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92114
Approved by: https://github.com/ngimel, https://github.com/lezcano
2023-01-18 10:43:56 +00:00
5a2ae8805c [Quant] onednn backend switch to ideep new api without affacting performance (#91056)
> Reopen of https://github.com/pytorch/pytorch/pull/90354

**Summary**
Onednn quantization backend switch to new API in `third_party/ideep`.
- `struct forward_params` for conv/deconv are changed. Modify primitive cache accordingly.
- Use new versions of `prepare` and `compute` API. Fp32 and int8 paths separated. The old ones will be deprecated.
- Now `ideep::tensor::reorder_if_differ_in` supports block-to-block reorder. Use it instead of defining a util function `onednn_utils::try_reorder`.
- For new API of transposed convolution, we can use a flag to keep weight desc align with oneDNN thus needless to transpose it explicitly in PyTorch.
- Use `is_channels_last` flag to specify layout of src/dst when querying expected weight desc.

It won't impact correctness. Performance should be unaffected or slightly better.
FBGEMM and QNNPACK backends are not affected.

Performance results are given below.
1. End-to-end performance of static quantized models (from torchvision)
(throughput: fps, higher is better)
![image](https://user-images.githubusercontent.com/12522207/206105879-45c59996-9804-4531-aa1f-dc962e6db5ab.png)

2. Op benchmark of dynamic quantized linear
(Latency: ms, lower is better)
![image](https://user-images.githubusercontent.com/12522207/206124949-77352991-0fda-4285-a484-e20a5797262b.png)

Test method & env:
- Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
- Run multi-instances on a single node. Use one core for each instance.
- Use Jemalloc and Intel OpenMP

**Test plan**
python test/test_quantization.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91056
Approved by: https://github.com/jgong5
2023-01-18 09:53:34 +00:00
fb50a4b4ce [Inductor] added aten.exponential_ decomp (#91673)
Fixes #91276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91673
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano
2023-01-18 09:19:35 +00:00
4a4520e74b Retire unsafe sparse tensor constructors in Python API (#91331)
This PR removes sparse tensor constructor functions `torch._sparse_coo/csr/csc/bsr/bsc/compressed_tensor_unsafe(...)` as unneeded. The equivalent functionality is provided via `torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor(..., check_invariants=False)`
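
A small sketch of the replacement call (hedged; the indices and values are just an example):
```python
import torch

i = torch.tensor([[0, 1, 1], [2, 0, 2]])
v = torch.tensor([3.0, 4.0, 5.0])
# check_invariants=False skips the invariant validation, matching what the
# removed _unsafe constructors used to do.
t = torch.sparse_coo_tensor(i, v, (2, 3), check_invariants=False)
```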

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91331
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-01-18 08:55:22 +00:00
cyy
dfbdfb276e Clean up C++14 code (#92216)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92216
Approved by: https://github.com/ezyang
2023-01-18 08:14:54 +00:00
c55f6973e4 [dtensor][3/N] move OpSchema and types to a separate file (#90732)
This PR moves OpSchema and types to a separate file to better resolve
a circular dependency; this is part of a refactor of the dispatching
logic to enable more complicated features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90732
Approved by: https://github.com/XilunWu
2023-01-18 07:16:23 +00:00
dc95ef25e5 [dtensor][2/N] add __repr__ to placements (#91785)
This PR added `__repr__` to all placement types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91785
Approved by: https://github.com/XilunWu
2023-01-18 07:16:23 +00:00
a1186d6af9 [dtensor][1/N] add __hash__ to device_mesh and dtensor_spec (#90731)
This PR adds `__hash__` to device_mesh and dtensor_spec to allow
things like dict indexing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90731
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-01-18 07:16:21 +00:00
bc9af74c99 Clear references to user tensors after compilation is finished (#92353)
Fixes https://github.com/pytorch/torchdynamo/issues/2033
and https://github.com/pytorch/torchdynamo/issues/2005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92353
Approved by: https://github.com/eellison
2023-01-18 06:43:30 +00:00
387ca598a1 [nn] full_backward{_pre}_hook: warning for Module returning dict, list, etc (#87547)
Fixes https://github.com/pytorch/pytorch/issues/87540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87547
Approved by: https://github.com/albanD
2023-01-18 06:28:00 +00:00
3868eeb75f fix biasadd OMP perf issue for the packed MKL SGEMM (#92300)
Currently the bias-add of the packed MKL SGEMM is executed using an OpenMP macro, which leads to a performance issue when the SGEMM size is very small (e.g., M = 1, K = 80, N = 256) and many threads are used.
The reason is that in such a case `num_task < num_thread` and the per-task cost is tiny (e.g., ~1-2 cycles for a memcpy), so the thread synchronization cost becomes very large. Thus it is better to use `at::parallel_for` to run on the main thread directly.
Packed MKL SGEMM (1x80x256) | OpenMP biasadd | `at::parallel_for` biasadd
-- | -- | --
Latency | 2000 us | 21 us

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92300
Approved by: https://github.com/XiaobingSuper, https://github.com/jgong5
2023-01-18 06:14:11 +00:00
bb11e072ae Squash and merge linalg meta kernels (#92335)
Squashed changes from https://github.com/pytorch/pytorch/pull/92021 and https://github.com/pytorch/pytorch/pull/92020 and https://github.com/pytorch/pytorch/pull/92019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92335
Approved by: https://github.com/avikchaudhuri
2023-01-18 05:55:52 +00:00
0d4bbd1996 [Lint] Add FSDP/composable API files to ufmt include (#90873)
This PR adds FSDP and composable API files to `.lintrunner.toml` so that (1) lintrunner enforces that those files are formatted and (2) `lintrunner f` formats those files for you.

There are two requirements here (see https://github.com/pytorch/pytorch/wiki/lintrunner for details):
1. Install lintrunner:
```
pip install lintrunner
lintrunner init
```
2. `lintrunner f` before you finalize your PR, which would now be enforced by CI after this PR.

The code changes in this PR outside of `.lintrunner.toml` are the result of `lintrunner f`.

---

I only plan to land this PR if all of the composable API developers agree that this is something that makes sense and is not too intrusive to the workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90873
Approved by: https://github.com/yhcharles, https://github.com/mrshenli, https://github.com/rohan-varma
2023-01-18 05:33:34 +00:00
9b173b87b2 Refactor away leftover import indirection (#92188)
These indirect ways of importing are a leftover from when we wanted to support both `import torchdynamo` and `import torch._dynamo`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92188
Approved by: https://github.com/desertfire
2023-01-18 04:53:05 +00:00
a414b7f367 Make clone-deps checkout correct Triton hash (#92345)
Fixes https://github.com/pytorch/pytorch/issues/92326

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92345
Approved by: https://github.com/albanD
2023-01-18 04:46:51 +00:00
6fa86d7402 Add @chillee to codeowners for functorch tests (#92337)
^
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92337
Approved by: https://github.com/zou3519
2023-01-18 04:44:24 +00:00
94a7c01159 Enable oneDNN implementation in LSTM op (#91158)
### Description
This PR enables the oneDNN implementation in the LSTM op to improve its performance. Both FP32 and BF16 are supported.

### Performance improvement
On CPX 28C, with iomp and jemalloc enabled.
We choose 8 LSTM input configurations (covering input_size, hidden_size, num_layers, bidirectional, bias, batch_first, dropout, batch_size, and seq_len); the final configuration is a real input from train-clean-100 in the LibriSpeech dataset. The performance improvements are shown in the following figures. We can see that the LSTM with the oneDNN implementation performs better than the original.

In single socket:
![image](https://user-images.githubusercontent.com/61222868/211182994-833debec-518a-4b35-8504-6b0fadb17930.png)

![image](https://user-images.githubusercontent.com/61222868/211183012-31e1253f-2c60-4c92-a656-c239a971b453.png)

In single core:
![image](https://user-images.githubusercontent.com/61222868/211183017-186e5d47-cb9a-4c1e-914f-fa718e769f1c.png)

![image](https://user-images.githubusercontent.com/61222868/211183022-53266857-5a9e-4a95-b300-33fa34811d08.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91158
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-01-18 04:41:18 +00:00
a41f00ed70 [optim][sgd] group tensors in foreach to maximize perf (#92338)
Make foreach faster for SGD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92338
Approved by: https://github.com/albanD
2023-01-18 04:02:41 +00:00
98b78aa11c [autograd.Function] setup_context always appears on the Function (#92312)
Previously, we used the existence of setup_context to decide whether
forward should take a ctx object or not.

To be consistent with all the other staticmethods (which always exist on the
autograd.Function), this PR changes it so that whether forward takes a
ctx object depends on whether the user overrides setup_context.

Fixes https://github.com/pytorch/pytorch/issues/91451
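
A hedged sketch of the resulting style (the names are illustrative):
```python
import torch

class Cube(torch.autograd.Function):
    @staticmethod
    def forward(x):
        # no ctx argument here, because setup_context below is overridden
        return x ** 3

    @staticmethod
    def setup_context(ctx, inputs, output):
        (x,) = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 3 * x ** 2 * grad_out

y = Cube.apply(torch.randn(3, requires_grad=True))
```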

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92312
Approved by: https://github.com/albanD, https://github.com/soulitzer
2023-01-18 02:55:42 +00:00
00fe63d1d8 fx Graph should copy meta on deepcopy (#92062)
Summary:
fx Graph should copy meta on deepcopy

Test Plan:
Unit test

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92062
Approved by: https://github.com/zhxchen17
2023-01-18 02:49:14 +00:00
60fe2f4420 Revert "Torch package support in dynamo (#91821)"
This reverts commit 3726d232191088e8e7a9c1a2ab3244cdd9250bf2.

Reverted https://github.com/pytorch/pytorch/pull/91821 on behalf of https://github.com/huydhn due to The change causes flakiness on trunk. See https://github.com/pytorch/pytorch/issues/92196#issuecomment-1386368909 for more details
2023-01-18 02:17:25 +00:00
cf5a40c2b4 Only warn about fallbacks once per graph (#92211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92211
Approved by: https://github.com/eellison
2023-01-18 01:44:43 +00:00
30f2026863 [inductor] Promote half-precision CPU constants to float (#91224)
Currently `aten.where` can fail with the following C++ compiler error:
```
error: operands to '?:' have different types 'c10::Half' and 'float'
```

This happens because `ops.load` is overridden to cast Half inputs to float, but
`ops.constant` will load a Half without promoting to float.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91224
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/ngimel
2023-01-18 01:04:36 +00:00
764f79f680 [Microbenchmark] microbench fix for triton template (#92282)
Fixes microbench bug due to triton template https://github.com/pytorch/pytorch/pull/91575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92282
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-01-18 00:58:00 +00:00
88366a9075 Document hooks ordering behavior in the autograd note (#91667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91667
Approved by: https://github.com/albanD
2023-01-18 00:20:13 +00:00
388b245d54 Expose autograd.graph.Node as an abstract base class (#91475)
This PR:
- registers all of the codegened Nodes to the torch._C._functions module, this is where special nodes like AccumulateGrad are already registered.
- creates a autograd.graph.Node abstract base class that all of the newly registered nodes subclass from. We make the subclassing happen by implementing the ``__subclasshook__`` method
- enables static type checking to work and also enables Sphinx to generate documentation for the Node and its methods
- handles both the custom Function and codegened cases
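
A tiny sketch of what this enables (hedged):
```python
import torch
from torch.autograd.graph import Node

x = torch.randn(3, requires_grad=True)
y = (x * 2).sum()
# Codegened nodes such as SumBackward0 now register as subclasses of the
# abstract Node base class via __subclasshook__.
assert isinstance(y.grad_fn, Node)
```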

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91475
Approved by: https://github.com/albanD
2023-01-18 00:20:13 +00:00
0157e2ef4e [optim][adamw] default to foreach when CUDA + differentiable=False (#92306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92306
Approved by: https://github.com/albanD
2023-01-18 00:13:50 +00:00
fcde6dbbac [onnx] Add mse_loss symbolic (#90717)
Adds support for mse_loss operator
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90717
Approved by: https://github.com/BowenBao, https://github.com/titaiwangms, https://github.com/abock
2023-01-18 00:04:59 +00:00
40d6f2a020 Update sdp_utils to check gradmode and subclassed tensors (#92323)
# Summary
Fix up the grad check test to check for subclassed tensors and gradmode

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92323
Approved by: https://github.com/soulitzer
2023-01-17 23:14:21 +00:00
68f8042064 Bypass filament2 for new pytorch random distribution method (#92190)
Summary: After D41587318 introduced the new pytorch randomization, filament2 training failed because the chunk size is 0. We gated the new change to external-only to fix the filament2 package.

Test Plan: f402461641 the flow has training successfully finished

Reviewed By: izaitsevfb

Differential Revision: D42501726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92190
Approved by: https://github.com/izaitsevfb
2023-01-17 22:36:24 +00:00
b31905c727 Fix Windows cpu_profiling_allocator_test same pointer check flakiness (#92264)
This is a small follow-up from https://github.com/pytorch/pytorch/pull/91727 to fix the flaky same-pointer check on Windows https://hud.pytorch.org/failure/%5B%20%20FAILED%20%20%5D%20CPUAllocationPlanTest.with_profiling_alloc. AFAICT, keeping the same memory pointer is not a guarantee in the non-mobile memory allocator (or maybe this is Windows-specific behavior).

The test might be flaky when the tensor is copied to a different memory location with the default allocator. That's ok as long as the values remain equal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92264
Approved by: https://github.com/ZainRizvi
2023-01-17 22:22:35 +00:00
16f9d1bb83 [torch.func] Add migration guide from functorch (#91811)
Test Plan:
- view preview

Future:
- still need to figure out the make_fx situation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91811
Approved by: https://github.com/albanD
2023-01-17 22:14:42 +00:00
89f1ad08b4 Revert "Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)"
This reverts commit 7f256fff77c49729131aa6d092e60e891d0c4948.

Reverted https://github.com/pytorch/pytorch/pull/88078 on behalf of https://github.com/huydhn due to This breaks lint 7f256fff77
2023-01-17 22:14:37 +00:00
7f256fff77 Improve bsr @ strided performance in baddmm for bfloat16/half with Triton kernels. (#88078)
As per title.

Additionally we also introduce support for:
- Rectangular block sizes which are powers of 2 and at least 16 (triton's `dot` limitation).
- Batch support with broadcasting for either of the arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88078
Approved by: https://github.com/cpuhrsch
2023-01-17 21:43:20 +00:00
befe815466 Revert "Add sym_size/stride/numel/storage_offset to native_function.yaml (#91919)"
This reverts commit 0388400f3f8a8ecae2f809ba40ca3ddd5a8b9028.

Reverted https://github.com/pytorch/pytorch/pull/91919 on behalf of https://github.com/atalman due to Break internal build
2023-01-17 21:03:18 +00:00
88942a3199 Revert "[FSDP] Do not clean FQNs even for use_orig_params=True (#91767)"
This reverts commit d6f3265e1add26abedb504910be93b393b9fb33c.

Reverted https://github.com/pytorch/pytorch/pull/91767 on behalf of https://github.com/malfet due to Looks like it broke `test_compatible_with_named_optimizer` distribued tests, see d6f3265e1a
2023-01-17 20:04:52 +00:00
0c8f4b5893 Update Module.__setattr__ to respect property setters (#92044)
Fixes #52664. Checks if the attribute is a property that defines a setter and uses fset in __setattr__ rather than registering an inaccessible module / parameter.

This is BC-breaking as the attribute setters on nn.Module properties used to be ignored and now will be called properly.
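
A minimal sketch of the new behavior (hedged; the module here is illustrative):
```python
import torch

class Wrapper(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self._inner = torch.nn.Linear(2, 2)

    @property
    def inner(self):
        return self._inner

    @inner.setter
    def inner(self, module):
        self._inner = module

w = Wrapper()
# Previously this registered a shadowing, inaccessible submodule named
# "inner"; now the property's setter is called instead.
w.inner = torch.nn.Linear(4, 4)
assert w.inner.in_features == 4
```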

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92044
Approved by: https://github.com/albanD
2023-01-17 20:00:06 +00:00
4fc796daf9 [optim] abstract out _default_to_foreach_util (#92305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92305
Approved by: https://github.com/albanD
2023-01-17 19:42:20 +00:00
5c9c39a83f Revert "[fx] rewrite FloorDiv to match Python better (#90906)"
This reverts commit d13207c7adf7f94620b1228dab547ff253c46d0b.

Reverted https://github.com/pytorch/pytorch/pull/90906 on behalf of https://github.com/malfet due to eca_halonext26ts started failing after 2nd PR from the stack  was landed, see 88b3810c94, not sure which one of the two caused it
2023-01-17 19:26:38 +00:00
013afc5abe Revert "[fx] fix type promotion in binary_magic_impl (#91376)"
This reverts commit 88b3810c94b45f5982df616e2bc4c471d173f491.

Reverted https://github.com/pytorch/pytorch/pull/91376 on behalf of https://github.com/malfet due to eca_halonext26ts  started failing after this was landed, see 88b3810c94
2023-01-17 19:04:04 +00:00
933cc67e7e [pytorch] [compososable] make contract() pickle-able through functools wraps (#92120)
Summary:
make contract() pickle-able through functools wraps.
This is to get functions wrapped with contract() to work with torch package

Differential Revision: D42491056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92120
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/rohan-varma, https://github.com/mrshenli
2023-01-17 18:14:05 +00:00
ea1007b89c Run dynamo/test_dynamic_shapes serially (#92215)
Per my findings in https://github.com/pytorch/pytorch/issues/92196#issuecomment-1383029544

> The test itself, dynamo/test_dynamic_shapes, is not flaky and passes fully when I run it locally. However, this test is set to run in parallel with other tests on the runner (2 tests at a time). After many tries, I can only reproduce the issue once when dynamo/test_dynamic_shapes is run in parallel with test_comparison_utils

After many retries, I could reproduce the issue once locally when running (https://paste.sh/_mFImq6V#FgbKq6IQBg65PKUFA08Ah_Vb)

```
python test/run_test.py --verbose --exclude-jit-executor --exclude-distributed-tests -i test_comparison_utils dynamo/test_dynamic_shapes
```

So setting this test to run serially to avoid further flakiness while the root cause is investigated.

Here are some example flaky failures:

* https://github.com/pytorch/pytorch/issues/92196
* https://github.com/pytorch/pytorch/issues/92178
* https://github.com/pytorch/pytorch/issues/92042
* https://github.com/pytorch/pytorch/issues/92210

The test takes 30s or so to finish, so its duration is not a concern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92215
Approved by: https://github.com/clee2000
2023-01-17 17:54:39 +00:00
2eaa7a25d0 Fix model accuracy issue caused by vectorized transpose (#92299)
Fix accuracy issues from models: jx_nest_base, cait_m36_384, XLNetLMHeadModel, Super_SloMo
https://github.com/pytorch/torchdynamo/issues/2038
https://github.com/pytorch/torchdynamo/issues/2037
https://github.com/pytorch/torchdynamo/issues/2036
https://github.com/pytorch/torchdynamo/issues/2035

The inner loop list should be newly created in loop.clone().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92299
Approved by: https://github.com/desertfire
2023-01-17 17:53:45 +00:00
d29f0ba74d Fix philox randn to follow standard normal distribution (#91945)
Fixes #91944
Related #91207
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91945
Approved by: https://github.com/jgong5, https://github.com/ngimel
2023-01-17 17:48:25 +00:00
d6f3265e1a [FSDP] Do not clean FQNs even for use_orig_params=True (#91767)
Cleaning FQN for `FullyShardedDataParallel(use_orig_params=True)` can cause some discrepancies with respect to the FQN compared to manually looping over `named_modules()` and `named_parameters()` together.

There is no requirement for the FQNs to be clean when using wrapper FSDP + `use_orig_params=True`. We can leave clean FQNs to `fully_shard`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91767
Approved by: https://github.com/zhaojuanmao
2023-01-17 17:41:28 +00:00
1439cb0314 [FSDP][optim_state_dict][9/N] Rewrite the all-gather flow of optimizer state to support older GPUs (#91343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91343
Approved by: https://github.com/rohan-varma
2023-01-17 17:21:19 +00:00
46a81c8db7 Deprecate .mT,.T,.mH,.H on 0D tensors (#92143)
As discussed with @ngimel, this is not only not documented,
but also an unnecessary edge case. See https://github.com/pytorch/pytorch/pull/90463#discussion_r1064807197
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92143
Approved by: https://github.com/ngimel
2023-01-17 16:54:35 +00:00
66e498626c Perform first the decomposition and then the ATen function to catch in-place modifications (#92243)
Addresses https://github.com/pytorch/pytorch/pull/91672#discussion_r1070412867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92243
Approved by: https://github.com/ezyang
2023-01-17 16:53:36 +00:00
77b8aa6e43 Wrap a few more functions to ease their tracking during debugging (#92004)
Yup

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92004
Approved by: https://github.com/ezyang
2023-01-17 16:53:36 +00:00
ea8b14f27e Add a test for decompositions that decomposes all the operations as much as possible (#87182)
This will enable more thorough testing of the decompositions than what
OpInfos alone provide.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87182
Approved by: https://github.com/ezyang
2023-01-17 16:53:34 +00:00
d162c8f92b Assorted decomposition fixes (#87183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87183
Approved by: https://github.com/ngimel
2023-01-17 16:53:31 +00:00
da58f9eb8f Rewrite out-of-place decompositions in terms of out-of-place ops (#92003)
Fixes https://github.com/pytorch/torchdynamo/issues/1863

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92003
Approved by: https://github.com/ngimel
2023-01-17 16:53:27 +00:00
1d47c59384 Check in some utility scripts for running dynamic shapes sweeps (#92256)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92256
Approved by: https://github.com/albanD
2023-01-17 16:37:13 +00:00
049838f249 Improve hooks ordering behavior (#85849)
Addresses: https://github.com/pytorch/pytorch/issues/35802

Design doc: https://docs.google.com/document/d/19xSib7FFknRQ5f3ptGFUmiOt3BrgXSUlTQH2xMcZJYg/edit#

### Changes in this PR

#### Implementation
- We now have 3 fields: pre_hooks, retains_grad_hooks, and tensor_pre_hooks, so that we can more precisely define their ordering and when they are executed.
- Since retains_grad uses an entirely new field, we cannot reuse the old retains_grad logic. We refactor retains_grad to call directly into the variable.cpp logic. Other logic in variable.cpp that handles cpp hooks must also be updated.

#### Hooks ordering and execution:
- Defines pre-hooks registered on tensor to run before pre-hooks registered on grad_fn
- Updates pre-hooks registered on tensor to always run, even if they are the inputs= to .grad()
- Post hooks (and pre hooks) can now observe the modifications to gradient by the tensor pre hook
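
A minimal, standard-autograd illustration of the kind of tensor hook these ordering rules are about (a sketch for context, not new functionality from this PR):

```python
import torch

x = torch.ones(3, requires_grad=True)
y = 2 * x
y.register_hook(lambda grad: grad * 10)  # tensor hook on y, runs as y's grad is produced
y.sum().backward()
print(x.grad)  # tensor([20., 20., 20.]): downstream grads see the hook's modification
```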

#### Retains grad hooks
- retains grad hooks always execute last, even if there are other tensor pre-hooks registered

#### Unchanged:
- pre_hooks registered to grad_fn aren't expected to execute if they are the inputs= to .grad()

Follow ups:
- simplify retains_grad field to not be a vector, since it always holds a single hook
- potentially merge capture hooks with tensor pre hooks, this would involve some additional refactoring since
- python hooks registered to tensor behavior on in-place is still wrong

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85849
Approved by: https://github.com/albanD
2023-01-17 16:23:21 +00:00
fb1427ea8f squeeze: allow squeezing multiple dimensions at once (#89017)
Ref #70924

This addresses part 1 of the issue, allowing `torch.squeeze` to be
passed a tuple of dimensions. e.g.
```python
x.squeeze(0).squeeze(0)
```
can now be written
```python
x.squeeze((0, 1))
```
(assuming x has at least 2 dimensions)
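
A small end-to-end check of the new form (a sketch; assumes a build that includes this change):

```python
import torch

x = torch.zeros(1, 1, 3)
assert x.squeeze((0, 1)).shape == (3,)        # new tuple-of-dims form
assert x.squeeze(0).squeeze(0).shape == (3,)  # equivalent chained form
```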

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89017
Approved by: https://github.com/albanD
2023-01-17 14:20:15 +00:00
fbf9e379e1 [autograd.Function] update error messages for vmap to point to docs (#92030)
We need to separately update it when 2.0 comes along and the master docs
become stable docs so that users aren't looking at master docs all the
time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92030
Approved by: https://github.com/soulitzer
2023-01-17 13:36:42 +00:00
81cc9bba5e [autograd.Function] Kill the extension feature flag (#92026)
This PR removes the autograd.Function extension feature flag. This was
previously used for development of the functorch <> autograd.Function
interaction.

It's been in master for long enough with the feature flag defaulting to
True, so it's time to remove it.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92026
Approved by: https://github.com/soulitzer
2023-01-17 13:36:42 +00:00
7aaad0b832 Rename flag that enables/disables _SingleLevelFunction for functorch (#92025)
functorch used to have a switch that enables/disables autograd.Function.
That switch now enables/disables torch.autograd.function._SingleLevelFunction, so
I've renamed it accordingly.

We could just delete the switch because users should not be directly
working with torch.autograd.function._SingleLevelFunction. However,
it was useful for debugging when something went wrong when I was
implementing the autograd.Function <> functorch interaction, so I want
to keep it around as a debugging tool for a while since the code is
already there.

Test Plan:
- updated tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92025
Approved by: https://github.com/soulitzer
2023-01-17 13:36:41 +00:00
14ff58d4fa [generate_vmap_rule] Delete unused output_shapes (#92024)
We don't actually need `output_shapes` to implement
`generate_vmap_rule=True` support for autograd.Function.
- We need this in the vjp (backward) case because autograd automatically
  reduces grad_inputs to inputs and we need to replicate that behavior.
  In order to replicate that behavior, we recorded the original input
  shapes so we know how to reduce the grad_input.
- There is no such behavior for forward-mode AD, so we don't need to
  pass an `output_shapes` to reductify.

This PR simplifies the API of `reductify` and `reductify_leaf`. Instead
of accepting `input_shape_without_bdim` and `allow_expanded_grad`, we
now combine these into a single argument,
`reduce_to_input_shape_without_bdim`.
- if it is None, then we don't do anything
- if it is not-None and a shape, then we will reduce the grad to the
  provided shape.

Test Plan:
- updated original unittests
- wait for test suite
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92024
Approved by: https://github.com/soulitzer
2023-01-17 13:36:39 +00:00
f5af97ef06 [autograd.Function] add nice error message for incorrect usage of vmap (#92023)
This PR:
- adds a nice error message if the user doesn't follow the API of the
  vmap staticmethod correctly. That is, the user must return two
  arguments from the vmap staticmethod API: (outputs, out_dims), and
  out_dims must be a PyTree that either has the same structure as
  `outputs` or is broadcastable to that structure (see the sketch after
  this list).
- Fixes an edge case for out_dims=None. out_dims is allowed to be None,
  but wrap_outputs_maintaining_identity was treating "None" as "This is
  not the vmap case"
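
A rough sketch of the expected (outputs, out_dims) convention, assuming the setup_context-style autograd.Function API (illustrative only, not code from this PR):

```python
import torch

class MySin(torch.autograd.Function):
    @staticmethod
    def forward(x):
        return torch.sin(x)

    @staticmethod
    def setup_context(ctx, inputs, output):
        (x,) = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * torch.cos(x)

    @staticmethod
    def vmap(info, in_dims, x):
        # Must return two things: (outputs, out_dims); out_dims may be a pytree
        # with (or broadcastable to) the same structure as outputs.
        return torch.sin(x), in_dims[0]
```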

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92023
Approved by: https://github.com/soulitzer
2023-01-17 13:36:37 +00:00
2f9166ef89 [autograd.Function] Cleanup asymmetry in generate_vmap_rule and vmap (#91787)
This PR:
- changes generate_vmap_rule to either be True or False. Previously it
  could be True, False, or not set. This simplifies the implementation a
  bit.
- changes the vmap staticmethod to always be on the autograd.Function
  rather than sometimes defined.
  This is how the other staticmethod (forward, backward, jvp) are
  implemented and allows us to document it.

There are 4 possible states for the autograd.Function w.r.t. to the
above:
- generate_vmap_rule is True, vmap staticmethod overriden. This raises
  an error when used with vmap.
- generate_vmap_rule is False, vmap staticmethod overriden. This is
  valid.
- generate_vmap_rule is True, vmap staticmethod not overriden. This is
  valid.
- generate_vmap_rule is False, vmap staticmethod not overriden. This
  raises an error when used with vmap.

Future:
- setup_context needs the same treatment, but that's a bit tricker to
  implement.

Test Plan:
- new unittest
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91787
Approved by: https://github.com/soulitzer
2023-01-17 13:36:34 +00:00
88b3810c94 [fx] fix type promotion in binary_magic_impl (#91376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91376
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-01-17 10:04:38 +00:00
d13207c7ad [fx] rewrite FloorDiv to match Python better (#90906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90906
Approved by: https://github.com/ezyang
2023-01-17 10:04:38 +00:00
5e0d3458eb Move XLA test job to 4xlarge (#92269)
Per the discussion with @clee2000, I'm trying to look into XLA flaky failures.  It's tricky because the runner crashes, losing all the logs.  The only guess I have comes from the test insight information of the XLA test job, i.e. https://hud.pytorch.org/test/insights?jobName=linux-bionic-py3_7-clang8-xla%20%2F%20test%20(xla%2C%201%2C%201%2C%20linux.2xlarge)&workflowId=3919472559&jobId=10650151864

* Memory looks fine. It peaks at ~14GB when building, then dropping when testing
* CPU spikes at 100% at the end, which I suspect to be the reason causing the runner to crash

So the fix is to try to limit the test to nCPU - 1, so there is always one core left for the runner.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92269
Approved by: https://github.com/malfet
2023-01-17 06:43:21 +00:00
ded2b47bde Fix AOTAutograd 2.0 perf regression involving as_strided (#92255)
I feel there may be a deeper fix where we avoid as_strided entirely, but in the regressed model the sizes/strides all lined up exactly, so this seems to work to fix the immediate regression.

Repro command: `python benchmarks/dynamo/torchbench.py  --performance  --backend inductor --float16 --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt) --only hf_Bert  `

Before: 1.138x p=0.00
After: 1.162x p=0.00

Natalia pinpointed it to this line by comparing GPU traces and finding that the regressed PyTorch had two extra fill kernels and a memcpy:

Without regression:
![image](https://user-images.githubusercontent.com/13564/212726521-450e183d-7b36-4538-ad14-617e09c689a8.png)

With regression:
![image](https://user-images.githubusercontent.com/13564/212726469-4f3ff4b5-3f68-48cf-94d2-ddebb9216176.png)

...which CPU profiler blamed on `AsStridedBackward`:

![image](https://user-images.githubusercontent.com/13564/212726953-16333bfc-8460-4445-90ad-7fe73c4173c2.png)

...which were then pinpointed to  https://github.com/pytorch/pytorch/pull/92076/files#diff-df954bbf954d2dcb81f687876053267ffa4ddb36ed86b7d2bd76319ff2b94416R486-R489

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92255
Approved by: https://github.com/ngimel, https://github.com/bdhirsh
2023-01-17 06:07:37 +00:00
cyy
9b716a0682 Clean up more clang-tidy supression (#92203)
1. remove unused NOLINTNEXTLINE(performance-move-const-arg)
2. add more std::move

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92203
Approved by: https://github.com/Skylion007
2023-01-17 05:43:08 +00:00
bbce4184be Refactor inductor to use standard BACKENDS dict (#92187)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92187
Approved by: https://github.com/desertfire
2023-01-17 04:05:43 +00:00
0388400f3f Add sym_size/stride/numel/storage_offset to native_function.yaml (#91919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91919
Approved by: https://github.com/ezyang
2023-01-17 03:39:57 +00:00
801d831d7a [dtensor] enable op db tests by using multithreaded test case (#92198)
Time comparison between using MultithreadedTestCase and MultiProcessTestCase on op db tests is amazing!

using MultiThreadedTestCase on an AWS dev node:
```
time pytest test/distributed/_tensor/test_dtensor_ops.py

============= 175 passed, 42 skipped, 397 xfailed in 80.30s (0:01:20) =======

real    1m22.330s
user    1m38.782s
sys     0m18.762s
```
MultiProcessTestCase takes from 40 minutes to more than 1 hour, even when using pytest parallel testing tools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92198
Approved by: https://github.com/XilunWu
2023-01-17 03:26:38 +00:00
2ce63ef26c [dtensor] switch pointwise op tests to use DTensorOpsTestBase (#92197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92197
Approved by: https://github.com/XilunWu
2023-01-17 03:26:38 +00:00
e16979c9a0 [threaded_pg] full rewrite of MultiThreadedTestCase to enable device_type tests (#91650)
This PR does a full rewrite of MultiThreadedTestCase to make it more
aligned with MultiProcessTestCase. It also changes how spawning and
testing are done, so that we can embed thread-local state when running
tests.

This PR enables device_type tests to work with MultiThreadedTestCase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91650
Approved by: https://github.com/XilunWu
2023-01-17 03:26:36 +00:00
9942ddd5b3 [threaded_pg] enable subpg creation and concurrent collective (#91649)
This PR refactors the threaded PG logic to enable creating multiple
sub-PGs under the world threaded PG, and allows calling collectives
concurrently on different sub-PGs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91649
Approved by: https://github.com/XilunWu
2023-01-17 03:26:34 +00:00
85edb58179 Fix oneDNN double checkout issue and Upgrade oneDNN to v2.7.3 (#92239)
### Description

This PR is to fix oneDNN double checkout issue that mentioned in https://github.com/pytorch/pytorch/pull/87061#issuecomment-1284384276, and upgrade oneDNN to v2.7.3 to fix #92138.

### Performance test

Use TorchBench test in ICX with 40 cores
Intel OpenMP & jemalloc were preloaded
![image](https://user-images.githubusercontent.com/61222868/212634378-b91c20b5-0e85-474f-861c-c1d2f6962de1.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92239
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-01-17 01:54:21 +00:00
d62eff56bd Fix typos introduced by 014ac7fda2d1e59796b1147221fb92f4377ca2f1
Also rename `Facebook CLA` to `EasyCLA`

Test Plan: `python3 test_trymerge.py` passes
2023-01-16 17:38:45 -08:00
014ac7fda2 Add ROCm merge rules (#85762)
Adds jeffdaily as approver needed to merge any changes to ROCm or HIP-related files in PyTorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85762
Approved by: https://github.com/malfet
2023-01-17 00:45:12 +00:00
eadbf762fc Fix CUDA error not getting captured by handler (#92227)
Fixes #91758. Still leaves functions on the hotpath.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92227
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-01-17 00:16:29 +00:00
32937f39f4 Don't raise error if job_id can't be fetched (#92192)
But always return `workflow_id`, which is not unique across reruns but is better than failing the entire run just because an API call failed. Tested locally by feeding the program an incorrect input and observing the failure.
Fixes https://github.com/pytorch/pytorch/issues/91332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92192
Approved by: https://github.com/kit1980
2023-01-17 00:09:05 +00:00
301644d3cb [ROCm] disable NVFuser (#92182)
In preparation for #89621.

Partial reverts of #82498 and #86369.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92182
Approved by: https://github.com/davidberard98
2023-01-16 18:35:12 +00:00
0b90ddacd9 Unit test for is_causal Better Transformers (#91900) (#92102)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/91900

Test Plan:
buck test  :test_transformers -- -r test_train_with_is_causal
buck test mode/opt :test_transformers -- -r test_is_causal_gpu
flake8 test_transformers.py

Differential Revision: D42453642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92102
Approved by: https://github.com/drisspg
2023-01-16 17:25:06 +00:00
b05f509601 Add missing conversion for to_sparse.sparse_dim (#92006)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92006
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2023-01-16 15:34:10 +00:00
523d4f2562 Revert "[cuDNN][cuDNN V8 API] Always build assuming cuDNN >= 8.0 (#91527)"
This reverts commit 4d07ad74f1c11efa55501433d6cf1f06840f5207.

Reverted https://github.com/pytorch/pytorch/pull/91527 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-01-16 13:28:09 +00:00
1a98c3e36c Revert "Add kwargs support to torch.export() API (#92013)"
This reverts commit 890b68281a3eb3e5c5762d5f51bacd91fdfa89d8.

Reverted https://github.com/pytorch/pytorch/pull/92013 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-01-16 13:03:48 +00:00
76c88364ed [inductor] Respect dtype argument in ops.constant (#92093)
Consider the following example:

```python
def fn(x):
    y = torch.full_like(x, 1.2, dtype=torch.int64)
    return x + y
```

In eager this truncates 1.2 to 1, then adds it to `x`. However, in
inductor the literal "1.2" is used verbatim and the result is off by
0.2. This fixes the issue by respecting the dtype argument to `ops.constant`
and truncating accordingly.
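
For reference, a quick eager-mode check of the truncation described above (a minimal sketch):

```python
import torch

x = torch.randn(4)
y = torch.full_like(x, 1.2, dtype=torch.int64)
print(y)  # tensor([1, 1, 1, 1]): eager truncates the fill value to the integer dtype
```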

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92093
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-01-16 12:53:47 +00:00
5a0fa04a49 Add MTIA DeviceType for Meta training and inference devices (#92232)
Summary: This adds a new MTIA DeviceType which is associated with the MTIA DispatchKey and will be used for the Meta in-house training and inference accelerators.

Test Plan: All CI should pass.

Differential Revision: D42526044

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92232
Approved by: https://github.com/ezyang
2023-01-16 12:20:23 +00:00
9cf8434776 [ONNX] Raise Unsupported for Grid Sample with volumetric 5D input (#92212)
Fixes #92209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92212
Approved by: https://github.com/BowenBao
2023-01-16 03:34:05 +00:00
85e0fd0280 [FSDP][BE] Improve device_id + CPU offload test (#92031)
Closes https://github.com/pytorch/pytorch/issues/83054. The new version of the test ensures that the parent FSDP instance has managed parameters to trigger the `module.to(device_from_device_id)` call, which moves the child FSDP instance's managed parameters (and hence must be hackily moved back).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92031
Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma
2023-01-16 02:38:10 +00:00
5a3b4dacad [FSDP][BE] Rename prefixed_param_names -> fqns for consolidation (#92028)
Closes https://github.com/pytorch/pytorch/issues/90961.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92028
Approved by: https://github.com/zhaojuanmao
2023-01-16 02:38:10 +00:00
b0888cce0f [FSDP][BE] Better error msg for incorrect device for training (#92027)
Closes https://github.com/pytorch/pytorch/issues/90541.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92027
Approved by: https://github.com/zhaojuanmao
2023-01-16 02:38:07 +00:00
b5d8fef9a5 [DTensor] remove redundant device mesh test code (#92069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92069
Approved by: https://github.com/wanchaol
2023-01-16 01:17:45 +00:00
513c1e71e2 [DTensor] check DeviceMesh ranks contiguity (#91802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91802
Approved by: https://github.com/wanchaol
2023-01-16 01:17:45 +00:00
2293a6b95e [BE] Refactor get_workflow_job_id (#92191)
A noop change that refactors existing codebase and prints a bit more
verbose error message when request fails.

Get rid of `requests` as it inevitably results in flakiness

TODO: Remove in a few days after PR is landed
4af5939d7a/.github/actions/get-workflow-job-id/action.yml (L29)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92191
Approved by: https://github.com/kit1980
2023-01-15 23:02:29 +00:00
1da0ac2c93 Enable -Werror=bool-operation (#92221)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92221
Approved by: https://github.com/Skylion007
2023-01-15 20:49:53 +00:00
bc4c324807 Remove variable_excluded_from_dispatch() assertion from mkldnncommon (#92168)
When tracing a model using dynamo, these assertions fail. Following https://github.com/pytorch/pytorch/pull/29653 and https://github.com/pytorch/pytorch/pull/46371, we think it might be OK to remove these two assertions as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92168
Approved by: https://github.com/ezyang
2023-01-15 01:40:10 +00:00
d41b5d7c14 [adam] Add not torch.jit.is_scripting() as a requirement for switching to fused (#92181)
A "fix" following https://github.com/pytorch/pytorch/pull/90865. Realized that fused is not compatible with torch.jit.is_scripting() when looking at a later line.

Took the opportunity to make the code cleaner/slightly more performant (with the extends) as well.
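
A hedged sketch of the kind of dispatch guard described above; the helper names below are hypothetical stand-ins, not the actual functions in torch/optim:

```python
import torch

def _single_tensor_impl(*args, **kwargs): ...  # hypothetical stand-ins for the
def _multi_tensor_impl(*args, **kwargs): ...   # single-tensor / foreach / fused paths
def _fused_impl(*args, **kwargs): ...

def _choose_impl(fused: bool, foreach: bool):
    # Fused/foreach fast paths are only usable outside TorchScript.
    if fused and not torch.jit.is_scripting():
        return _fused_impl
    if foreach and not torch.jit.is_scripting():
        return _multi_tensor_impl
    return _single_tensor_impl

print(_choose_impl(fused=True, foreach=False).__name__)  # _fused_impl in eager Python
```
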
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92181
Approved by: https://github.com/albanD
2023-01-14 19:05:27 +00:00
da43584bef [Reland] Clean Up MobileOptimizerType Rewrite Flags Public API and Documentation (#92081)
Summary:
X-link: https://github.com/facebookresearch/d2go/pull/459

Reland of D41690203 (370df963e0)

Remove MobileOptimizerType and all rewrite flags from torch.X and torch._C.X to clean up torch.X and torch._C.X namespaces

The affected rewrite flags are
- CONV_BN_FUSION
- FUSE_ADD_RELU
- HOIST_CONV_PACKED_PARAMS
- INSERT_FOLD_PREPACK_OPS
- REMOVE_DROPOUT
- VULKAN_AUTOMATIC_GPU_TRANSFER

Bc-Breaking Change:

Before this change, the rewrite flags were accessible through all of
1. torch.utils.mobile_optimizer.MobileOptimizerType.X
2. torch._C.MobileOptimizerType.X
3. torch.X
4. torch.MobileOptimizerType.X
5. torch._C.X

But after this change, only torch.utils.mobile_optimizer.MobileOptimizerType.X  (option 1 above) and the newly added torch._C._MobileOptimizerType.X remain

Corresponding updates to PyTorch Tutorial Docs are in https://github.com/pytorch/tutorials/pull/2163

Test Plan:
```buck test caffe2/test:test_mobile_optimizer```
```
Summary
  Pass: 6
  Skip: 1
    ↻ caffe2/test:test_mobile_optimizer - test_mobilenet_optimize_for_mobile (test_mobile_optimizer.TestOptimizer)
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/4222124793514412
```
___
```buck test caffe2/torch/fb/mobile/tests:model_exporter_tests```
Tests pass
___

With temporary testing changes in D41690204:

```buck run caffe2:test_rewrite_flags_api```
Before:
```
torch.utils.mobile_optimizer.MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result: 
torch._C._MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result:  (module 'torch._C' has no attribute '_MobileOptimizerType')
torch._C.MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result: 
torch.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result: 
torch.MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result: 
torch._C.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result: 
```
After:
```
torch.utils.mobile_optimizer.MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result: 
torch._C._MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result: 
torch._C.MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result:  (module 'torch._C' has no attribute 'MobileOptimizerType')
torch.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result:  (module 'torch' has no attribute 'VULKAN_AUTOMATIC_GPU_TRANSFER')
torch.MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result:  (module 'torch' has no attribute 'MobileOptimizerType')
torch._C.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result:  (module 'torch._C' has no attribute 'VULKAN_AUTOMATIC_GPU_TRANSFER')
```

```buck test caffe2/test:public_bindings -- test_no_new_bindings```
```
Summary
  Pass: 1
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/7881299473114294
```

Reviewed By: SS-JIA

Differential Revision: D42442395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92081
Approved by: https://github.com/albanD
2023-01-14 17:06:00 +00:00
55f0ed6dcd [inductor] Fix an issue causing "Could not generate fp64 outputs" (#92036)
Summary: Fix an issue where the fp64 version of a model fails to run when convert_element_type
appears in the model. The failure can cause numerical differences to be
recognized as accuracy errors since the fp64 baseline result is not
available, and thus distracts the Minifier from finding the real culprit for
an accuracy error.

See the discussion in https://github.com/pytorch/torchdynamo/issues/1812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92036
Approved by: https://github.com/ngimel
2023-01-14 17:03:27 +00:00
353e9f883f Add name attribute to ValueRangeAnalysis (#92121)
This is expected when used within InterpreterShim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92121
Approved by: https://github.com/eellison
2023-01-14 12:07:52 +00:00
cyy
a0626c356d Cleanup std::move (#91987)
fix use after move and remove unnecessary lint suppression
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91987
Approved by: https://github.com/Skylion007
2023-01-14 08:17:03 +00:00
1490dc6421 Revert "[BE] meow (#92174)"
This reverts commit 3debb97084484c3ebbba65e5fcbc2a60b77f0b47.

Reverted https://github.com/pytorch/pytorch/pull/92174 on behalf of https://github.com/ezyang due to oh yeah i think the print is intentional graph break
2023-01-14 07:32:39 +00:00
dfabb91614 [LTC] Use DataCache in GetIrValueForScalarFromCodegen (#92066)
Summary:
XLA expects GetIrValueForScalarFromCodegen to use DataCache such that not every scalar will request a data transfer to the backend device.

This needs pytorch/xla#4447 to verify.

Test Plan:
PJRT_DEVICE=CPU python xla/test/test_operations.py -v -k test_cached_addcdiv

Fixes pytorch/xla#4213.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92066
Approved by: https://github.com/JackCaoG
2023-01-14 05:38:06 +00:00
3debb97084 [BE] meow (#92174)
:')
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92174
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2023-01-14 05:36:47 +00:00
421f40e051 Use binary units for CUDA memory summary (#91854)
To reduce confusion, use e.g. `KiB` instead of `KB`, since we're talking about powers of 2, not 10.

https://en.wikipedia.org/wiki/Byte#Multiple-byte_units

```
import torch
x = torch.zeros(1024 * 1024, dtype=torch.uint8, device='cuda')
print(torch.cuda.memory_summary())
```

```
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   1024 KiB |   1024 KiB |   1024 KiB |      0 B   |
|       from large pool |      0 KiB |      0 KiB |      0 KiB |      0 B   |
|       from small pool |   1024 KiB |   1024 KiB |   1024 KiB |      0 B   |
|---------------------------------------------------------------------------|
| Active memory         |   1024 KiB |   1024 KiB |   1024 KiB |      0 B   |
|       from large pool |      0 KiB |      0 KiB |      0 KiB |      0 B   |
|       from small pool |   1024 KiB |   1024 KiB |   1024 KiB |      0 B   |
|---------------------------------------------------------------------------|
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91854
Approved by: https://github.com/ngimel
2023-01-14 05:10:51 +00:00
b8057aa16d Remove unnecessary copies of Scalars for TensorBody template (#92162)
Inspired by #92156, I realized our generated TensorBody.h has many methods that make unnecessary copies. Scalar is backed by a pointer and is therefore not trivially copyable, so care should be taken over ownership of the params. Since it's a template, clang-tidy was never run on it in a way that was able to propagate the changes back to the source code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92162
Approved by: https://github.com/ezyang
2023-01-14 03:38:03 +00:00
3a0053abd6 Move PyObject code out of TensorImpl into new PyObjectSlot class (#92169)
Redo of PR #92099

Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92169
Approved by: https://github.com/albanD
2023-01-14 02:55:32 +00:00
7568484d54 [torchgen] Add CI job to cover custom ops registration for Executorch (#91291)
As titled. To register a custom op into Executorch, we need:

* `custom_ops.yaml`, defines the operator schema and the corresponding native function.
* `custom_ops.cpp`, defines the kernel.
* `RegisterDispatchKeyCustomOps.cpp`, a template to register operator into PyTorch.

Added a new test for custom ops. The custom op `custom::add_3.out` takes 3 tensors and add them together. The test makes sure it is registered correctly and then verifies the outcome is correct.

Differential Revision: [D42204263](https://our.internmc.facebook.com/intern/diff/D42204263/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91291
Approved by: https://github.com/ezyang
2023-01-14 02:30:54 +00:00
66b324cf06 Revert "In inductor triton generated code, avoid masking when numel=1 (#91254)"
This reverts commit 4e21fc2075e09dd735746696f95dce093b634c16.

Reverted https://github.com/pytorch/pytorch/pull/91254 on behalf of https://github.com/ngimel due to regresses perf of hf models
2023-01-14 01:39:10 +00:00
d3765509df [optim][adadelta] default to foreach when CUDA + differentiable=False (#91896)
following up to https://github.com/pytorch/pytorch/pull/90865 and https://github.com/pytorch/pytorch/pull/92048
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91896
Approved by: https://github.com/albanD
2023-01-14 01:21:33 +00:00
cb67d9460b [PT-D] Fix send, recv return type (#92152)
- `send` returns `None`.
- `recv` returns the sender rank if valid or -1 otherwise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92152
Approved by: https://github.com/wz337
2023-01-14 01:09:49 +00:00
4af5939d7a [optim] Improve adadelta foreach, group tensors to maximize fast path (#92048)
The old behavior would have adadelta foreach send tensors to the slow path if they were not all the same dtype or on the same device.

This PR adds grouping for adadelta optimizer so that it would run foreach in batches, allowing more users to benefit from foreach perf.
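
A minimal sketch of the grouping idea (not the optimizer's actual helper): bucket tensors by (device, dtype) so each bucket can take the foreach fast path.

```python
from collections import defaultdict

import torch

def group_by_device_and_dtype(tensors):
    groups = defaultdict(list)
    for t in tensors:
        groups[(t.device, t.dtype)].append(t)
    return groups

params = [torch.zeros(2), torch.zeros(2, dtype=torch.float64)]
print({key: len(group) for key, group in group_by_device_and_dtype(params).items()})
```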

Of course, we should ensure that the new implementation works, so there are new tests to ensure this behavior is not broken.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92048
Approved by: https://github.com/albanD
2023-01-14 00:35:14 +00:00
3779a75fc9 Apply noexcept to relevant move methods to improve performance (#92156)
This clang-tidy check is disabled globally due to false positives on containers, but there are a few places here where adding clang-tidy would actually improve performance (by allowing STL containers to use the move operator / assignment)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92156
Approved by: https://github.com/ngimel
2023-01-14 00:17:26 +00:00
901a34ccb5 Add the new unstable workflow (#92106)
It's empty at the moment, but would tentatively include ROCm trunk jobs.  This adopts the same practice we have for inductor where it's run for every commit on trunk, and on PR with `ciflow/unstable` label

- [x] Allow `ciflow/unstable` as a valid tag https://github.com/pytorch/test-infra/pull/1394
- [x] Create the unstable workflow on PyTorch https://github.com/pytorch/pytorch/pull/92106
- [ ] Gather reliability metrics of ROCm runner
- [ ] Decide if we want to move ROCMs trunk jobs to the unstable workflow
- [ ] Add redness metrics for the unstable workflow

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92106
Approved by: https://github.com/ZainRizvi
2023-01-13 23:53:23 +00:00
3794b4643f [GHF] Record how many times PR is reverted (#92180)
Or merged, by adding "revertedX2","revertedX3",... labels

Tested in https://github.com/malfet/deleteme/pull/36

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92180
Approved by: https://github.com/ZainRizvi, https://github.com/kit1980
2023-01-13 23:18:38 +00:00
70b3ea59ae [ROCM] Modify transcoding: absolute path ->relative path (#91845)
Fixes https://github.com/pytorch/pytorch/issues/91797
This PR compiles the transcoded file with a relative path so that the transcoded file is written to SOURCE.txt as a relative path, ensuring successful packaging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91845
Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang
2023-01-13 23:00:57 +00:00
214c0fdc4b MYPYNOFOLLOW for test_utils (#92136)
lintrunner went from 10 minutes to 25 minutes after 333540a458d40603feea84d30e4ad9b96b07318d since test/test_utils.py imports op_db, which takes 10+ minutes to run mypy on, so switch it to the group of files that doesn't follow imports
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92136
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-01-13 22:57:04 +00:00
04689ae209 [CI][ROCm] skip multiprocessing tests that trigger hangs (#92101)
Skip tests affected by #90940.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92101
Approved by: https://github.com/huydhn
2023-01-13 22:39:00 +00:00
4d07ad74f1 [cuDNN][cuDNN V8 API] Always build assuming cuDNN >= 8.0 (#91527)
We've been building with V8 (incl. V8 API) by default for a while now; this PR cleans up some guards for cuDNN < 8.0.

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91527
Approved by: https://github.com/ngimel
2023-01-13 18:55:37 +00:00
4d26903739 Revert "Pytorch-bot test (#92163)"
This reverts commit 7fe3c64bdb6dc5aa969230ce0b10a9869849b49e.

Reverted https://github.com/pytorch/pytorch/pull/92163 on behalf of https://github.com/clee2000 due to undo the test
2023-01-13 18:43:05 +00:00
7fe3c64bdb Pytorch-bot test (#92163)
test to try to make pytorch-bot not a first time contributor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92163
Approved by: https://github.com/huydhn
2023-01-13 18:39:47 +00:00
f4b804eeaa Call profiler step via optimizer post hook (#90101)
This PR adds the `_profile_using_dynolog` function to `torch/__init__.py`. The `_profile_using_dynolog` method allows registering the optimizer step post hook. This is required to collect iteration based traces using dynolog.
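
A hedged sketch of the idea: drive `profiler.step()` from an optimizer post-step hook. This assumes the generic `register_step_post_hook` API and is not the PR's helper:

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=2),
) as prof:
    # Advance the profiler schedule once per optimizer step.
    opt.register_step_post_hook(lambda optimizer, args, kwargs: prof.step())
    for _ in range(5):
        model(torch.randn(2, 4)).sum().backward()
        opt.step()
        opt.zero_grad()
```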

Other related changes for tests to pass:
1. Updated `optimizer.pyi`
1. Updated `overrides.py`
1. The test `test_kineto_profiler_multiple_steppers` in `test_profiler.py` has been broken down into two cases:
     - `test_kineto_profiler_multiple_steppers_with_override_True` : this test uses the override argument
     - `test_kineto_profiler_multiple_steppers_with_override_False` : this test uses the environment variable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90101
Approved by: https://github.com/albanD
2023-01-13 18:07:40 +00:00
6783db13ef Update CMakeLists.txt since MacOS linker doesn't support whole-archive (#91736)
--whole-archive is a linker option (notice that the flag is passed as -Wl,--whole-archive), and -force_load is indeed available on the MacOS platform (below is a quote from man ld):

 -force_load path_to_archive
        Loads all members of the specified static archive library.  Note:
        -all_load forces all members of all archives to be loaded.  This
        option allows you to target a specific archive.

Quote from malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91736
Approved by: https://github.com/larryliu0820
2023-01-13 18:03:02 +00:00
745fe35df5 [follow-up] Python Attr Serialization (#88913)
Ref: https://github.com/pytorch/pytorch/pull/81616#issuecomment-1307595402
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88913
Approved by: https://github.com/albanD
2023-01-13 17:38:51 +00:00
a72bcb3388 Do not leak SkipFrame exception to parent frames (#91059)
As discovered in https://github.com/pytorch/torchdynamo/issues/2000, the `SkipFrame` exception, used to avoid repeatedly compiling a loop frame with graph breaks, could leak to parent frames while inlining, which then prevents compilation.

This PR checks during inlining whether such an exception is raised and instead raises an `Unsupported` to the outer frame. The original behavior and goal of #88857 are unaffected: the inner frame that has the loop is still skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91059
Approved by: https://github.com/jansel, https://github.com/thiagocrepaldi
2023-01-13 17:11:22 +00:00
a60125e298 add docstring for adam differentiable parameter (#91881)
Fixes #90467

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91881
Approved by: https://github.com/janeyx99
2023-01-13 17:08:27 +00:00
8f1c3c68d3 [BE] Use nested namespaces in .cpp/.cu files (#92100)
As we live in C++17 world

This is a functional no-op, just
- `s/namespace at { namespace native {/namespace at::native {/`
- `s/namespace torch { namespace jit {/namespace torch::jit {/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92100
Approved by: https://github.com/izaitsevfb
2023-01-13 16:32:34 +00:00
a4a0195c6c Fix torch.where signature mismatch that was caused by torchgen (#91627)
Fixes #91003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91627
Approved by: https://github.com/albanD
2023-01-13 16:17:55 +00:00
accecd7b04 [torchdim] Fix Python 3.11 bytecode decoding in dims (#91290)
Adds a PyInstDecoder object that handles the differences in bytecode
added in 3.11. Basically some instructions have inline caches which
change the size of the instruction, so calculating the next instruction
is slightly different.

fixes #91246
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91290
Approved by: https://github.com/albanD
2023-01-13 16:15:23 +00:00
60e37a6e08 Update sgd doc to insist on momentum buffer initial value (#92111)
Following the discussion in https://github.com/pytorch/pytorch/pull/91108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92111
Approved by: https://github.com/soumith, https://github.com/janeyx99
2023-01-13 15:50:57 +00:00
a26e5e21b5 Improve type hints for Module forward hooks (#92061)
Fixes #91654.

Currently, the `hook` parameters of `nn.Module.register_forward_pre_hook` and `nn.Module.register_forward_hook` are typed as `Callable[..., None]`, which 1) does not enable the validation of the signature of `hook` and 2) incorrectly restricts the return type of `hook`, which the docstrings of these methods themselves state can be non-`None`.

The typing of the first parameter of `hook` as `TypeVar("T", bound="Module")` allows the binding of `Callable` whose first parameter is a subclass of `Module`.

---

Here are some examples of:
1. forward hooks and pre-hook hooks being accepted by mypy according to the new type hints
2. mypy throwing errors d.t. incorrect `hook` signatures
3. false negatives of pre-hooks being accepted as forward hooks
4. false negatives of hooks with kwargs being accepted irrespective of the value provided for `with_kwargs`

```python
from typing import Any, Dict, Tuple

import torch
from torch import nn

def forward_pre_hook(
    module: nn.Linear,
    args: Tuple[torch.Tensor, ...],
) -> None:
    ...

def forward_pre_hook_return_input(
    module: nn.Linear,
    args: Tuple[torch.Tensor, ...],
) -> Tuple[torch.Tensor, ...]:
    ...

def forward_pre_hook_with_kwargs(
    module: nn.Linear,
    args: Tuple[torch.Tensor, ...],
    kwargs: Dict[str, Any],
) -> None:
    ...

def forward_pre_hook_with_kwargs_return_input(
    module: nn.Linear,
    args: Tuple[torch.Tensor, ...],
    kwargs: Dict[str, Any],
) -> Tuple[Tuple[torch.Tensor, ...], Dict[str, Any]]:
    ...

def forward_hook(
    module: nn.Linear,
    args: Tuple[torch.Tensor, ...],
    output: torch.Tensor,
) -> None:
    ...

def forward_hook_return_output(
    module: nn.Linear,
    args: Tuple[torch.Tensor, ...],
    output: torch.Tensor,
) -> torch.Tensor:
    ...

def forward_hook_with_kwargs(
    module: nn.Linear,
    args: Tuple[torch.Tensor, ...],
    kwargs: Dict[str, Any],
    output: torch.Tensor,
) -> None:
    ...

def forward_hook_with_kwargs_return_output(
    module: nn.Linear,
    args: Tuple[torch.Tensor, ...],
    kwargs: Dict[str, Any],
    output: torch.Tensor,
) -> torch.Tensor:
    ...

model = nn.Module()

# OK
model.register_forward_pre_hook(forward_pre_hook)
model.register_forward_pre_hook(forward_pre_hook_return_input)
model.register_forward_pre_hook(forward_pre_hook_with_kwargs, with_kwargs=True)
model.register_forward_pre_hook(forward_pre_hook_with_kwargs_return_input, with_kwargs=True)

model.register_forward_hook(forward_hook)
model.register_forward_hook(forward_hook_return_output)
model.register_forward_hook(forward_hook_with_kwargs, with_kwargs=True)
model.register_forward_hook(forward_hook_with_kwargs_return_output, with_kwargs=True)

# mypy(error): [arg-type]
model.register_forward_pre_hook(forward_hook)
model.register_forward_pre_hook(forward_hook_return_output)
model.register_forward_pre_hook(forward_hook_with_kwargs)
model.register_forward_pre_hook(forward_hook_with_kwargs_return_output)

model.register_forward_hook(forward_pre_hook)
model.register_forward_hook(forward_pre_hook_return_input)

# false negatives
model.register_forward_hook(forward_pre_hook_with_kwargs)
model.register_forward_hook(forward_pre_hook_with_kwargs_return_input)

model.register_forward_pre_hook(forward_pre_hook_with_kwargs, with_kwargs=False)
model.register_forward_pre_hook(forward_pre_hook_with_kwargs_return_input, with_kwargs=False)
...
```

---

Though it is not functional as of mypy 0.991, the ideal typing of these methods would use [`typing.Literal`](https://mypy.readthedocs.io/en/stable/literal_types.html#literal-types):

```python
T = TypeVar("T", bound="Module")

class Module:

    @overload
    def register_forward_hook(
        self,
        hook: Callable[[T, Tuple[Any, ...], Any], Optional[Any]],
        *,
        prepend: bool = ...,
        with_kwargs: Literal[False] = ...,
    ) -> RemovableHandle:
        ...

    @overload
    def register_forward_hook(
        self,
        hook: Callable[[T, Tuple[Any, ...], Dict[str, Any], Any], Optional[Any]],
        *,
        prepend: bool = ...,
        with_kwargs: Literal[True] = ...,
    ) -> RemovableHandle:
        ...

    def register_forward_hook(...):
        ...

```

which would:

1. validate the signature of `hook` according to the corresponding literal value provided for `with_kwargs` (and fix the false negative examples above)
2. implicitly define the [fallback `bool` signature](https://github.com/python/mypy/issues/6113#issuecomment-1266186192) e.g. to handle if a non-literal is provided for `with_kwargs`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92061
Approved by: https://github.com/albanD
2023-01-13 15:45:42 +00:00
890b68281a Add kwargs support to torch.export() API (#92013)
Fixes [#1997](https://github.com/pytorch/torchdynamo/issues/1997)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92013
Approved by: https://github.com/jansel
2023-01-13 15:17:26 +00:00
b3e4f5029b Add check-sparse-tensor-invariants flag to Context - 2nd try. (#92094)
This PR is a copy of https://github.com/pytorch/pytorch/pull/90849 that merge was reverted.

The PR adds "check sparse tensor invariants" flag to Context that when enabled will trigger sparse tensor data invariants checks in unsafe methods of constructing sparse COO/CSR/CSC/BSR/BSC tensors. The feature includes the following changes to UI:

`torch.sparse.check_sparse_tensor_invariants` class provides different ways to enable/disable the invariant checking.

`torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor` functions have a new optional argument `check_invariants` to enable/disable the invariant checks explicitly. When the `check_invariants` argument is specified, the global state of the feature is temporarily overridden.
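
A hedged usage sketch of the two entry points described above:

```python
import torch

# Enable the checks for a scoped region (also usable as a decorator).
with torch.sparse.check_sparse_tensor_invariants():
    pass  # invariant checks are enabled for sparse tensors constructed in here

# Or override per call via the new check_invariants argument.
t = torch.sparse_coo_tensor(
    torch.tensor([[0, 1], [0, 1]]),  # indices
    torch.tensor([1.0, 2.0]),        # values
    (2, 2),
    check_invariants=True,
)
```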

The PR fixes https://github.com/pytorch/pytorch/issues/90833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92094
Approved by: https://github.com/cpuhrsch
2023-01-13 14:50:33 +00:00
a111dd9014 [dynamo] support comparing numpy ndarray (#91870)
The output of Torchbench model `doctr_det_predictor` on CPU is a `numpy ndarray`. When running the accuracy benchmark of this model, the below error is raised: `RuntimeError: unsupported type: ndarray`.
Repro CMD:
```bash
python benchmarks/dynamo/torchbench.py --accuracy --float32 -dcpu  -n50 --inductor  --no-skip --dashboard --only doctr_det_predictor --batch_size 1 --threads 1
```

This PR adds the support to compare `numpy ndarray` in the dynamo utils.
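
A hypothetical helper illustrating the kind of handling added (this is not dynamo's actual comparison code):

```python
import numpy as np
import torch

def same(a, b):
    # Convert ndarray outputs to tensors before comparing numerically.
    if isinstance(a, np.ndarray):
        a, b = torch.from_numpy(a), torch.from_numpy(b)
    return torch.allclose(a, b)

print(same(np.ones(3), np.ones(3)))  # True
```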

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91870
Approved by: https://github.com/jgong5, https://github.com/Chillee
2023-01-13 12:11:49 +00:00
fa3841ffd4 [ONNX] Fix potential flaky test in test_verification.py (#92105)
With very low probability, it is possible for all values to be positive throughout the
execution of this test model. The test tries to fake an incorrect export by replacing
relu's output with its input. However, the behavior of the model is the same when all
values are positive, leading to a false test failure.
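
A quick illustration of the false negative described above: on strictly positive inputs, replacing relu's output with its input changes nothing.

```python
import torch

x = torch.rand(4) + 0.1  # strictly positive
assert torch.equal(torch.relu(x), x)
```
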
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92105
Approved by: https://github.com/titaiwangms
2023-01-13 07:56:24 +00:00
ec3941ada6 [quant][fx] Add support for GRU in fx graph mode quantization (#91976)
Summary:
might be needed by a meta-internal use case

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_rnn

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91976
Approved by: https://github.com/jcaip
2023-01-13 07:00:12 +00:00
0bd3fa3d22 [Quant][docs] Move parts of BackendConfig tutorial (#91999)
Summary: This commit moves the API specification section of
the BackendConfig tutorial to the docstrings, which is a more
suitable place for this content. This change also reduces some
duplication. There is no new content added in this change.

Reviewers: jerryzh168, vkuzo

Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91999
Approved by: https://github.com/vkuzo, https://github.com/jerryzh168
2023-01-13 05:59:22 +00:00
a617d031ff [Inductor Perf CI] Enable perf CI smoke test (#92051)
This tries to detect perf regression e.g. https://github.com/pytorch/pytorch/pull/91316#issuecomment-1370370885
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92051
Approved by: https://github.com/seemethere, https://github.com/desertfire
2023-01-13 05:47:17 +00:00
eb7b89771e unify reduction types from different operators: scatter, scatter_reduce, segment_reduce (#91499)
The target of this PR is to unify `ReductionType` for reduce operators so that we have the same set of reduce utils for `init`, or `update` for vectorization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91499
Approved by: https://github.com/ngimel
2023-01-13 04:32:34 +00:00
a70387f0fa [vision hash update] update the pinned vision hash (#92119)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92119
Approved by: https://github.com/pytorchbot
2023-01-13 04:16:33 +00:00
fbbb19599a Update dynamic skips after #92076 (#92103)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92103
Approved by: https://github.com/voznesenskym, https://github.com/Chillee
2023-01-13 04:05:10 +00:00
9412778d51 Fix OneCycleLR error log (#92040)
If we call the scheduler 11 times but the number of expected steps is 10, we should print `Tried to step 11 times`.
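
A hedged repro sketch of the scenario (the expected message wording comes from this PR's description):

```python
import torch

model = torch.nn.Linear(1, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=0.1, total_steps=10)
for _ in range(10):
    opt.step()
    sched.step()
opt.step()
try:
    sched.step()  # the 11th call exceeds total_steps=10
except ValueError as e:
    print(e)      # should now report stepping 11 times, not an off-by-one count
```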

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92040
Approved by: https://github.com/janeyx99
2023-01-13 02:46:59 +00:00
61cdae0ce5 Switch Windows CI jobs to G5 runners (#91727)
### Changelist

* Change Windows TORCH_CUDA_ARCH_LIST from `7.0` to `8.6` to be compatible with the NVIDIA A10G GPU
* Correctly disable some tests that require flash attention, which is not available on Windows at the moment. This has been fixed by https://github.com/pytorch/pytorch/pull/91979
* G5 runner has `AMD EPYC 7R32` CPU, not an Intel one
  * This seems to change the behavior of `GetDefaultMobileCPUAllocator` in `cpu_profiling_allocator_test`.  This might need to be investigated further (TODO: TRACKING ISSUE).  In the meantime, the test has been updated accordingly to use `GetDefaultCPUAllocator` correctly instead of `GetDefaultMobileCPUAllocator` for mobile build
  * Also one periodic test `test_cpu_gpu_parity_nn_Conv3d_cuda_float32` fails with Tensor not close error when comparing grad tensors between CPU and GPU. This is fixed by turning off TF32 for the test.

###  Performance gain

* (CURRENT) p3.2xlarge - https://hud.pytorch.org/tts shows each Windows CUDA shard (1-5 + functorch) takes about 2 hours to finish (duration)
* (NEW RUNNER) g5.4xlarge - A very rough estimate of the duration is 1h30m for each shard, meaning a half-hour gain (**25%**)

### Pricing

On demand hourly rate:

* (CURRENT) p3.2xlarge: $3.428. Total = Total hours spent on Windows CUDA tests * 3.428
* (NEW RUNNER) g5.4xlarge: $2.36. Total = Total hours spent on Windows CUDA tests * Duration gain (0.75) * 2.36

So the current runner is not only more expensive but also slower.  Switching to G5 runners for Windows should cut down the cost by (3.428 - 0.75 * 2.36) / 3.428 = **~48%**

### Rolling out

https://github.com/pytorch/test-infra/pull/1376 needs to be reviewed and approved to ensure the capacity of the runner before PR can be merged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91727
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/seemethere
2023-01-13 01:11:59 +00:00
b7cad020b5 [DTensor] require DeviceMesh size equals world size (#91801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91801
Approved by: https://github.com/wanchaol
2023-01-12 22:37:55 +00:00
3dd9dbd942 [DTensor] create default process group when absent (#91756)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91756
Approved by: https://github.com/wanchaol
2023-01-12 22:37:55 +00:00
f8e641bad4 Revert "Make ModuleList derive from Sequence[T] and type it appropriately (#89135)"
This reverts commit d0bfd79f3d1bbf8885b00acb6d72db0bc16f1995.

Reverted https://github.com/pytorch/pytorch/pull/89135 on behalf of https://github.com/albanD due to Is actually breaking user code
2023-01-12 22:04:02 +00:00
8fa66a6337 [quant][pt2e] Add a test to confirm we can set qconfig according to module_name (#91977)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_qconfig_none

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91977
Approved by: https://github.com/jcaip
2023-01-12 21:59:02 +00:00
6f749fd171 Fixes to DSA infra (#91835)
Differential Revision: D42397325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91835
Approved by: https://github.com/soumith
2023-01-12 21:54:26 +00:00
4636fe701c Limit the memory and CPU of Bazel build to avoid crashing the runner (#92056)
I'm seeing quite a number of runner errors "i-NUMBER lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error" with Bazel build and test job, i.e. https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=bazel

The job runs on a normal `linux.2xlarge` runner.  As the error doesn't occur with any other jobs running on the same type of runner, with the exception of XLA, I suspect that this is due to a resource constraint crashing the runner.  So this PR sets a limit on the amount of memory and CPU that Bazel can use.  Even if Bazel crashes, e.g. with an OOM error, that's still better than crashing the whole runner and losing all the logs.

Example failures:

* 33e3c9ac67
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92056
Approved by: https://github.com/ZainRizvi
2023-01-12 21:51:16 +00:00
7078ad5b8c Reland "AOT Autograd refactor + cleanup, handle intermediate views of bases, use view replay, fix non-tensor input handling" (#92076)
Original PR: https://github.com/pytorch/pytorch/pull/89532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92076
Approved by: https://github.com/janeyx99, https://github.com/albanD
2023-01-12 21:32:05 +00:00
da77b10b41 fix in-place geometric pmf (#92049)
See https://github.com/pytorch/pytorch/pull/37984#discussion_r1059548320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92049
Approved by: https://github.com/lezcano
2023-01-12 19:56:44 +00:00
5f55335c2e Fixed output memory format mismatch for bicubic2d (#90470)
Description:

- output memory format is matching input for bicubic2d

Problem: output tensor's memory format does not match input format for bicubic2d

```python
import torch

i = torch.rand(1, 3, 32, 32).contiguous(memory_format=torch.channels_last)
assert i.is_contiguous(memory_format=torch.channels_last)
o = torch.nn.functional.interpolate(i, size=(4, 4), mode="bicubic")
assert o.is_contiguous(memory_format=torch.channels_last), f"Should be channels last but given channels first ({o.is_contiguous(memory_format=torch.contiguous_format)})"

> AssertionError: Should be channels last but given channels first (True)
```

Related PR fixing bilinear ops: https://github.com/pytorch/pytorch/pull/53535 (cc @VitalyFedyunin @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @bdhirsh )

Discovered together with @NicolasHug while working on https://github.com/pytorch/pytorch/tree/interpolate_uint8_images_linear_cpu_support_dev

- Updated code to match grad input / output memory formats
- temporary tensor creation matches memory format in `separable_upsample_generic_Nd_kernel_impl`
- Updated tests
- Added missing forward AD support for bicubic with antialiasing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90470
Approved by: https://github.com/NicolasHug, https://github.com/lezcano
2023-01-12 19:52:28 +00:00
c4a6f21b50 [JIT] Add tests for pow() with different dtype inputs (#91946)
Fixes #75476

Apparently this NNC bug has been fixed at some point. Adding tests to track this and verify via CI that this is actually fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91946
Approved by: https://github.com/qihqi
2023-01-12 19:39:55 +00:00
515dff7811 [functorch] move batch_norm_replacement to torch.func (#91412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91412
Approved by: https://github.com/zou3519
2023-01-12 19:15:41 +00:00
7bdcf6d4f0 Revert "[FSDP] Do not clean FQNs even for use_orig_params=True (#91767)"
This reverts commit a383789f4d8ecb36adaff6bd3746430209ff0546.

Reverted https://github.com/pytorch/pytorch/pull/91767 on behalf of https://github.com/huydhn due to This breaks inductor_distributed workflow a383789f4d
2023-01-12 19:07:50 +00:00
91920ee6da sparse_mask: remove redundant mask.coalesce() in to_dense_backward (#92001)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92001
Approved by: https://github.com/cpuhrsch
2023-01-12 17:50:06 +00:00
b9182cbbd8 Fixup torch jit with some initializers and moves (#92037)
Fixup some minor codequality issues in torch JIT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92037
Approved by: https://github.com/ezyang
2023-01-12 17:29:24 +00:00
5625f521a4 generate set_device call to ensure context existence (#92055)
Hopefully Fixes https://github.com/pytorch/torchdynamo/issues/2026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92055
Approved by: https://github.com/wconstab
2023-01-12 17:23:49 +00:00
7c641eaaf0 [Inductor] Support vectorized transpose in CPP backend (#91532)
Fix https://github.com/pytorch/torchdynamo/issues/1915
This PR adds the vectorization support for transposed operations in TorchInductor CPP backend. It contains the following changes:
1. `CppTile2DKernelChecker` is added to check the eligibility of applying the optimization. We only address a narrow set of situations. All of the following conditions should be met: 1) There exists one and only one fp32 load/store with the outer loop var having contiguous buffer accesses. 2) When a load/store doesn't have contiguous access in an outer loop var, the access should be vectorizable from the inner-most dim. 3) No reduction. More scenarios/operations will be supported in future PRs.
2. If `CppTile2DKernelChecker` reports the optimization is doable, `CppKernelProxy` would split/tile the loops from both the outer loop var having contiguous buffer access and the inner-most loop var.
3. The main loop split from the outer loop var is further split at the inner-most level and then handled by `CppTile2DKernel` and `CppTile2DTailKernel` which generate the transposed load/store. The former kernel does the vectorized transposed load/store on tiles and then does vectorized load/store/compute along the inner-most loop axis. The vectorized transpose micro-kernel implementation borrows/refers to that from FBGEMM. The latter kernel simply does scalar operations.
4. The tail loop split from the outer loop var directly calls `CppKernel` with scalar operations.

Next steps:
1. Support vectorized transpose with smaller tile size at one dim but bigger tile size at the other, e.g., 3x784.
2. Support reduction vectorized on the outer loop var (contiguous from outer loop var, not with inner-most loop var)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91532
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-01-12 17:20:39 +00:00
eece6da162 [inductor] Reduce device context manager overhead (#91045)
This adds `torch.cuda._DeviceGuard` which is a stripped down version of
`torch.cuda.device` with lower overhead. To do this, it only accepts `int` as
the device so we don't need to call `_get_device_index` and is implemented
with a new C++ helper `torch._C._cuda_exchangeDevice` that allows
`_DeviceGuard.__enter__` to be just a single function call. On my machine,
I see a drop from 3.8us of overhead to 0.94 us with this simple benchmark:

```python
def set_device():
    with torch.cuda.device(0):
        pass

%timeit set_device()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91045
Approved by: https://github.com/ngimel, https://github.com/anijain2305
2023-01-12 16:51:59 +00:00
db466ae057 Revert "[Modes] Add assert that the mode isn't already on the stack (#90770)"
This reverts commit 702838637d63936460ea2bf00b64ffec86ed6687.

Reverted https://github.com/pytorch/pytorch/pull/90770 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-01-12 16:44:29 +00:00
a383789f4d [FSDP] Do not clean FQNs even for use_orig_params=True (#91767)
Cleaning FQN for `FullyShardedDataParallel(use_orig_params=True)` can cause some discrepancies with respect to the FQN compared to manually looping over `named_modules()` and `named_parameters()` together.

There is no requirement for the FQNs to be clean when using wrapper FSDP + `use_orig_params=True`. We can leave clean FQNs to `fully_shard`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91767
Approved by: https://github.com/zhaojuanmao
2023-01-12 15:14:14 +00:00
7f50ff1685 [FSDP] Test use_orig_params=True, no_sync(), mixed precision (#91193)
This makes some minor fixes to ensure that `use_orig_params=True`, `no_sync()`, and mixed precision work together for `FULL_SHARD`, `SHARD_GRAD_OP`, and `NO_SHARD`.

The added unit test only checks that dtypes are correct since for FP16, it is hard to test for numeric parity against a baseline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91193
Approved by: https://github.com/zhaojuanmao
2023-01-12 15:14:14 +00:00
e5503aceae [FSDP] Re-support model dtype change after FSDP init (#91192)
Closes https://github.com/pytorch/pytorch/issues/90838.

To make mixed precision precise internally, https://github.com/pytorch/pytorch/pull/90660 changed the implementation to save `_orig_param_dtype`, `_low_prec_param_dtype`, and `_reduce_dtype` explicitly. However, these are computed at FSDP construction time, so it does not allow the user to change the model dtype after FSDP construction time but before lazy initialization. This PR recomputes those dtype attributes as needed if the model dtype changes in that window.
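For illustration, a minimal single-rank sketch of the window described above (an assumption-laden example, not this PR's test code: it assumes one CUDA device and an NCCL process group initialized in-process):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# single-process setup purely for illustration
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)
torch.cuda.set_device(0)

model = FSDP(nn.Linear(8, 8).cuda())   # dtype attributes are recorded at FSDP construction
model = model.half()                   # model dtype change before lazy init (first forward)
out = model(torch.randn(4, 8, device="cuda", dtype=torch.float16))
```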

Note that any mixed precision settings specified by the user take precedence over the model dtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91192
Approved by: https://github.com/zhaojuanmao
2023-01-12 15:14:10 +00:00
e096d2db5a [BC-Breaking] Separate stream_id, device_index, and device_type in pack and unpack for Streams (#81596)
#75854

A naive attempt at working around the limitations of using a single 64-bit integer to pack `stream_id`, `device_index`, and `device_type`.

Stills needs sanity checks, testing, and minimization of BC-breaking changes.

Currently a Holder for the `StreamData3` struct is used for `IValue` compatibility. While doing this seems to work for `ivalue.h` and `ivalue_inl.h`, this doesn't seem to be naively working for the JIT CUDA stream wrapper (something about ambiguous calls if an `intrusive_ptr` to `c10::ivalue::StreamData3Holder` is used as the return type for `pack()`). It turns out that the methods required to access the fields for rematerializing a CUDA Stream are basically already present anyway, so `pack` is simply removed in the wrapper for now and the methods to access the required fields are called directly.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81596
Approved by: https://github.com/ezyang
2023-01-12 14:16:49 +00:00
a2368a7c13 [dynamo] delegate handling of len() of TensorVariable to size(0) (#92016)
We delegate the handling logic of __len__ in TensorVariable to size(0). This seems to also fix several expected failures that are related to len().
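For context, the eager-mode equivalence that the delegation relies on:

```python
import torch

t = torch.zeros(3, 4)
# len() of a tensor is its size along dim 0
assert len(t) == t.size(0) == 3
```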

Fixes #91901

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92016
Approved by: https://github.com/ezyang
2023-01-12 13:40:48 +00:00
3ab58fd5ed optimize sampled_addmm performance on CPU (SparseCSR) (#90978)
### Target and Background
This PR is improving the performance of `sampled_addmm` on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference.

The current implementation is a reference design which converts the `SparseCSR` tensor back to a dense tensor, does the addmm, and then converts back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (converting to dense would trigger `OOM`).
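For reference, a tiny usage sketch of the op (small shapes chosen for illustration only; the benchmark below uses ogb-products):

```python
import torch

sample = torch.eye(4).to_sparse_csr()   # output sparsity pattern, given as a CSR tensor
mat1 = torch.randn(4, 8)
mat2 = torch.randn(8, 4)
# out = beta * sample + alpha * (mat1 @ mat2), restricted to sample's sparsity pattern
out = torch.sparse.sampled_addmm(sample, mat1, mat2)
print(out)
```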

### Benchmarks

Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where:

* number of nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128

So if we store the **adjacency matrix** as dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, which will OOM with the current code. I extract the first 1k rows to compare: **1100x** speedup:

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.
```
### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!

### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!

### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90978
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2023-01-12 12:04:07 +00:00
81f7c40612 Cleanup some unused includes (#91961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91961
Approved by: https://github.com/lezcano
2023-01-12 11:53:52 +00:00
8acf0e62d0 Use c10 math constants consistently in Math.h (#91967)
On MSVC the `M_` constants are hidden behind the `USE_MATH_DEFINES` macro, so
it's better to avoid them in headers otherwise the include order can break
compilation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91967
Approved by: https://github.com/malfet
2023-01-12 11:53:52 +00:00
c7a22bb7c7 Revert "Add check-sparse-tensor-invariants flag to Context. (#90849)"
This reverts commit b9a035c1c58630f3eef5242cb4849881b8376b39.

Reverted https://github.com/pytorch/pytorch/pull/90849 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-01-12 09:58:16 +00:00
05d0c4cee3 [functorch] Fix proxy unwrapping for cond(). (#91907)
In control_flow.cond(), we unwrap arguments' proxies using a
get_proxy_slot() call, which in the end calls a lambda to get the stored
proxy. For SymInt and SymFloat we hide the proxy under a thunk instead
of storing it on the .proxy attribute directly, therefore we need to
special-case SymInt for unwrapping here.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91907
Approved by: https://github.com/ezyang
2023-01-12 08:45:12 +00:00
a76bc410df Fix _foreach_norm on some tensor sizes (#91844)
This PR fixes 2 bugs with CUDA `_foreach_norm`:

1. Wrong norm when tensors are larger than kChunkSize = 65536
```
>>> torch._foreach_norm([torch.ones(60000, device="cuda") for _ in range(1)])
(tensor(244.9490, device='cuda:0', grad_fn=<NotImplemented>),)
>>> torch._foreach_norm([torch.ones(70000, device="cuda") for _ in range(1)])
(tensor(256., device='cuda:0', grad_fn=<NotImplemented>),)

>>> torch.ones(60000, device="cuda").norm()
tensor(244.9490, device='cuda:0', grad_fn=<LinalgVectorNormBackward0>)
>>> torch.ones(70000, device="cuda").norm()
tensor(264.5751, device='cuda:0', grad_fn=<LinalgVectorNormBackward0>)
```

2. Error when a tensor numel is smaller than the number of tensors

```
>> torch._foreach_norm([torch.ones(9, device="cuda") for _ in range(10)])
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
IndexError: select(): index 9 out of range for tensor of size [9] at dimension 0
```

This bug could have been caught by tests if `PYTORCH_TEST_WITH_SLOW` was 1, because it would have tested tensors of size 300*300=90000. It's not enabled by default; does someone know if it's ever enabled?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91844
Approved by: https://github.com/ngimel
2023-01-12 05:48:01 +00:00
44413f2525 properly convert fill value to x dtype in constant_pad (#92045)
Fixes #92038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92045
Approved by: https://github.com/desertfire
2023-01-12 05:41:10 +00:00
fb38b9ff2a [cuBLAS][TF32] Fix TF32 get/set test when TORCH_ALLOW_TF32_CUBLAS_OVERRIDE is set (#92052)
Follow up of #85859 to fix the test for when the environment variable is set.

CC @xwang233 @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92052
Approved by: https://github.com/ngimel
2023-01-12 05:36:06 +00:00
ffbd13b654 Fix for swap_custom_module_to_observer doing duplicate swaps on the same node.target (#91905)
Summary:
This is a fix for the following issue:
"When two nodes in a model have the same dTypes / node.target, the torch quantization prepare_fx flow does not check for duplicates and tries to do a custom module swap twice. When it attempts the swap the same target for a second time, the swap_custom_module_to_observed detects the observed module instead of the float module class on the target, and fails on an assertion. "

The added unit test demonstrates a simple example where it fails in absence of this fix.

Test Plan: buck test mode/dev //caffe2/test:quantization_fx -- --exact 'caffe2/test:quantization_fx - test_custom_module_class_input_has_duplicate_nodes (quantization.fx.test_quantize_fx.TestQuantizeFx)'

Reviewed By: vkuzo

Differential Revision: D42023273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91905
Approved by: https://github.com/jerryzh168
2023-01-12 05:24:38 +00:00
ccd8b66b0a [testing] add ErrorInputs for adaptive_{avg, max}_poolnd (#90924)
Ref: https://github.com/pytorch/pytorch/pull/88906#discussion_r1040157313

Covers:
- [x] adaptive_avg_pool1d
- [x] adaptive_avg_pool2d
- [x] adaptive_avg_pool3d
- [x] adaptive_max_pool1d
- [x] adaptive_max_pool2d
- [x] adaptive_max_pool3d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90924
Approved by: https://github.com/mruberry
2023-01-12 05:24:01 +00:00
6cfaa92239 Handle tensor default func args when inlining (#90575)
Handle tensor default func/method args when inlining

    Previously, when inlining a function, its default arguments
    were only wrapped with VariableTrackers if non-tensor. Now,
    tensor default args are also handled by adding them to the
    parent InstructionTranslator as an attribute.

    - also patches up a missing source in nnmodule call_function,
      needed to properly guard on a default arg in its methods
    - adds new 'DefaultsSource' type which guards either a `__defaults__`
      or `__kwdefaults__` entry on a function

Fixes #90361  https://github.com/pytorch/torchdynamo/issues/1968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90575
Approved by: https://github.com/voznesenskym
2023-01-12 05:04:18 +00:00
8e2e648f84 Propagate sources in VariableBuilder and add SuperSource (#91729)
**Motivation**
When adding support for default args (#90575), a lot of VariableTrackers missing sources were encountered.  Currently, in a lot of cases it seems OK to skip the source for VariableTrackers created (especially during inlining), but that assumption breaks down when inlining functions with default arguments.

**Summary** of changes
- propagate the self.source of the VariableBuilder to the new variables being built, which seems like it was an omission previously
- Add SuperSource to track usages of super(), so that SuperVariables can support function calls with default args

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91729
Approved by: https://github.com/ezyang
2023-01-12 05:04:18 +00:00
07e595e88a Add device_idx to free_fn in CUDAPluggableAllocator (#91398)
This was requested by nvidia folks, track also the device_id in the free function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91398
Approved by: https://github.com/albanD
2023-01-12 05:03:48 +00:00
723d7641e2 [vision hash update] update the pinned vision hash (#91744)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91744
Approved by: https://github.com/pytorchbot
2023-01-12 04:08:10 +00:00
18677d5249 sparse_mask: faster, with support for uncoalesced mask (#91964)
This PR updates `sparse_mask` to be:
* about 30% faster on CUDA.
* able to support uncoalesced masks.
* much shorter code-wise.
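A minimal usage sketch of the uncoalesced-mask case, which is the newly supported one (a small hand-written example, not the benchmark code):

```python
import torch

dense = torch.arange(9.).reshape(3, 3)
idx = torch.tensor([[0, 0, 2], [1, 1, 0]])           # note the repeated (0, 1) entry
mask = torch.sparse_coo_tensor(idx, torch.ones(3), (3, 3))
assert not mask.is_coalesced()                        # uncoalesced mask
out = dense.sparse_mask(mask)                         # dense's values at mask's indices
print(out)
```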

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91964
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-01-12 04:02:05 +00:00
3305265962 [FSDP] Clarify MixedPrecision docs (#91974)
New docs:
![Screen Shot 2023-01-10 at 8 07 19 PM](https://user-images.githubusercontent.com/31054793/211694428-c8ebf210-85c5-4b8a-a174-ee8022d8b8fd.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91974
Approved by: https://github.com/zhaojuanmao
2023-01-12 03:41:58 +00:00
8612ec5b90 Implement hybrid sparse to/from dense conversions. (#90177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90177
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-01-12 03:31:30 +00:00
e1bcbbf18c [Quant] make x86 the default quantization backend (qengine) (#91235)
**Summary**
Make x86 the default quantization backend (qengine) for X86 CPU platforms.
X86 is a unified quantization backend combining goodness of fbgemm and onednn. For more details please see https://github.com/pytorch/pytorch/issues/83888
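A quick way to inspect or override the active engine (a sketch; the exact list of supported engines depends on the build):

```python
import torch

print(torch.backends.quantized.supported_engines)  # build-dependent, e.g. includes 'x86' and 'fbgemm'
print(torch.backends.quantized.engine)             # 'x86' on x86 CPU builds after this change
torch.backends.quantized.engine = "fbgemm"         # the previous default can still be selected
```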

**Test plan**
python test/test_quantization.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91235
Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper, https://github.com/malfet
2023-01-12 02:14:28 +00:00
5766764d6c [functorch] Fix map() operator behavior. (#91906)
3 fixes made to control_flow.map:
1. argument list won't accept torch.nn.Module anymore, only Tensors.
2. during tracing we call new_empty from the returned sample output
instead xs to correctly inherit tensor metadata.
3. for FakeTensorMode we implement map() using new_empty() as well
instead of torch.stack() to preserve symbolic shape output.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91906
Approved by: https://github.com/tugsbayasgalan
2023-01-12 01:54:34 +00:00
b8252e07c7 [Reland] add DisableTorchFunction that matches DisableTorchDispatch (#88219) (#92012)
Reland of #88219

Closes #87990. This implements a new disable guard that matches DisableTorchDispatch (disables all subclasses and modes)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92012
Approved by: https://github.com/albanD
2023-01-12 01:27:47 +00:00
6676193b5e [frontend] Expose real_type getter for torch.Argument (#91938)
Exposing an API to get real_type from an Argument. This is useful for Argument types such as SymInt.

Differential Revision: [D42425661](https://our.internmc.facebook.com/intern/diff/D42425661/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91938
Approved by: https://github.com/ezyang
2023-01-12 01:26:50 +00:00
dc6916b341 optimize gather performance for gnn usage on CPU (#87586)
In the classic PyG use case for message passing, `gather` has an `index` tensor in a broadcasted shape, e.g. with shape `5000, 128` and stride `[1, 0]`. That indicates gather is done on each row of the self tensor. The current implementation will try to parallelize on the inner dimension, which gives bad performance on CPU and cannot be vectorized.
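For illustration, such a broadcasted index can be built with `expand` (small shapes; a sketch, not the PyG code):

```python
import torch

src = torch.randn(100, 8)
row = torch.randint(0, 100, (5, 1))
index = row.expand(5, 8)             # shape (5, 8), stride (1, 0): broadcast along dim 1
print(index.stride())                # (1, 0)
out = torch.gather(src, 0, index)    # equivalent to selecting whole rows of src
```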

This PR addresses this use case and optimizes in a similar manner to index_select: parallelize on the outer dimension of `index` and do a vectorized copy on the inner dimension.

Performance benchmarking on Xeon Ice Lake (single socket) on `GCN`: `gather` is reduced from `150.787ms` to `10.926ms`. After this optimization, `gather` will no longer be the major bottleneck for training of GNN models when `EdgeIndex` is in COO format.

for more details, please refer to https://github.com/pyg-team/pytorch_geometric/issues/4891#issuecomment-1288423705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87586
Approved by: https://github.com/rusty1s, https://github.com/malfet
2023-01-12 00:43:43 +00:00
f8026413f5 Fix CUDA_MAX_THREADS_PER_SM for sm_89 (#91972)
Basically the same as #88644, to fix warnings like `ptxas warning : Value of threads per SM for entry _ZN2at6native13reduce_kernelILi512ELi1ENS0_8ReduceOpIfNS0_10NormTwoffEEjfLi4EEEEEvT1_ is out of range. .minnctapersm will be ignored`

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91972
Approved by: https://github.com/ngimel
2023-01-12 00:30:27 +00:00
3613ff06b1 [MKLDNN] Rename pooling_avg to pooling_avg_exclude_padding (#90247)
**Summary**
Rename `pooling_avg` to `pooling_avg_exclude_padding` to align with onednn v3.0. It does not affect correctness or performance. Same as https://github.com/pytorch/pytorch/pull/87851 . Looks like https://github.com/pytorch/pytorch/pull/87851 did not cover all occurrences.

**Test plan**
python test/test_mkldnn.py
python caffe2/python/ideep/pool_op_test.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90247
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-01-12 00:08:30 +00:00
c537f5bee8 [ONNX] Documentation for torch.onnx.find_mismatch (#90728)
Doc preview:
* `find_mismatch`: https://docs-preview.pytorch.org/90728/onnx.html#torch.onnx.verification.find_mismatch
* `GraphInfo`: https://docs-preview.pytorch.org/90728/onnx.html#classes and https://docs-preview.pytorch.org/90728/generated/torch.onnx.verification.GraphInfo.html#torch.onnx.verification.GraphInfo
* `VerificationOptions`: https://docs-preview.pytorch.org/90728/onnx.html#classes and  https://docs-preview.pytorch.org/90728/generated/torch.onnx.verification.VerificationOptions.html#torch.onnx.verification.VerificationOptions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90728
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby
2023-01-11 23:58:57 +00:00
ed7885c254 [utils][foreach] Add group tensor by device and dtype util (#92014)
Add util that will be commonly used throughout optim
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92014
Approved by: https://github.com/albanD
2023-01-11 23:37:20 +00:00
af242eedfb [Inductor] Added aten.uniform_ decomp (#90869)
Fixes #90815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90869
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano, https://github.com/ngimel, https://github.com/albanD
2023-01-11 23:23:42 +00:00
f40777e4ad [Dynamo] Fix guard bug when np.float used in control flow (#91991)
Fixes 14k github models: https://github.com/jansel/pytorch-jit-paritybench/blob/master/generated/test_Sanster_lama_cleaner.py#L2392

Error
```
File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/guards.py", line 263, in CONSTANT_MATCH
    self.EQUALS_MATCH(guard)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/guards.py", line 197, in EQUALS_MATCH
    assert istype(
AssertionError: float64
```

```np.float``` is unspecialized by default, which is guarded with ```TYPE_MATCH```. However, it will be baked in when used in control flow, which requires an ```EQUALS_MATCH``` guard. We should make ```EQUALS_MATCH``` support ```np.float```.
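A hypothetical repro sketch of the pattern (using `np.float64` here since `np.float` itself is a deprecated alias; not the exact code from the 14k-model suite):

```python
import numpy as np
import torch

def fn(x, scale):
    # the numpy scalar is unspecialized (TYPE_MATCH) until it reaches control flow,
    # where its value is baked in and therefore needs an EQUALS_MATCH guard
    if scale > 0.5:
        return x * scale
    return x - scale

compiled = torch.compile(fn)
print(compiled(torch.ones(3), np.float64(0.75)))
```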

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91991
Approved by: https://github.com/jansel
2023-01-11 23:16:56 +00:00
8007c2d96a Python Script Object to IValue (#91776)
Summary: * When we try to port a py obj of a script module/obj to C++, `tryToInferType` is flawed in providing type inference metadata, but changing it would break the normal torch.jit.script flow, so we try to extract the IValue from the py obj value.

Test Plan: NA

Reviewed By: PaliC

Differential Revision: D41749823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91776
Approved by: https://github.com/842974287
2023-01-11 23:06:57 +00:00
8b00c54425 Add utility report_compile_source_on_error (#91069)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91069
Approved by: https://github.com/soumith, https://github.com/albanD
2023-01-11 22:54:46 +00:00
4e21fc2075 In inductor triton generated code, avoid masking when numel=1 (#91254)
This is implementing an idea from @lezcano : if we have a generated triton kernel with `xnumel=1`, then `xmask` is just `0<1` and can be dropped from all `load`/`store`/`where`.

The `xnumel=1` case actually comes up relatively often when code for reductions is being generated. @lezcano reported some  performance gains in micro-benchmarks (see comment below) and it is a very simple change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91254
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-01-11 22:40:06 +00:00
33e3c9ac67 Not explicitly set the manifest filename in Windows (#91988)
I'm at a loss to explain why this happens, but not setting the manifest file explicitly in the linker fixes it.

### Testing locally

* With `/MANIFESTFILE:bin\torch_python.dll.manifest`
```
C:\PROGRA~2\MICROS~2\2019\BUILDT~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_python.rsp /out:bin\torch_python.dll /implib:lib\torch_python.lib /pdb:bin\torch_python.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /NODEFAULTLIB:LIBCMT.LIB -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/onnx.lib /MANIFEST /MANIFESTFILE:bin\torch_python.dll.manifest

LINK : fatal error LNK1000: Internal error during CImplib::EmitImportThunk
```

* Work fine without the flag
```
C:\PROGRA~2\MICROS~2\2019\BUILDT~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_python.rsp /out:bin\torch_python.dll /implib:lib\torch_python.lib /pdb:bin\torch_python.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /NODEFAULTLIB:LIBCMT.LIB -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/onnx.lib /MANIFEST
```

In both cases, the `/MANIFEST` flag is set, so the manifest file is there.  In the latter case, the filename comes from appending the `.manifest` suffix to `bin\torch_python.dll`.  Thus, it is still correctly `bin\torch_python.dll.manifest`.  Weird.

```
C:\actions-runner\_work\pytorch\pytorch>ls -la build/bin/torch_*
-rwxr-xr-x 1 runneruser 197121 246796288 Jan 11 04:30 build/bin/torch_cpu.dll
-rw-r--r-- 1 runneruser 197121       381 Jan 11 04:26 build/bin/torch_cpu.dll.manifest
-rwxr-xr-x 1 runneruser 197121      9728 Jan 11 03:55 build/bin/torch_global_deps.dll
-rw-r--r-- 1 runneruser 197121       381 Jan 11 03:55 build/bin/torch_global_deps.dll.manifest
-rwxr-xr-x 1 runneruser 197121  11746816 Jan 11 04:31 build/bin/torch_python.dll
-rw-r--r-- 1 runneruser 197121       381 Jan 11 04:30 build/bin/torch_python.dll.manifest
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91988
Approved by: https://github.com/malfet, https://github.com/Blackhex, https://github.com/ZainRizvi
2023-01-11 22:28:08 +00:00
a155f64957 Update _optim_utils.py (#91935)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91935
Approved by: https://github.com/awgu, https://github.com/fegin
2023-01-11 22:06:26 +00:00
28c736a424 Third batch of canonical aten ops (#91995)
Following aten ops appears as high frequency ops in the 14k github crawl model, and they don't have decomps:
https://github.com/jansel/pytorch-jit-paritybench

as_strided
floor
select.int
topk
max_pool3d_with_indices
reflection_pad2d
replication_pad2d
replication_pad3d

Full dump of aten ops from 14k model can be found here: https://docs.google.com/spreadsheets/d/1sEt0HD-0YAF5lfdOUPPZd2xIvwPL0emE7GaiqgMaTSM/edit?usp=sharing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91995
Approved by: https://github.com/ezyang
2023-01-11 21:36:17 +00:00
d0bfd79f3d Make ModuleList derive from Sequence[T] and type it appropriately (#89135)
I see https://github.com/pytorch/pytorch/issues/53103 says this might be problematic, but I'm a bit confused at this point, because it looks like ModuleList does in fact already adhere to the Sequence API

The big win here is that for homogeneous ModuleLists, you now get typing for individual members, e.g.
`ModuleList([Linear(), Linear(), Linear()])[1]` properly has type `Linear`

If this looks good, I can do a followup PR to do similarly for `ModuleDict` and `Parameter[List,Dict]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89135
Approved by: https://github.com/albanD
2023-01-11 21:21:32 +00:00
c5836153f5 Revert "optimize sampled_addmm performance on CPU (SparseCSR) (#90978)"
This reverts commit 645fb217c06348a4f1ccdf68a93bd711f7158c62.

Reverted https://github.com/pytorch/pytorch/pull/90978 on behalf of https://github.com/seemethere due to This broke internal builds for android due to the new file added being missing in build_variables.bzl
2023-01-11 20:12:12 +00:00
74cbf058a5 Support --dynamic-ci-skips (#91893)
This makes it easier for us to run only the skipped benchmarks and
see if that actually started passing.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91893
Approved by: https://github.com/albanD
2023-01-11 20:02:58 +00:00
83e6e9dde3 Disable NVFuser in internal (Meta) build (#91836)
In preparation for https://github.com/pytorch/pytorch/pull/89621.

The build changes in #89621 would require re-writing the internal build
in order to get NVFuser support. As-is, #89621 would disable NVFuser in
the internal build; so I would need to add some internal-only changes
associated with the internal copy of the PR (not visible from github) to
fix the internal build.

However, I don't think NVFuser is actually being used internally
anywhere at the moment, so it may be easier to land #89621 as is, and
then we can fix the internal build later if needed. To verify that, I
want to land this PR instead to flush out any issues caused by disabling
NVFuser. If the PR lands without issues, then we can move on to landing #89621.
If the PR breaks things internally, then it will need to be reverted;
and that will probably be easier than having to revert and reland #89621.

Differential Revision: [D42398050](https://our.internmc.facebook.com/intern/diff/D42398050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91836
Approved by: https://github.com/jjsjann123
2023-01-11 19:33:10 +00:00
4806a9e7f6 Remove DL_RUNTIME_BUG (#91960)
This macro was made a no-op in #61903 and so we should clean up the surrounding
boiler-plate which used it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91960
Approved by: https://github.com/lezcano
2023-01-11 18:26:41 +00:00
6287bb78dc [static-runtime] clamp fast_sigmoid result into (0,1) range (#91993)
fast_sigmoid uses fast_tanh under the hood which is not precise;
the op outputs are treated as probability-like numbers;
in a reeeally small percentage of cases the outputs fell out of acceptable range for probabilities

Test Plan: ci

Differential Revision: D42445821

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91993
Approved by: https://github.com/davidberard98
2023-01-11 17:41:42 +00:00
d8e795ecd5 [modes] make python arg parser also check for python key (#91573)
Fixes #90652

Previously, we had assumed that the only way to call `handle_torch_function_no_python_arg_parser` was through the Python key. This is no longer true with FakeTensor. Specifically, `_like` functions will call `.device()` on FakeTensors when the args list is being parsed. In order to respect that the mode stack shouldn't run when the python key is off, this just adds a check that the python key is on (or the torch_function equivalent) to that function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91573
Approved by: https://github.com/ezyang
2023-01-11 15:19:43 +00:00
702838637d [Modes] Add assert that the mode isn't already on the stack (#90770)
Redo of #89726 on a clean PR, thanks @voznesenskym for the first draft!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90770
Approved by: https://github.com/ezyang
2023-01-11 15:19:43 +00:00
8b3c4bc481 [stateless] add weight tying support (#90477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90477
Approved by: https://github.com/zou3519
2023-01-11 15:19:09 +00:00
e03ac0ee8c Add bf16 and change header file include path (#91838)
# Motivation
We would like to add the bfloat16 header file to PyTorch to make PyTorch and Intel extension for PyTorch support the bfloat16 data type.

# Solution
- Note that bfloat16 is an Intel extension implementation in the DPC++ compiler instead of standard SYCL, we need to guarantee the bfloat16 header can be included only using the DPC++ compiler. Please refer to [sycl 2020 feature test macros](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_feature_test_macros). Intel DPC++ compiler uses [SYCL_EXT_ONEAPI_BFLOAT16_MATH_FUNCTIONS](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc) to check bfloat16 feature.
- Refer to [intel/llvm](59dd38795c/clang/lib/Basic/Version.cpp (L129)). SYCL_LANGUAGE_VERSION is defined in both SYCL 1.2.1 and SYCL 2020. But only CL_SYCL_LANGUAGE_VERSION is defined in SYCL 1.2.1. So we should check CL_SYCL_LANGUAGE_VERSION first for SYCL 1.2.1. If it is not defined then check SYCL_LANGUAGE_VERSION for SYCL 2020. This will guarantee to be compatible with SYCL 1.2.1 and SYCL 2020.

# Additional
No need UT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91838
Approved by: https://github.com/ezyang
2023-01-11 15:18:56 +00:00
d24324bf1d s/INDCUTOR/INDUCTOR/ (#91885)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91885
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/malfet
2023-01-11 12:28:19 +00:00
84b819d083 Preventing crashing in case of no network by loading from cache (#91569)
Fixes #91568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91569
Approved by: https://github.com/NicolasHug
2023-01-11 11:56:46 +00:00
850cf8949a enable conj() for sparse compressed tensors (#91695)
Fixes https://github.com/pytorch/pytorch/issues/91631.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91695
Approved by: https://github.com/pearu, https://github.com/cpuhrsch, https://github.com/albanD
2023-01-11 11:46:50 +00:00
56ed976edf hrnet_w18, tts_angular works with dynamic shapes (#91891)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91891
Approved by: https://github.com/voznesenskym
2023-01-11 11:40:16 +00:00
d7dc1c2fd5 Support zero dimensions in softmax decompositions (#91322)
The eager implementation of softmax supports computation along zero dimensions, but many of the other implementations did not, including:
* decompositions & refs (this was causing dynamo failures)
* forward AD for logsumexp
* MPS log_softmax_backward

This PR handles the `input.numel() == 0` cases separately to avoid running `amax()`, which fails for zero dimensions, and updates opinfos.

example of "computation along zero dimensions":

```python
# example of computation along a zero-size dimension: eager passes, the ref fails
import torch

t = torch.rand((4, 0, 0))
print("~")
print(torch.nn.functional.softmax(t, dim=-1))  # this passes
print("~")
torch._refs.softmax(t, dim=-1)  # this fails
print("~")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91322
Approved by: https://github.com/lezcano
2023-01-11 09:35:43 +00:00
afd8dd085f replace vec::vec_scalar_t with at::opmath_type (#91086)
### Motivation
The two accumulation types vec::vec_scalar_t and at::opmath_type are duplicated, so we replace vec::vec_scalar_t with at::opmath_type, and vec::vec_scalar_t will be deprecated later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91086
Approved by: https://github.com/jgong5, https://github.com/mingfeima
2023-01-11 09:08:01 +00:00
3790b50505 inductor: fix .to(memory_format) issue which doesn't generate the right stride (#91948)
Motivation: for **.to(memory_format)**, the inductor doesn't generate the right stride; see the following example:
```
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        x = x.to(memory_format=torch.contiguous_format)
        return x
```

the generated code doesn't do the memory format change and gets a wrong stride **(802816, 1, 14336, 256)**, which is not a contiguous stride.

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    return (arg0_1, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((128, 256, 56, 56), (802816, 1, 14336, 256), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1]))
```

After this PR, the generated code will do the memory format change:

```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<128; i0+=1)
            {
                #pragma GCC ivdep
                for(long i1=0; i1<256; i1+=1)
                {
                    #pragma GCC ivdep
                    for(long i2=0; i2<3136; i2+=1)
                    {
                        auto tmp0 = in_ptr0[i1 + (256*i2) + (802816*i0)];
                        out_ptr0[i2 + (3136*i1) + (802816*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf1 = empty_strided((128, 256, 56, 56), (802816, 3136, 56, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf1.data_ptr()))
    del arg0_1
    return (buf1, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((128, 256, 56, 56), (802816, 1, 14336, 256), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1]))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91948
Approved by: https://github.com/ngimel
2023-01-11 08:23:26 +00:00
92855a215b [SDPA] Guard mem efficient attention in deterministic mode (#91979)
# Summary
Memory efficient attention is a non deterministic algorithm.

This PR ensures that sdp_choice will allow memory-efficient attention to be used as the backend for SDPA if we are in warn-only mode. Otherwise, if we have enabled determinism and set warn_only to False, sdp_choice will not return memory-efficient attention as the backend.
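A rough usage sketch of the described behavior (assumes a CUDA device and `scaled_dot_product_attention`; not the test added by this PR):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda") for _ in range(3))

torch.use_deterministic_algorithms(True, warn_only=True)
out = F.scaled_dot_product_attention(q, k, v)   # mem-efficient backend may still be picked

torch.use_deterministic_algorithms(True, warn_only=False)
out = F.scaled_dot_product_attention(q, k, v)   # mem-efficient backend is not selected
```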
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91979
Approved by: https://github.com/cpuhrsch
2023-01-11 07:40:31 +00:00
d540442e36 [ONNX] Fix 'prim::PackPadded' shape inference (#91829)
In the `peephole` pass, user nodes of the output of `prim::PackPadded` are modified to consume
the input of `prim::PackPadded` instead; hence the logic in shape type inference. However,
only the first output requires this workaround.

Fixes #91528
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91829
Approved by: https://github.com/titaiwangms
2023-01-11 07:35:55 +00:00
812d774cc9 Easy: add instructions for testing pytorch/builder (#91923)
Also makes the repo name configurable for branches in forks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91923
Approved by: https://github.com/malfet, https://github.com/seemethere
2023-01-11 07:26:46 +00:00
8f5f15a64b optimize scatter_add performance for gnn usage on CPU (#82703)
### Motivation of this PR

This PR targets improving the performance of `scatter_add` for GNN usage scenarios on PyG. Currently only CPU optimizations are covered.

`Message Passing` is the major step in GNN learning, which means exchanging/aggregating info between nodes. From the perf point of view, if the `EdgeIndex` is stored as [2, num_edges], `scatter_reduce` would be a major perf hotspot with current pytorch implementations.

To be more specific, in the process of message passing, `scatter_add` is used in a very similar way to `index_select`, except that the `self` tensor is written to while `index_select` only reads. Therefore, the `index` tensor passed to `scatter_add` is an expanded tensor on dim0, which means all the rest of the dims end up with the same value.
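For illustration, the expanded-index pattern looks roughly like this (small shapes; a sketch, not the PyG code):

```python
import torch

num_nodes, num_edges, feat = 10, 30, 8
src = torch.randn(num_edges, feat)            # per-edge messages to aggregate
edge_dst = torch.randint(0, num_nodes, (num_edges, 1))
index = edge_dst.expand(num_edges, feat)      # expanded: every column repeats the row value
out = torch.zeros(num_nodes, feat).scatter_add_(0, index, src)
```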

### Algorithm

The current scatter impl would parallelize on the inner dims for such a case, which causes bad perf: a non-contiguous memory access pattern and no vectorization.

This PR sorts the `index` to resolve the write conflicts that would occur if we directly parallelize on dim0. The algorithm is equivalent to:
* convert memory format from `COO` to `CSR`
* do spmm reduce

### Perf improvement

The benchmark comes from https://github.com/pyg-team/pytorch_geometric/tree/master/examples, `python reddit.py` which runs model SAGE on dataset reddit.

CPU type: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz

` aten::scatter_add_` has been reduced from **37.797s** to **5.989s**:

* breakdown before
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                     aten::scatter_add_        49.00%       37.797s        49.00%       37.797s      41.445ms           912
                                     aten::index_select        19.74%       15.223s        19.74%       15.227s       6.678ms          2280
                                           aten::linear         0.01%       5.706ms        15.04%       11.602s      12.721ms           912
                                            aten::addmm         6.62%        5.108s         7.92%        6.112s      13.403ms           456
                                           aten::matmul         0.00%       2.339ms         7.10%        5.475s      12.006ms           456
```

* breakdown after
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                     aten::index_select        32.41%       14.677s        32.42%       14.681s       6.439ms          2280
                                           aten::linear         0.01%       6.665ms        26.43%       11.968s      13.123ms           912
                                            aten::addmm        11.76%        5.328s        13.76%        6.232s      13.667ms           456
                                     aten::scatter_add_        13.22%        5.989s        13.22%        5.989s       6.566ms           912
                                           aten::matmul         0.01%       2.303ms        12.63%        5.720s      12.543ms           456
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82703
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-01-11 05:55:09 +00:00
364f526b9c [Inductor] assert generator for random, dropout (#91833)
See comment https://github.com/pytorch/pytorch/pull/90869#discussion_r1063731541 , https://github.com/pytorch/pytorch/pull/91673#discussion_r1061099337.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91833
Approved by: https://github.com/jansel
2023-01-11 03:24:10 +00:00
554a796aef Implement torch._foreach_lerp (#87562)
As per title.

- [ ] ~~Q: Do we want `torch._foreach_lerp.ScalarList` as well?~~
- [ ] ~~we might want to have `ATen/native/cuda/lerp.cuh` and include it in `ATen/native/cuda/Lerp.cu` and `ATen/native/cuda/ForeachTernaryOp.cu`~~

Related:
- https://github.com/pytorch/pytorch/issues/58833
- https://github.com/pytorch/pytorch/issues/71683

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87562
Approved by: https://github.com/ngimel
2023-01-11 02:52:04 +00:00
7c907bd829 Minor doc updates for S3 update procedure (#91978)
It would be good to make the s3_init_config.json instructions
more detailed (like step-by-step for how to run the custom build)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91978
Approved by: https://github.com/malfet
2023-01-11 02:36:29 +00:00
19723d754d [CUBLAS][TF32] Change cuBLAS TF32 environment variable to be initialization only (#85859)
CC @ptrblck @xwang233
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85859
Approved by: https://github.com/ngimel
2023-01-11 02:03:11 +00:00
de4e4c785a [mergebot] Fix mergebot allow revert of codev diff (#91975)
Mergebot was allowing non-facebook-github-bot users to revert codev diffs when it shouldn't be allowed.

Fixes https://github.com/pytorch/test-infra/issues/1381
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91975
Approved by: https://github.com/ZainRizvi, https://github.com/kit1980, https://github.com/malfet
2023-01-11 01:59:07 +00:00
6b542147a3 Make job names match BUILD_ENVIRONMENT (#91512)
test-times.json uses the job name as the key, but when looking up the times in CI, the BUILD_ENVIRONMENT is used because we don't have a good way of getting the job name (it usually turns out to be just "test" or "build" instead of "linux-cuda..."), so having the job names match the BUILD_ENVIRONMENT is necessary for sharding to work

Another solution might be to make the lookup more robust or look up the job name similar to how we get the job id.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91512
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/malfet
2023-01-11 01:56:18 +00:00
43050b8301 Revert "[Inductor] Added aten.uniform_ decomp (#90869)"
This reverts commit c55293d64099ac4380f5e3955a891d1d7924f327.

Reverted https://github.com/pytorch/pytorch/pull/90869 on behalf of https://github.com/huydhn due to Crossref error cannot just simply be ignored because it would break trunk for every commit after this, i.e. fd0030fe74.  The failure would need to be handled gracefully, e.g. by adding an XFAIL
2023-01-11 01:18:11 +00:00
4dcb10e027 Add missing clang-tidy fixes for modernize-use-equals-(default|delete) (#91857)
More clang-tidy for default or deleting more ctors and dtors. This is slightly more efficient and more readable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91857
Approved by: https://github.com/ezyang
2023-01-11 01:16:05 +00:00
b9a035c1c5 Add check-sparse-tensor-invariants flag to Context. (#90849)
This PR adds "check sparse tensor invariants" flag to Context that when enabled will trigger sparse tensor data invariants checks in unsafe methods of constructing sparse COO/CSR/CSC/BSR/BSC tensors. The feature includes the following changes to UI:

- `torch.enable_check_sparse_tensor_invariants` and `torch.is_check_sparse_tensor_invariants_enabled` functions to globally enable/disable the invariant checks and to retrieve the state of the feature, respectively
- `torch.sparse_coo/csr/csc/bsr/bsc/compressed_tensor` functions have a new optional argument `check_invariants` to enable/disable the invariant checks explicitly. When the `check_invariants` argument is specified, the global state of the feature is temporarily overridden.
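A minimal sketch based on the names above (hedged: the per-call `check_invariants` argument is as described; the exact call form of the global toggle may differ from what is sketched in the comments):

```python
import torch

i = torch.tensor([[0, 1, 1], [2, 0, 2]])
v = torch.tensor([3.0, 4.0, 5.0])

# per-call opt-in: validate indices/values while constructing the sparse tensor
t = torch.sparse_coo_tensor(i, v, (2, 3), check_invariants=True)

# the global toggle described above is exposed (per this PR's description) as
# torch.enable_check_sparse_tensor_invariants() and
# torch.is_check_sparse_tensor_invariants_enabled()
```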

The PR also fixes https://github.com/pytorch/pytorch/issues/90833

# Main issue

*The following content is outdated after merging the PRs in this ghstack but kept for the record.*

The importance of this feature is that when enabling the invariants checks by default, say, via

<details>

```
$ git diff
diff --git a/torch/__init__.py b/torch/__init__.py
index c8543057c7..19a91d0482 100644
--- a/torch/__init__.py
+++ b/torch/__init__.py
@@ -1239,3 +1239,8 @@ if 'TORCH_CUDA_SANITIZER' in os.environ:

 # Populate magic methods on SymInt and SymFloat
 import torch.fx.experimental.symbolic_shapes
+
+# temporarily enable sparse tensor arguments validation in unsafe
+# constructors:
+
+torch._C._set_check_sparse_tensor_invariants(True)
```

</details>

a massive number of test failures/errors occur in test_sparse_csr.py tests:
```
$ pytest -sv test/test_sparse_csr.py
<snip>
==== 4293 failed, 1557 passed, 237 skipped, 2744 errors in 69.71s (0:01:09) ====
```
that means that we are silently constructing sparse compressed tensors that do not satisfy the sparse tensor invariants. In particular, the following errors are raised:

```
AssertionError: "resize_as_sparse_compressed_tensor_: self and src must have the same layout" does not match "expected values to be a strided and contiguous tensor"

RuntimeError: CUDA error: device-side assert triggered

RuntimeError: `col_indices[..., crow_indices[..., i - 1]:crow_indices[..., i]] for all i = 1, ..., nrows are sorted and distinct along the last dimension values` is not satisfied.

RuntimeError: expected col_indices to be a strided and contiguous tensor

RuntimeError: expected row_indices to be a strided and contiguous tensor

RuntimeError: expected values to be a strided and contiguous tensor

RuntimeError: for_each: failed to synchronize: cudaErrorAssert: device-side assert triggered

RuntimeError: tensor dimensionality must be sum of batch, base, and dense dimensionalities (=0 + 2 + 0) but got 3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90849
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-01-11 01:05:14 +00:00
949f25be0c [vmap] all, any : batching rule (#91966)
Fixes https://github.com/pytorch/functorch/issues/1060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91966
Approved by: https://github.com/srossross, https://github.com/zou3519
2023-01-11 00:45:51 +00:00
7c1c239db1 [inductor] Rewrite Triton templates + epilogue fusion (retry) (#91575)
This reverts commit 94262efc7d381ace82aa74ed2f5f5ec76f8fca95 to reland #91105 / #90738.

Fixes https://github.com/pytorch/torchdynamo/issues/2015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91575
Approved by: https://github.com/ngimel
2023-01-11 00:08:03 +00:00
6912f7c564 Update references to 1.14 to 2.0 (#91769)
There won't be a 1.14 release, so these should be updated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91769
Approved by: https://github.com/Skylion007, https://github.com/svekars, https://github.com/lezcano
2023-01-10 23:42:07 +00:00
fd0030fe74 Fix indexing_dtype_strength_reduction (#91601)
Many of the previous inductive cases were wrong (e.g. `abs`, `sq`, `div` and `truediv`).
We rewrite it using the mathematical terms that allow to prove the relevant upper
and lower bounds.

Note that the inductive step can be seen as a not-too-difficult optimisation problem
with constraints, hence the naming of the functions.

For many of the other functions, we also simplify the formulas, which will be useful
when this code is generalised to work with symbolic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91601
Approved by: https://github.com/jansel, https://github.com/eellison
2023-01-10 23:39:30 +00:00
c55293d640 [Inductor] Added aten.uniform_ decomp (#90869)
Fixes #90815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90869
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano, https://github.com/ngimel, https://github.com/albanD
2023-01-10 23:05:01 +00:00
0a677f2335 [MPS] Add testcase for copying cpu tensors into strided mps tensors (#91784)
Fixes https://github.com/pytorch/pytorch/issues/86975

If the destination is a strided MPS tensor and the source is a CPU tensor, we cannot perform a blit directly to copy the memory from the CPU tensor into the MPS tensor. We need to scatter the data into the right indices.
```
        a1 = torch.Tensor([[1,2],[3,4], [5,6]]).to(torch.device("mps"))
        b1 = torch.Tensor([-1, -1])
        a1[1:,1] = b1  # strided MPS destination / contiguous CPU source
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91784
Approved by: https://github.com/kulinseth
2023-01-10 22:45:48 +00:00
09c2b2af53 [MPS] Solve contiguous view tensors using arrayViews instead of blits (#146) (#91743)
Solve contiguous view tensors using arrayViews directly instead of performing blit or gather.

E.g in case of the following example:
```
x = torch.tensor([1,2,3,4], device="mps")
y = x[2:]
r = y + 2
```
Previously, `y` would be materialized using a gather or a blit. With this change, the memory of `y` is aliased directly using arrayViews, thus skipping the need for blit or gather.

Fixes pytorch#85297, pytorch#86048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91743
Approved by: https://github.com/razarmehr, https://github.com/kulinseth
2023-01-10 22:39:29 +00:00
4f91b8e0ee Fix typo under docs directory (#91871)
This PR fixes typo in '.rst' files under 'docs' directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91871
Approved by: https://github.com/ngimel
2023-01-10 22:33:36 +00:00
645fb217c0 optimize sampled_addmm performance on CPU (SparseCSR) (#90978)
### Target and Background
This PR is improving the performance of `sampled_addmm` on CPU device. This is part of effort for improving PyG performance on CPU for GNN training/inference.

The current implementation is a reference design which converts the `SparseCSR` tensor back to a dense tensor, does the addmm, and then converts back to `SparseCSR` again: this is going to be very slow and won't be able to run most of the datasets under https://github.com/snap-stanford/ogb (converting to dense would trigger `OOM`).

### Benchmarks

Right now we don't have any hands-on benchmark or workload to test this since this operator is not used in PyG yet. I fetched the dataset from `ogb-products` where:

* number of nodes: 2.4 * 10^6
* number of edges: 1.26 * 10^8
* number of features: 128

So if we store the **adjacency matrix** as dense, it is going to be 2.4 * 2.4 * 4 * 10^12 bytes, which will OOM with the current code. I extract the first 1k rows to compare: **1100x** speedup:

CPU: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, dual socket, 20 cores per socket.
```
### before: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1212.000 ms!

### after: run 1000 rows from the whole dataset
sampled_addmm: running dataset ogb-products first 1000 rows: each iter takes 1.102 ms!

### after: run the whole dataset
sampled_addmm: running dataset ogb-products (the whole dataset) 2449029 rows: each iter takes 873.306 ms!
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90978
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2023-01-10 22:13:35 +00:00
3aeb7127b4 Revert "Clean Up MobileOptimizerType Rewrite Flags Public API and Documentation (#91600)"
This reverts commit 370df963e062d8eb409d4426dd59b3f0cac8c3d1.

Reverted https://github.com/pytorch/pytorch/pull/91600 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2023-01-10 21:38:40 +00:00
323e0143d6 [Op Benchmark] Add Pointwise Conv2d Op Benchmark (#91918)
@bypass-github-export-checks

Pointwise Conv2d is one of the ops which we want to benchmark using different Vulkan Shaders (```conv2d_pw_2x2``` vs ```conv2d_pw_1x1```).

The configs are copied from Conv2d with the kernel parameter removed.

I considered just using the same configs but ignoring the provided kernel and hardcoding the kernel to 1 when initializing nn.Conv2d, but then in the op benchmark title, it would say kernel=3 even though that would not be the case.

Differential Revision: [D42303453](https://our.internmc.facebook.com/intern/diff/D42303453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91918
Approved by: https://github.com/mcr229
2023-01-10 21:36:37 +00:00
a6e2d76bb9 [Vulkan] Add Override Mechanism to Shader Registry (#91917)
@bypass-github-export-checks

Setting overrides in the Vulkan Shader Registry will be used with the Op Benchmark Tool to benchmark different shaders on different devices.

Differential Revision: [D41738945](https://our.internmc.facebook.com/intern/diff/D41738945/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91917
Approved by: https://github.com/mcr229
2023-01-10 21:34:26 +00:00
3d6f85c936 [Vulkan] Enable Codegen ShaderInfo Registry from GLSLT + Params YAML files (conv2d_pw) (#91916)
@bypass-github-export-checks

This diff allows for adding entries to the shader registry by specifying which op names and registry keys should map to a template codegen Shader in the codegen Shader's glslt and params yaml files.

This can be done by
- adding a REGISTER_FOR entry to the YAML file, which maps to either a tuple of (op name, list of registry keys) or null, and
- adding a ```REGISTER_FOR = $REGISTER_FOR``` line to the ShaderInfo comment in the glslt file

Ex.

YAML File:
```
conv2d_pw:
  parameter_names_with_default_values:
      ...
      REGISTER_FOR:
        - !!python/tuple ["conv2d_pw", ["catchall"]]
  parameter_values:
    - ...
      REGISTER_FOR: null
```
GLSLT File:
```
...
 * REGISTER_FOR = $REGISTER_FOR
...
```

This diff also registers the conv2d_pw_2x2 Shader under ```'conv2d_pw' → 'catchall'``` in the registry and uses ```VK_REGISTRY_KERNEL``` to retrieve the shader by looking it up in the registry.

The shader registry generated in spv.cpp now looks like
```
ShaderRegistry shader_registry = {
        {"conv2d", {{"catchall", "conv2d"}}},
        {"conv2d_pw", {{"catchall", "conv2d_pw_2x2"}}}};
```

and the generated conv2d_pw_KxK.glsl files look like:
K=1
```
...
/*
 * TILE_SIZE = (1, 1, 1)
 * WEIGHT_STORAGE = TEXTURE_2D
 * WEIGHT_STORAGE_LAYOUT = OC4,IC4,4ic,4oc
 * BIAS_STORAGE = TEXTURE_2D
 * REGISTER_FOR = None
 */
...
```
K=2
```
...
/*
 * TILE_SIZE = (2, 2, 1)
 * WEIGHT_STORAGE = TEXTURE_2D
 * WEIGHT_STORAGE_LAYOUT = OC4,IC4,4ic,4oc
 * BIAS_STORAGE = TEXTURE_2D
 * REGISTER_FOR = ('conv2d_pw', ['catchall'])
 */
...
```

Differential Revision: [D42198560](https://our.internmc.facebook.com/intern/diff/D42198560/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91916
Approved by: https://github.com/mcr229
2023-01-10 21:32:23 +00:00
67d401d1be [Vulkan] Enable Codegen ShaderInfo Registry from GLSL files (conv2d) (#91915)
@bypass-github-export-checks

This diff allows for adding entries to the shader registry by specifying which op names and registry keys should map to a Shader in the Shader's glsl file.

This can be done by adding a REGISTER_FOR line with a tuple of (op name, list of registry keys) to the ShaderInfo comment in the glsl file

Ex.
```
REGISTER_FOR = ('conv2d', ['catchall', ...])
```

This diff also registers the conv2d Shader under ```'conv2d' → 'catchall'``` in the registry and uses ```VK_REGISTRY_KERNEL``` to retrieve the shader by looking it up in the registry.

The shader registry generated in spv.cpp now looks like
```
ShaderRegistry shader_registry = {
        {"conv2d", {{"catchall", "conv2d"}}}};
```

Differential Revision: [D42197400](https://our.internmc.facebook.com/intern/diff/D42197400/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91915
Approved by: https://github.com/mcr229
2023-01-10 21:25:16 +00:00
0c3ed2ed22 [dynamo] Support dynamic slicing (#91341)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91341
Approved by: https://github.com/voznesenskym
2023-01-10 21:23:55 +00:00
3139e687db [Vulkan] Add Basic Shader Registry (#91914)
@bypass-github-export-checks

We want to be able to look up which shader to use in a registry given a particular op/algorithm name, which is what this diff enables. This is done with the newly added ```shader_registry``` map and ```look_up_shader_info``` function.

After this change, Shaders can be retrieved either with the ```VK_KERNEL``` macro, which gets the Shader with a specified name directly, or with the ```VK_REGISTRY_KERNEL``` macro, which looks up what Shader should be used for a specified algorithm name in the registry.

For now, the registry is empty and unused. In the next diffs in this stack, I will be adding support for registering a shader in the registry in GLSL and GLSLT + Params Yaml files.

I also
- Adjusted the formatting of spv.h and spv.cpp so that they are closer to what clang wants, which makes them easier to read. (proper indentation, proper order of includes, etc.)
- Moved the codegen spv/registry code from at::native::vulkan to at::native::vulkan::api (since registry.cpp / .h are in ```ATen/native/vulkan/api```)

Now spv.h looks like
```
#pragma once
#include <ATen/native/vulkan/api/Types.h>
#include <ATen/native/vulkan/api/vk_api.h>
#include <c10/util/flat_hash_map.h>
#include <string>
namespace at {
namespace native {
namespace vulkan {
namespace api {
struct ShaderInfo;
} // namespace api
typedef ska::flat_hash_map<std::string, api::ShaderInfo> ShaderListing;
typedef ska::flat_hash_map<std::string, std::string> RegistryKeyMap;
typedef ska::flat_hash_map<std::string, RegistryKeyMap> ShaderRegistry;
extern const ShaderListing shader_infos;
extern ShaderRegistry shader_registry;
inline const ShaderListing& get_shader_infos() {
  return shader_infos;
}
inline ShaderRegistry& get_shader_registry() {
  return shader_registry;
}
} // namespace vulkan
} // namespace native
} // namespace at
```
and spv.cpp looks like
```
#include <ATen/native/vulkan/api/Shader.h>
#include <ATen/native/vulkan/spv.h>
#include <stdint.h>
#include <vector>
namespace at {
namespace native {
namespace vulkan {
namespace {
const uint32_t adaptive_avg_pool2d_bin[] = {
  119734787,
  ...
};
...
const uint32_t conv2d_pw_2x2_bin[] = {
  119734787,
  ...
};
} // namespace
const ShaderListing shader_infos = {
    {"adaptive_avg_pool2d",
     api::ShaderInfo(
         "vulkan.adaptive_avg_pool2d",
         adaptive_avg_pool2d_bin,
         3204,
         {VK_DESCRIPTOR_TYPE_STORAGE_IMAGE,
          VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
          VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER},
         std::vector<uint32_t>(),
         api::StorageType::UNKNOWN,
         api::StorageType::UNKNOWN)},
    ...
    {"conv2d_pw_2x2",
     api::ShaderInfo(
         "vulkan.conv2d_pw_2x2",
         conv2d_pw_2x2_bin,
         7736,
         {VK_DESCRIPTOR_TYPE_STORAGE_IMAGE,
          VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
          VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
          VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,
          VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER},
         {2, 2, 1},
         api::StorageType::TEXTURE_2D,
         api::StorageType::TEXTURE_2D)}};
ShaderRegistry shader_registry = {
};
} // namespace vulkan
} // namespace native
} // namespace at

```
(Full File: P594112814)

Differential Revision: [D41594453](https://our.internmc.facebook.com/intern/diff/D41594453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91914
Approved by: https://github.com/mcr229
2023-01-10 21:13:17 +00:00
cd62ad5f88 [Vulkan] Enable including GLSL files from custom locations in gen_vulkan_spv (#91913)
@bypass-github-export-checks

To include custom locations when building with buck, use a ```-c gen_vulkan_spv.additional_glsl_paths="..."``` flag where ... is a list of filegroups and source directory paths separated by spaces,

ex. to include the sources added in D41413913, you would use

```
buck build ... -c gen_vulkan_spv.additional_glsl_paths="//xplat/caffe2:test_glsl_src_path_a test_src/a //xplat/caffe2:test_glsl_src_path_b test_src/b"
```

(as shown in the test plan)

Differential Revision: [D41413914](https://our.internmc.facebook.com/intern/diff/D41413914/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41413914/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91913
Approved by: https://github.com/mcr229
2023-01-10 20:35:23 +00:00
ec94cbc66a [Vulkan] Remove GLSL Code Gen (#91912)
@bypass-github-export-checks

GLSL Code Gen is not used, so this diff removes
- GLSL parts of ShaderSource
- Anything enclosed by USE_VULKAN_SHADERC_RUNTIME, as well as the flag itself
- gen_vulkan_glsl script

Plus some additional refactoring

Differential Revision: [D41358861](https://our.internmc.facebook.com/intern/diff/D41358861/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41358861/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91912
Approved by: https://github.com/mcr229
2023-01-10 20:29:47 +00:00
28eb3c8faf [Vulkan] Generate ShaderInfos Directly via Codegen in gen_vulkan_spv (#91911)
@bypass-github-export-checks

Before this change, the data members which make up a ```ShaderInfo``` sat in ```spv.h/.cpp``` in an unorganized manner. This diff changes that so the ```ShaderInfo```s are initialized directly in spv.h/.cpp.

Now spv.h looks like
```
#pragma once
#include <stdint.h>
#include <vector>
#include <string>
#include <ATen/native/vulkan/api/Types.h>
#include <ATen/native/vulkan/api/vk_api.h>
namespace at {
namespace native {
namespace vulkan {
namespace api {
struct ShaderInfo;
} // namespace api
extern const api::ShaderInfo adaptive_avg_pool2d_spv;
...
extern const api::ShaderInfo conv2d_pw_2x2_spv;
} // namespace vulkan
} // namespace native
} // namespace at
```
(Full File: P557399150)
and spv.cpp looks like
```
#include <ATen/native/vulkan/spv.h>
#include <ATen/native/vulkan/api/Shader.h>
namespace at {
namespace native {
namespace vulkan {
namespace {
const uint32_t adaptive_avg_pool2d_spv_bin[] = {
  119734787,
  ...
};
...
const uint32_t conv2d_pw_2x2_spv_bin[] = {
  119734787,
  ...
};
} // namespace
const api::ShaderInfo adaptive_avg_pool2d_spv(
  "vulkan.adaptive_avg_pool2d",
  adaptive_avg_pool2d_spv_bin,
  3204,
  {VK_DESCRIPTOR_TYPE_STORAGE_IMAGE, VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER},
  std::vector<uint32_t>(),
  api::StorageType::UNKNOWN,
  api::StorageType::UNKNOWN
);
...
const api::ShaderInfo conv2d_pw_2x2_spv(
  "vulkan.conv2d_pw_2x2",
  conv2d_pw_2x2_spv_bin,
  7736,
  {VK_DESCRIPTOR_TYPE_STORAGE_IMAGE, VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER},
  {2, 2, 1},
  api::StorageType::TEXTURE_2D,
  api::StorageType::TEXTURE_2D
);
} // namespace vulkan
} // namespace native
} // namespace at

```
(Full File: P584237146)

Differential Revision: [D41354313](https://our.internmc.facebook.com/intern/diff/D41354313/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91911
Approved by: https://github.com/mcr229
2023-01-10 20:22:30 +00:00
776fef9ecc [Vulkan] Merge ShaderSource into ShaderInfo (#91910)
@bypass-github-export-checks

```ShaderInfo``` was added by Kimish in D40280338 to be an extension of ```ShaderSource``` with extra fields. In this diff, I merge the two into one struct, using the combined struct in place of wherever either of the two was used before

Differential Revision: [D41197273](https://our.internmc.facebook.com/intern/diff/D41197273/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91910
Approved by: https://github.com/mcr229
2023-01-10 20:19:11 +00:00
370df963e0 Clean Up MobileOptimizerType Rewrite Flags Public API and Documentation (#91600)
Summary:
X-link: https://github.com/facebookresearch/d2go/pull/452

Remove MobileOptimizerType and all rewrite flags from torch.X and torch._C.X to clean up torch.X and torch._C.X namespaces

The affected rewrite flags are
- CONV_BN_FUSION
- FUSE_ADD_RELU
- HOIST_CONV_PACKED_PARAMS
- INSERT_FOLD_PREPACK_OPS
- REMOVE_DROPOUT
- VULKAN_AUTOMATIC_GPU_TRANSFER

Bc-Breaking Change:

Before this change, the rewrite flags were accessible through all of
1. torch.utils.mobile_optimizer.MobileOptimizerType.X
2. torch._C.MobileOptimizerType.X
3. torch.X
4. torch.MobileOptimizerType.X
5. torch._C.X

But after this change, only torch.utils.mobile_optimizer.MobileOptimizerType.X  (option 1 above) and the newly added torch._C._MobileOptimizerType.X remain

Corresponding updates to PyTorch Tutorial Docs are in https://github.com/pytorch/tutorials/pull/2163
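
A small sketch of the access path that keeps working after this change (the specific flags used here are just examples):
```python
from torch.utils.mobile_optimizer import MobileOptimizerType

# Option 1 above remains the supported way to reach the rewrite flags,
# e.g. when building an optimization blocklist for optimize_for_mobile.
blocklist = {MobileOptimizerType.CONV_BN_FUSION, MobileOptimizerType.REMOVE_DROPOUT}
print(sorted(flag.name for flag in blocklist))
```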

Test Plan:
```buck test caffe2/test:test_mobile_optimizer```
```
Summary
  Pass: 6
  Skip: 1
    ↻ caffe2/test:test_mobile_optimizer - test_mobilenet_optimize_for_mobile (test_mobile_optimizer.TestOptimizer)
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/4222124793514412
```
___

With temporary testing changes in D41690204:

```buck run caffe2:test_rewrite_flags_api```
Before:
```
torch.utils.mobile_optimizer.MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result: 
torch._C._MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result:  (module 'torch._C' has no attribute '_MobileOptimizerType')
torch._C.MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result: 
torch.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result: 
torch.MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result: 
torch._C.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result: 
```
After:
```
torch.utils.mobile_optimizer.MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result: 
torch._C._MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result: 
torch._C.MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result:  (module 'torch._C' has no attribute 'MobileOptimizerType')
torch.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result:  (module 'torch' has no attribute 'VULKAN_AUTOMATIC_GPU_TRANSFER')
torch.MobileOptimizerType.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result:  (module 'torch' has no attribute 'MobileOptimizerType')
torch._C.VULKAN_AUTOMATIC_GPU_TRANSFER
        Expected:  | Result:  (module 'torch._C' has no attribute 'VULKAN_AUTOMATIC_GPU_TRANSFER')
```

```buck test caffe2/test:public_bindings -- test_no_new_bindings```
```
Summary
  Pass: 1
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/7881299473114294
```

Differential Revision: D41690203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91600
Approved by: https://github.com/albanD, https://github.com/malfet
2023-01-10 20:16:53 +00:00
e9cd7e0869 [dynamo] Fix rst syntax for list (#90390)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90390
Approved by: https://github.com/soumith
2023-01-10 19:56:26 +00:00
7f2b5ea1e1 Revert "Avoid device casting for all singleton tensors in optimizer states (#91454)"
This reverts commit 1e725c97470d8cf74e85984ca997e77c76e91a18.

Reverted https://github.com/pytorch/pytorch/pull/91454 on behalf of https://github.com/janeyx99 due to Likely caused regression where checkpoint resume fails during training
2023-01-10 18:57:50 +00:00
e0b82d7d1f [MPS] Fix convolution `Source and weight input channels mismatch' crash (#91822)
Fixes crashes in conv input/weight backward passes due to NCHW / NHWC formats.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91822
Approved by: https://github.com/razarmehr
2023-01-10 18:30:18 +00:00
cdc30048e5 Fix numel() result after resizing a sparse compressed tensor. (#91831)
Fixes #91830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91831
Approved by: https://github.com/cpuhrsch
2023-01-10 18:21:07 +00:00
ce50a8de75 [CI][ROCm] add test_dataloader to CI_SERIAL_LIST (#91895)
Still working towards solving #90940 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91895
Approved by: https://github.com/huydhn
2023-01-10 16:32:39 +00:00
1892c75a45 fix norrow_copy correctness issue for non-contiguous input for cpu path(reland) (#91883)
This PR is about re-land https://github.com/pytorch/pytorch/pull/91789.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91883
Approved by: https://github.com/lezcano
2023-01-10 10:56:18 +00:00
d1cc64b2ac [primTorch] Fix masking in logsumexp ref (#91941)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91941
Approved by: https://github.com/ngimel, https://github.com/lezcano
2023-01-10 10:55:04 +00:00
498be7ed25 Revert "Refactor stack_trace preservation for node meta preservation (#90803)"
This reverts commit 0f1302eeaed3b10ab6db493c1c33797a6ec46866.

Reverted https://github.com/pytorch/pytorch/pull/90803 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-01-10 10:44:28 +00:00
c887837ec3 Reland "Fix dynamo handling for tensor attributes: T, H, mT, mH (#90463)" (#91897)
This reverts commit 84266ae6701c95fd76b50101e07981b1ef6dfe33.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91897
Approved by: https://github.com/ngimel
2023-01-10 08:16:07 +00:00
3726d23219 Torch package support in dynamo (#91821)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91821
Approved by: https://github.com/suo, https://github.com/malfet
2023-01-10 06:53:15 +00:00
ae2e755f15 RM (unused?) has_mutation (#91931)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91931
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-01-10 06:50:37 +00:00
32e9b29ce9 [pruning][core][feature] Add in SaliencyPruner to pruner._experimental (#91814)
Summary:

This PR adds in SaliencyPruner, an implementation of L1 norm pruning for structured pruning, as well as additional tests for the SaliencyPruner
The README.md references this file but I forgot to add it in earlier when writing the tutorial.
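
As a rough illustration of the idea (this is not the SaliencyPruner API itself, just a sketch of the L1-norm saliency criterion it implements):
```python
import torch

# Rank the output rows of a weight matrix by their L1 norm ("saliency") and
# keep only the top-k rows, which is the essence of L1 structured pruning.
weight = torch.randn(8, 16)
saliency = weight.abs().sum(dim=1)          # L1 norm per output channel
keep = saliency.topk(k=4).indices
mask = torch.zeros(weight.size(0), dtype=torch.bool)
mask[keep] = True
pruned_weight = weight * mask.unsqueeze(1)  # zero out the least salient rows
```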

Test Plan:
```
python test/test_ao_sparsity.py -- TestSaliencyPruner
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91814
Approved by: https://github.com/jerryzh168
2023-01-10 04:04:55 +00:00
42a63a7ed9 Dynamo.export uses dynamic=True for symbolic tracing (#91899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91899
Approved by: https://github.com/ezyang
2023-01-10 01:12:22 +00:00
4919e11900 fixing test_batch_norm_implicit_dtype_promotion (__main__.TestNvFuserDynamo) (#91541)
Patches the missing pin_memory argument on full::meta_impl. This is not a functional break, but it does cause a test failure, because the test asserts that no warning is emitted:
`python test/test_nvfuser_dynamo.py -k test_batch_norm_implicit_dtype_promotion`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91541
Approved by: https://github.com/malfet
2023-01-10 00:31:36 +00:00
67f965b15a Add Skylion007 as a core reviewer (#91890)
Skylion007 has been diligently improving the state of our C++ code
to follow best practices and make it possible to run lint on it (at the
moment the code is so messy it cannot be linted), and I would like to
give him review permissions to facilitate this work.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91890
Approved by: https://github.com/Skylion007, https://github.com/soumith
2023-01-10 00:27:18 +00:00
b0f359a3c9 Disable win vs2019 cpu build+test until we figure out the linker crash (#91932)
`win-vs2019-cpu-py3 / build` builds are failing consistently right now with a linker crash.  Tracked by sev https://github.com/pytorch/pytorch/issues/91933

Disable those workflows to mitigate the damage until we figure out a root cause.

Example: [win-vs2019-cpu-py3 / build](https://github.com/pytorch/pytorch/actions/runs/3877976332/jobs/6614897752#logs)

Exact error:
```
FAILED: bin/torch_python.dll lib/torch_python.lib
cmd.exe /C "cd . && C:\Jenkins\Miniconda3\Library\bin\cmake.exe -E vs_link_dll --intdir=caffe2\torch\CMakeFiles\torch_python.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100190~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100190~1.0\x64\mt.exe --manifests  -- C:\PROGRA~2\MICROS~2\2019\BUILDT~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_python.rsp  /out:bin\torch_python.dll /implib:lib\torch_python.lib /pdb:bin\torch_python.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO  /NODEFAULTLIB:LIBCMT.LIB  -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/onnx.lib  && cd ."
LINK: command "C:\PROGRA~2\MICROS~2\2019\BUILDT~1\VC\Tools\MSVC\1428~1.293\bin\Hostx64\x64\link.exe /nologo @CMakeFiles\torch_python.rsp /out:bin\torch_python.dll /implib:lib\torch_python.lib /pdb:bin\torch_python.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /NODEFAULTLIB:LIBCMT.LIB -WHOLEARCHIVE:C:/actions-runner/_work/pytorch/pytorch/build/lib/onnx.lib /MANIFEST /MANIFESTFILE:bin\torch_python.dll.manifest" failed (exit code 0) with the following output:

LINK : fatal error LNK1000: Internal error during CImplib::EmitImportThunk
Access violation
ninja: build stopped: subcommand failed.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91932
Approved by: https://github.com/clee2000, https://github.com/huydhn
2023-01-10 00:21:08 +00:00
138a0188e0 Add support for logaddexp(float16) in CUDA and implement its reference (#91869)
The reference is implemented so that it generates efficient and
numerically stable triton code.
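
A minimal sketch of the newly supported case (assumes a CUDA device is available):
```python
import torch

# logaddexp(a, b) = log(exp(a) + exp(b)); a numerically stable formulation is
# max(a, b) + log1p(exp(-|a - b|)). This PR enables the float16 path on CUDA.
a = torch.randn(4, dtype=torch.float16, device="cuda")
b = torch.randn(4, dtype=torch.float16, device="cuda")
print(torch.logaddexp(a, b))
```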

Fixes https://github.com/pytorch/pytorch/issues/91683

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91869
Approved by: https://github.com/ngimel
2023-01-10 00:19:24 +00:00
df3adbd521 Pin onnx-script to a version before they bumped numpy (#91929)
Onnx PR https://github.com/microsoft/onnx-script/pull/289 expects numpy to be upgraded. That breaks pytorch's onnx builds.

For now, mitigate by pinning the script to an older version until there's a proper solution
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91929
Approved by: https://github.com/justinchuby, https://github.com/kit1980
2023-01-10 00:00:37 +00:00
0f1302eeae Refactor stack_trace preservation for node meta preservation (#90803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90803
Approved by: https://github.com/jerryzh168, https://github.com/albanD
2023-01-09 23:23:27 +00:00
1e768c63c1 Add merged label to ghstack prs (#90238)
Not very elegant.

Another option might be adding something to pytorchbot to listen for push events on master?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90238
Approved by: https://github.com/malfet, https://github.com/kit1980
2023-01-09 22:49:20 +00:00
32356aaee6 [4/N] Add test for partial training for NamedOptimizer (#91344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91344
Approved by: https://github.com/rohan-varma
2023-01-09 22:19:49 +00:00
26beb46da4 Reduce #iters to make test run always (#91837)
Summary: Reduce #iters to make test run always

Test Plan: sandcastle

Reviewed By: drisspg

Differential Revision: D42397999

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91837
Approved by: https://github.com/drisspg
2023-01-09 21:38:18 +00:00
95e3e339a8 Add log_once to fused attention kernels (#91858)
# Summary
Adding log once to track usage statistics of the fused attention kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91858
Approved by: https://github.com/cpuhrsch
2023-01-09 20:59:26 +00:00
333540a458 Reland "Add torch.utils.device_mode" (#91796)
Original PR https://github.com/pytorch/pytorch/pull/91525

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91796
Approved by: https://github.com/albanD
2023-01-09 20:57:12 +00:00
9d20d6d5ec Foreach clamp_min clamp_max (#91384)
Adds `_foreach_clamp_min` and `_foreach_clamp_max` as binary ops, with scalar, scalarlist and tensorlist support.

Timing example for `_foreach_clamp_min_` on a GTX3070Ti across a list of tensors with varying count and item size (times are in microseconds (us)):

CUDA:

```
[------------------ (tensors, scalar) -------------------]
                                   |  for loop  |  foreach
      10 tensors of size 4         |     29.0   |     10.2
      100 tensors of size 4        |    234.4   |     18.3
      1000 tensors of size 4       |   2194.1   |    113.5
      10000 tensors of size 4      |  21745.6   |   1144.5
      10 tensors of size 16        |     29.5   |     12.0
      100 tensors of size 16       |    256.9   |     19.9
      1000 tensors of size 16      |   2499.7   |    123.6
      10000 tensors of size 16     |  25022.2   |   1295.6
      10 tensors of size 256       |     32.8   |     11.2
      100 tensors of size 256      |    258.8   |     19.7
      1000 tensors of size 256     |   2509.2   |    123.7
      10000 tensors of size 256    |  25016.2   |   1295.4
      10 tensors of size 65536     |     32.9   |     18.7
      100 tensors of size 65536    |    327.1   |    150.3
      1000 tensors of size 65536   |   3051.3   |   1388.0
      10000 tensors of size 65536  |  30476.9   |  14021.5

[------------------ (tensors, tensors) ------------------]
                                   |  for loop  |  foreach
      10 tensors of size 4         |     26.8   |     17.3
      100 tensors of size 4        |    206.8   |     90.5
      1000 tensors of size 4       |   1993.0   |    828.9
      10000 tensors of size 4      |  19851.0   |   9063.3
      10 tensors of size 16        |     34.7   |     20.0
      100 tensors of size 16       |    232.2   |    102.1
      1000 tensors of size 16      |   2220.9   |    977.3
      10000 tensors of size 16     |  22644.5   |  10361.4
      10 tensors of size 256       |     30.5   |     19.7
      100 tensors of size 256      |    231.6   |    102.4
      1000 tensors of size 256     |   2251.9   |    978.7
      10000 tensors of size 256    |  22680.3   |  10405.8
      10 tensors of size 65536     |     30.6   |     34.4
      100 tensors of size 65536    |    315.1   |    223.6
      1000 tensors of size 65536   |   3252.1   |   2114.4
      10000 tensors of size 65536  |  30578.0   |  22826.3

```

CPU:
```
[------------------- (tensors, scalar) -------------------]
                                   |  for loop  |  foreach
      10 tensors of size 4         |      13.0  |       9.6
      100 tensors of size 4        |      62.4  |      31.6
      1000 tensors of size 4       |     562.2  |     245.6
      10000 tensors of size 4      |    5552.2  |    2517.7
      10 tensors of size 16        |      14.9  |      11.3
      100 tensors of size 16       |      74.1  |      36.9
      1000 tensors of size 16      |     663.7  |     285.5
      10000 tensors of size 16     |    6765.2  |    2947.5
      10 tensors of size 256       |      15.2  |      11.8
      100 tensors of size 256      |      76.0  |      37.7
      1000 tensors of size 256     |     728.8  |     323.9
      10000 tensors of size 256    |    7274.4  |    3800.3
      10 tensors of size 65536     |     105.6  |     124.5
      100 tensors of size 65536    |     982.8  |     939.7
      1000 tensors of size 65536   |   14993.1  |   14579.2
      10000 tensors of size 65536  |  163091.0  |  151555.8

[------------------- (tensors, tensors) ------------------]
                                   |  for loop  |  foreach
      10 tensors of size 4         |      11.8  |      10.5
      100 tensors of size 4        |      53.1  |      38.2
      1000 tensors of size 4       |     465.1  |     316.1
      10000 tensors of size 4      |    4616.9  |    3625.9
      10 tensors of size 16        |      13.5  |      12.3
      100 tensors of size 16       |      63.0  |      46.5
      1000 tensors of size 16      |     560.1  |     359.9
      10000 tensors of size 16     |    5586.8  |    3765.9
      10 tensors of size 256       |      15.2  |      13.7
      100 tensors of size 256      |      64.4  |      48.3
      1000 tensors of size 256     |     653.7  |     410.0
      10000 tensors of size 256    |    5916.6  |    3901.3
      10 tensors of size 65536     |     109.1  |     106.8
      100 tensors of size 65536    |    1128.9  |    1105.0
      1000 tensors of size 65536   |   16245.0  |   15950.8
      10000 tensors of size 65536  |  171111.3  |  163540.2
```

Example use:

```
tensors = [torch.randn(16, device='cuda') for _ in range(10)]

out = torch._foreach_clamp_min(tensors, 0.1)
out = torch._foreach_clamp_min(tensors, [0.1] * len(tensors))
out = torch._foreach_clamp_min(tensors, tensors)
torch._foreach_clamp_min_(tensors, 0.1)
torch._foreach_clamp_min_(tensors, [0.1] * len(tensors))
torch._foreach_clamp_min_(tensors, tensors)
```

Does not support complex types.
Changes the existing `foreach_minimum/maximum` to use this new implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91384
Approved by: https://github.com/ngimel
2023-01-09 19:28:47 +00:00
b2646dcb65 Consider updating pin to CMake version 3.23.1 on run_torchbench CI (#91739)
### Issues Affected
Fixes #74985 and #75705

### Description
Unpinned cmake initially used version 3.23.0 which broke a build https://github.com/pytorch/pytorch/issues/74985#issue-1187048138. This previously led to a necessary pin to cmake version 3.22. With the release of cmake 3.23.1, it is no longer necessary to pin cmake https://github.com/pytorch/pytorch/issues/74985#issuecomment-1102355302.

CMake unpin change has not been added to `.jenkins/pytorch/win-test-helpers/installation-helpers/install_miniconda3.bat` because Windows dependencies were refactored away https://github.com/pytorch/pytorch/pull/88862.

### How is this Tested?
This change is tested using "RUN_TORCHBENCH:" in the PR body https://github.com/pytorch/pytorch/pull/77577#issuecomment-1128048251.

### People with Relevant Context
@janeyx99

RUN_TORCHBENCH:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91739
Approved by: https://github.com/huydhn
2023-01-09 19:07:29 +00:00
d4aa807ba9 Enable bfloat16 for hardtanh_backward_cuda (#91511)
I'm not sure why this was left out in the first place as all adjacent operations have both Half and BFloat16. Things seem to work as expected and this enables `relu6` to be used in bfloat16 training. Hardtanh backward is super simple and precision is not relevant.

```
import torch
x_fp32 = torch.tensor([-1,2,4,7], requires_grad=True, dtype=torch.float32, device="cuda")
x_bf16 = torch.tensor([-1,2,4,7], requires_grad=True, dtype=torch.bfloat16, device="cuda")
torch.nn.functional.relu6(x_fp32).sum().backward()
torch.nn.functional.relu6(x_bf16).sum().backward()
assert (x_fp32.grad == x_bf16.grad).all()
```

Previously would fail with:
```
Traceback (most recent call last):
  File "test_hardtanh_patch.py", line 5, in <module>
    torch.nn.functional.relu6(x_bf16).sum().backward()
  File ".../lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File ".../lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: "hardtanh_backward_cuda" not implemented for 'BFloat16'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91511
Approved by: https://github.com/ngimel
2023-01-09 18:50:28 +00:00
630ef6c711 Fix Dynamo+DDP documentation (#91832)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91832
Approved by: https://github.com/soumith, https://github.com/davidberard98
2023-01-09 17:35:49 +00:00
e67f5ab6cc Print and zip remaining test logs (#91510)
When CI times out or gets cancelled, the code that prints and deletes logs for currently running tests doesn't get run, which makes it hard to debug what's going on. So print the logs in a new step and also zip them into the usage-log zip (which should probably get a name change at some point).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91510
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/ZainRizvi
2023-01-09 17:31:36 +00:00
00e5f3a9c5 [primTorch] Move logsumexp decomp to refs (#91860)
Fixes #91843.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91860
Approved by: https://github.com/lezcano
2023-01-09 17:00:43 +00:00
84266ae670 Revert "Fix dynamo handling for tensor attributes: T, H, mT, mH (#90463)"
This reverts commit 9945a78a94bd9907c05b102984c7233faa44ad14.

Reverted https://github.com/pytorch/pytorch/pull/90463 on behalf of https://github.com/ZainRizvi due to This is causing test failures: FAILED inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_pinv_singular_cuda_float64 - RuntimeError: unexpected success linalg.pinv.singular, torch.float64, cuda
2023-01-09 16:43:36 +00:00
f6c7cf1bf5 Revert "Torch package support in dynamo (#91821)"
This reverts commit eeb3e49ed46803dc5d62b306df128b66db14f901.

Reverted https://github.com/pytorch/pytorch/pull/91821 on behalf of https://github.com/malfet due to According to minihud broke misc tests, see eeb3e49ed4
2023-01-09 14:39:14 +00:00
39524f20de [functorch] excise remaining functorch imports from examples (#91282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91282
Approved by: https://github.com/zou3519
2023-01-09 14:35:21 +00:00
071756c9cf [functorch] rewrite examples that use make_functional to use functional_call (#88851)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88851
Approved by: https://github.com/zou3519
2023-01-09 14:35:21 +00:00
0ec3c5bc72 [MPS] Reduce ops multi axes support (#91734)
Currently, most of the reduction ops are flattening the input tensor to 1D to perform the operation.
This change removes the flattening of the tensors / the unranked placeholders and adds support for multiple axes in all the reduction ops.

- Fixes reduction ops with correctness and shape issues.
- Fixes masked.argmax / masked.argmin. When inf is passed to argmax / argmin, MPS will return nan as the index for these values. Casting this nan to Long makes it -1. This change avoids negative values by clamping them to 0 (matching CPU results).

TestConsistency issues fixed:
```
std
var
amax
amin
sum
prod
mean
count_nonzero
masked.amax
masked.amin
masked.mean
masked.prod
masked.std
masked.sum
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91734
Approved by: https://github.com/kulinseth
2023-01-09 10:55:11 +00:00
fd213c3231 Match get_attr when compare node (#91657)
The pattern can't be matched if one attribute is `_param_constant1` and the other is `_param_constant0`

Large graph:
```
        # call_function  addmm_default      aten.addmm.default  (_param_constant1, ph_0, _tensor_constant0)  {}
```

Pattern graph
```
        # call_function  addmm_default      aten.addmm.default  (_param_constant0, ph_0, _tensor_constant0)  {}
```

Differential Revision: [D42316574](https://our.internmc.facebook.com/intern/diff/D42316574/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91657
Approved by: https://github.com/SherlockNoMad
2023-01-09 08:10:55 +00:00
fe80f190df use context manager for path extension in torch.hub (#75786)
We are using the idiom

```py
sys.path.insert(0, path)

# do something

sys.path.remove(path)
```

three times in `torch.hub`. This is a textbook case for using a context manager. In addition, by using `try` / `finally` we can ensure that the Python path is restored to its original state even if the actual action raises an exception:

```py
import sys

path = "/tmp"

# PR
try:
    sys.path.insert(0, path)
    try:
        # Any exception raised while performing the actual functionality
        raise Exception
    finally:
        sys.path.remove(path)
except Exception:
    assert path not in sys.path

# main
try:
    sys.path.insert(0, path)

    # Any exception raised while performing the actual functionality
    raise Exception

    sys.path.remove(path)
except Exception:
    assert path in sys.path
```
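
A minimal sketch of the context manager pattern this PR adopts (the helper name below is an assumption, not necessarily the one used in torch.hub):
```python
import contextlib
import sys

@contextlib.contextmanager
def _extend_sys_path(path):
    # Temporarily prepend `path` to sys.path; the finally block guarantees
    # removal even if the body raises.
    sys.path.insert(0, path)
    try:
        yield
    finally:
        sys.path.remove(path)

with _extend_sys_path("/tmp"):
    pass  # import the hub module here
```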

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75786
Approved by: https://github.com/NicolasHug
2023-01-09 07:08:35 +00:00
d85f3c8237 Revert "fix norrow_copy correctness issue for non-contiguous input for cpu path (#91789)"
This reverts commit 136dadd689981a334985f2029f6d3e747c36da5c.

Reverted https://github.com/pytorch/pytorch/pull/91789 on behalf of https://github.com/huydhn due to This breaks trunk with XPASS test_vmap_exhaustive_narrow_copy_cpu_float32 136dadd689
2023-01-09 06:50:20 +00:00
9b415240d4 Revert "Reland "Add torch.utils.device_mode" (#91796)"
This reverts commit 81b5eff3c383f5308416e129861a2689d717702c.

Reverted https://github.com/pytorch/pytorch/pull/91796 on behalf of https://github.com/huydhn due to This breaks trunk with the following failed test https://hud.pytorch.org/failure/test_jit_save%2CTestTracer
2023-01-09 04:45:47 +00:00
9945a78a94 Fix dynamo handling for tensor attributes: T, H, mT, mH (#90463)
Fixes https://github.com/pytorch/pytorch/issues/88843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90463
Approved by: https://github.com/ngimel
2023-01-09 04:11:23 +00:00
3643b4ee4a fix sort crash when the input is expanded scalar (#91752)
fix https://github.com/pytorch/pytorch/issues/91420
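
A hypothetical repro of the fixed case (the exact shapes from the issue may differ): sorting a tensor produced by expanding a scalar, which has zero strides everywhere.
```python
import torch

x = torch.tensor(1.0).expand(3, 4)      # expanded scalar: all strides are 0
values, indices = torch.sort(x, dim=1)  # previously could crash for such inputs (per the linked issue)
print(values, indices)
```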

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91752
Approved by: https://github.com/ezyang
2023-01-09 02:02:56 +00:00
136dadd689 fix norrow_copy correctness issue for non-contiguous input for cpu path (#91789)
Fix https://github.com/pytorch/pytorch/issues/91690.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91789
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-01-09 00:55:03 +00:00
8cec433cf2 Apply clang-tidy fixes to api/csrc/api/include/torch/nn (#91766)
Split off from #91559

Add move operations to missing shims / helper methods in torch/nn/functional
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91766
Approved by: https://github.com/soumith
2023-01-08 23:39:15 +00:00
f59845db40 Symintify pytorch slicing logic (#91340)
Differential Revision: [D42398023](https://our.internmc.facebook.com/intern/diff/D42398023)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91340
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-01-08 22:51:42 +00:00
81b5eff3c3 Reland "Add torch.utils.device_mode" (#91796)
Original PR https://github.com/pytorch/pytorch/pull/91525

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91796
Approved by: https://github.com/albanD
2023-01-08 03:44:56 +00:00
eeb3e49ed4 Torch package support in dynamo (#91821)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91821
Approved by: https://github.com/suo
2023-01-08 01:46:24 +00:00
73e5379fab Apply clang-tidy perf fixes to aten (#91772)
Mostly just automated fixes to get rid of implicit copies. I also fixed one clang-tidy NOLINT comment that was in the wrong spot. Split off from #91559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91772
Approved by: https://github.com/soumith
2023-01-07 21:15:43 +00:00
2c00064113 remove unnecessary decomps (#91828)
in favor of refs. Generated triton code is the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91828
Approved by: https://github.com/lezcano, https://github.com/soumith
2023-01-07 20:37:12 +00:00
e3ed55d483 [ONNX] Add aten::zero support (#91731)
Fixes #90268

When we use `tensor.zero_()` on an in-place slice, it actually uses `aten::zero` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91731
Approved by: https://github.com/BowenBao
2023-01-07 11:07:54 +00:00
0c1777acec Dynamo benchmark: add CPU specific changes (#88477)
This PR adds some CPU-specific changes:

- Add support for IPEX backend
- https://github.com/pytorch/torchdynamo/issues/1618
- https://github.com/pytorch/torchdynamo/issues/1534
- Enable CPU launcher in runner.py.
- Fix the issue that some environment variables are not supported on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88477
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-07 09:26:06 +00:00
75c652821c Assert valid base source for derivative sources (#91711)
We should not allow creating a derived source (e.g. AttrSource), without a valid base source.

It's more reliable to check this in the source `__init__` or `__post_init__` than asserting we have a valid source before passing that to an AttrSource() call.
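
A minimal sketch of the idea under assumed names (not dynamo's exact classes): validate the base once at construction instead of at every call site.
```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class AttrSource:
    base: Any
    member: str

    def __post_init__(self):
        # Reject derived sources built on a missing base up front.
        assert self.base is not None, "AttrSource requires a valid base source"
```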

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91711
Approved by: https://github.com/voznesenskym
2023-01-07 00:51:55 +00:00
edaba335b9 [primTorch] Use torch.fill to implement prims.fill (#91747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91747
Approved by: https://github.com/mruberry
2023-01-07 00:49:11 +00:00
faed4db497 [CI][ROCm] prune all stopped containers (#91815)
After #91740, stopped containers remained and consumed disk space. Avoid "no space left on device" errors by removing all stopped containers any time we stop them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91815
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/huydhn
2023-01-07 00:41:59 +00:00
b32b81a0c5 Make torch.split take symint as arg (#91724)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91724
Approved by: https://github.com/voznesenskym
2023-01-07 00:00:03 +00:00
08a378a286 Revert "[ONNX] Add aten::zero support (#91731)"
This reverts commit ff23508c0d491553dc8eea85fb45f49de52ca41f.

Reverted https://github.com/pytorch/pytorch/pull/91731 on behalf of https://github.com/clee2000 due to failing test_correct_module_names ff23508c0d https://github.com/pytorch/pytorch/actions/runs/3859079162/jobs/6578419644
2023-01-06 23:57:57 +00:00
a2c5efaf0f Un fold squeeze permute (#91656)
Fixes #91505

Hey, this should partially fix some of the problems discussed in the issue above. If I'm on the right track, I'll update this PR with more fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91656
Approved by: https://github.com/ngimel
2023-01-06 23:55:38 +00:00
5fabd96f3c [PT-D][3/N] Add FSDP hook with Named Optimizer (#91321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91321
Approved by: https://github.com/fegin
2023-01-06 23:51:33 +00:00
acab0edfab [ROCm] fix hipify mapping for cuDeviceGet (#90726)
The mapping was incorrect, but only certain downstream pytorch extensions found this issue.  pytorch CI does not cover this mapping.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90726
Approved by: https://github.com/pruthvistony, https://github.com/atalman
2023-01-06 22:57:44 +00:00
53ef96faae [MPS] Add support for randperm (#91708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91708
Approved by: https://github.com/kulinseth
2023-01-06 22:49:06 +00:00
ff23508c0d [ONNX] Add aten::zero support (#91731)
Fixes #90268

When we use `tensor.zero_()` on an in-place slice, it actually uses `aten::zero` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91731
Approved by: https://github.com/BowenBao
2023-01-06 22:48:54 +00:00
766ebf4441 Remove hard numpy dependency introduced by inductor (#90796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90796
Approved by: https://github.com/ngimel, https://github.com/cpuhrsch
2023-01-06 22:36:38 +00:00
7cd951c21e Properly guard all numpy usage within dynamo and remove UnspecializedNumpyVariable (#90795)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90795
Approved by: https://github.com/ngimel, https://github.com/cpuhrsch
2023-01-06 22:36:38 +00:00
f44946289b [CI][ROCm] fix device visibility, again (#91813)
The previous PR #91137 was incomplete.  Though it successfully queried for the number of available GPUs, it still resulted in test files sharing the same GPU.  This PR lifts the maxtasksperchild=1 restriction so that Pool workers will always use the same GPU.  This also adds a Note in run_test.py for future reference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91813
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/malfet
2023-01-06 22:19:07 +00:00
4f1f14e38b [JIT] Skip builtins while enumerating class methods (#91805)
This is needed to support `enum.Enum`-derived classes in Python 3.11,
which adds `_new_member_` to the classdict; see:
15c44789bb/Lib/enum.py (L529)

The following snippet illustrates the problem with the previous iteration of
the code on 3.11:
```python
from enum import Enum
import inspect

class Color(Enum):
    RED = 1
    GREEN = 2

def print_routines(cls):
    print(cls.__name__)
    for name in cls.__dict__:
        fn = getattr(cls, name)
        if inspect.isroutine(fn):
            print(name, fn, f"has_globals: {hasattr(fn, '__globals__')}")

print_routines(Color)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91805
Approved by: https://github.com/albanD, https://github.com/suo
2023-01-06 21:45:09 +00:00
69acc34083 Automatically convert real tensors to fake in dynamo export (#91742)
Summary: We don't care about params/buffers being mutated in dynamo export, so it is safe to always convert them to faketensor

Test Plan: CI

Differential Revision: D42353789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91742
Approved by: https://github.com/qihqi
2023-01-06 21:34:31 +00:00
ef495b7d64 make sure mutated args are iterated in the same order (#91792)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91792
Approved by: https://github.com/soumith
2023-01-06 20:46:07 +00:00
b3603f8129 Revert "Deduplicate c10 error and PyTorchError hierarchy (#87855)"
This reverts commit 34f2d3e6ae56744c20c2f859f97101dff291bbbc.

Reverted https://github.com/pytorch/pytorch/pull/87855 on behalf of https://github.com/osalpekar due to perf regression in quantization tests
2023-01-06 19:56:35 +00:00
f219970990 Return empty attention weights when need_atten_weights = False (#91782)
# Summary
This PR updates the second return value from SDPA to be an empty tensor of size 0, rather than what it would be if need_attn_weights is True. It also updates the meta function to account for this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91782
Approved by: https://github.com/cpuhrsch
2023-01-06 19:06:48 +00:00
f77a9a585c Add shape function for movedim op (#91696)
Signed-Off By: Vivek Khandelwal<vivek@nod-labs.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91696
Approved by: https://github.com/davidberard98
2023-01-06 18:24:52 +00:00
f556c5b979 Revert "[dynamo] Support dynamic slicing (#91341)"
This reverts commit 8e7dcd140ace26a7e3096a26fbeec9f572e9aaa7.

Reverted https://github.com/pytorch/pytorch/pull/91341 on behalf of https://github.com/clee2000 due to breaking various tests 8e7dcd140a https://github.com/pytorch/pytorch/actions/runs/3856936505/jobs/6574089745 marking this as weird because it was merged via codev?
2023-01-06 18:09:21 +00:00
f4b3b577d8 Docs push fix .netrc sometimes a directory (#91745)
Sometimes .netrc is a directory even though it's in the temp folder.

AFAIK there's nothing in the folder https://github.com/pytorch/pytorch/actions/runs/3842987245/jobs/6544919416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91745
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2023-01-06 17:46:02 +00:00
87164ace51 [MPS] Fix the ChannelsLast memory format in cat_out_mps() (#91786)
- Fixed the memory leak with the `malloc()`
- Introduced shortened data type strings (optional) to avoid getting extra long cached graph string keys with ops such as cat_out()
- Fixed data type issues in Monterey
- Removed the unused `use_scalar_value` argument from `getTensorsStringKey()`
- Clean up and refactoring

Fixes #89353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91786
Approved by: https://github.com/kulinseth
2023-01-06 17:28:49 +00:00
eeba9d5ab4 Preserve node's meta during fx.transformation (#90737)
We wish to preserve node.meta over fx.Transformer transformation and aot_autograd. This will preserve all the meta fields in the original node, including stack_trace, nn_module_stack, val, tensor_meta...

Sample

Here's a graph produced by Dynamo.
```
class GraphModule(torch.nn.Module):
    def forward(self, x : torch.Tensor, y : torch.Tensor):
        # File: /scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py:35, code: a = torch.cos(x)
        cos = torch.cos(x);  x = None

        # File: /scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py:36, code: b = torch.sin(y)
        sin = torch.sin(y);  y = None

        # File: /scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py:37, code: return a + b
        add = cos + sin;  cos = sin = None
        return (add,)

x {'creation_timestamp': 0, 'stack_trace': '  File "/scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py", line 45, in forward\n    def forward(self, x, y):\n'}
y {'creation_timestamp': 0, 'stack_trace': '  File "/scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py", line 45, in forward\n    def forward(self, x, y):\n'}
cos {'creation_timestamp': 3, 'nn_module_stack': {'self_block': "<class '__main__.Block'>"}, 'stack_trace': '  File "/scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py", line 35, in forward\n    a = torch.cos(x)\n |   File "/scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py", line 46, in forward\n    return self.block(x, y)\n'}
sin {'creation_timestamp': 4, 'nn_module_stack': {'self_block': "<class '__main__.Block'>"}, 'stack_trace': '  File "/scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py", line 36, in forward\n    b = torch.sin(y)\n |   File "/scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py", line 46, in forward\n    return self.block(x, y)\n'}
add {'creation_timestamp': 4, 'nn_module_stack': {'self_block': "<class '__main__.Block'>"}, 'stack_trace': '  File "/scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py", line 37, in forward\n    return a + b\n |   File "/scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py", line 46, in forward\n    return self.block(x, y)\n'}
output {'creation_timestamp': 4}
```

After lowering to aten graph with aot_autograd_simplified()
```
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: f32[2, 3], primals_2: f32[2, 3]):
        # File: /scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py:35, code: a = torch.cos(x)
        cos: f32[2, 3] = torch.ops.aten.cos.default(primals_1)

        # File: /scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py:36, code: b = torch.sin(y)
        sin: f32[2, 3] = torch.ops.aten.sin.default(primals_2)

        # File: /scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py:37, code: return a + b
        add: f32[2, 3] = torch.ops.aten.add.Tensor(cos, sin);  cos = sin = None
        return [add, primals_2, primals_1]

primals_1 {'val': FakeTensor(FakeTensor(..., device='meta', size=(2, 3)), cpu), 'tensor_meta': TensorMetadata(shape=torch.Size([2, 3]), dtype=torch.float32, requires_grad=True, stride=(3, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})}
primals_2 {'val': FakeTensor(FakeTensor(..., device='meta', size=(2, 3)), cpu), 'tensor_meta': TensorMetadata(shape=torch.Size([2, 3]), dtype=torch.float32, requires_grad=True, stride=(3, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})}
cos {'creation_timestamp': 3, 'nn_module_stack': {'self_block': "<class '__main__.Block'>"}, 'stack_trace': '  File "/scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py", line 35, in forward\n    a = torch.cos(x)\n |   File "/scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py", line 46, in forward\n    return self.block(x, y)\n', 'val': FakeTensor(FakeTensor(..., device='meta', size=(2, 3)), cpu), 'tensor_meta': TensorMetadata(shape=torch.Size([2, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})}
sin {'creation_timestamp': 4, 'nn_module_stack': {'self_block': "<class '__main__.Block'>"}, 'stack_trace': '  File "/scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py", line 36, in forward\n    b = torch.sin(y)\n |   File "/scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py", line 46, in forward\n    return self.block(x, y)\n', 'val': FakeTensor(FakeTensor(..., device='meta', size=(2, 3)), cpu), 'tensor_meta': TensorMetadata(shape=torch.Size([2, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})}
add {'creation_timestamp': 4, 'nn_module_stack': {'self_block': "<class '__main__.Block'>"}, 'stack_trace': '  File "/scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py", line 37, in forward\n    return a + b\n |   File "/scratch/bahuang/work/repos/pytorch/temp/dynamo_aotautograd_demo.py", line 46, in forward\n    return self.block(x, y)\n', 'val': FakeTensor(FakeTensor(..., device='meta', size=(2, 3)), cpu), 'tensor_meta': TensorMetadata(shape=torch.Size([2, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})}
output {}
```

Notice that the output fx nodes have creation_timestamp, nn_module_stack and stack_trace copied from the original fx nodes.
val and tensor_meta were later populated by a subsequent fake_tensor_propagation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90737
Approved by: https://github.com/jerryzh168
2023-01-06 17:21:02 +00:00
8e7dcd140a [dynamo] Support dynamic slicing (#91341)
Differential Revision: [D42223259](https://our.internmc.facebook.com/intern/diff/D42223259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91341
Approved by: https://github.com/voznesenskym
2023-01-06 16:52:12 +00:00
de99bc39e8 [MPS] Remap the view ops to exisiting graph APIs. (#89436)
This helps in performance by avoiding the generic gather/scatter graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89436
Approved by: https://github.com/razarmehr
2023-01-06 16:02:25 +00:00
2354ff5fab [functorch] test: try using reference_inputs in vmap tests (#91355)
Ref https://github.com/pytorch/functorch/issues/1090

Timings:

`test_vmap_exhaustive`

After PR
```
== 1168 passed, 55 skipped, 2353 deselected, 153 xfailed in 195.07s (0:03:15) ==
```

Before PR
```
== 1134 passed, 55 skipped, 2316 deselected, 150 xfailed in 77.18s (0:01:17) ==
```

`test_op_has_batch_rule`

After PR
```
== 988 passed, 57 skipped, 2353 deselected, 331 xfailed in 144.70s (0:02:24) ==
```

Before PR
```
== 969 passed, 57 skipped, 2316 deselected, 313 xfailed in 65.86s (0:01:05) ==
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91355
Approved by: https://github.com/zou3519
2023-01-06 15:00:36 +00:00
eb8547e939 Add a NestedTensor Readme (#91472)
# Summary
This PR adds a NestedTensor Readme which explains the code structure and will hopefully serve as a reference point for new contributors, especially if they would like to implement a NestedTensor kernel implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91472
Approved by: https://github.com/mikaylagawarecki, https://github.com/cpuhrsch
2023-01-06 14:44:55 +00:00
859ac58c54 [Inductor] Support loop split at given depth in CPP codegen (#91397)
This PR refactors the loop-related data structures to support loop splits at a given depth. Before this PR, a loop split was only supported at the inner-most level. With this PR, it is possible to support tiling at outer levels and at more than one level. The `LoopNest` data structure is extended to support loop splits at various levels and renamed to `LoopNestWithSplit`. The `codegen_loops` function is also rewritten to be general enough to support arbitrary kernels set at the leaves of the loop structure.

This PR also improves the handling of reduction loops with a split. The main loop and the tail loop now work on their own reduction variables in parallel, without the data dependency they previously had. With this, two workarounds can be removed in the `CppVecKernel`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91397
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-01-06 12:53:46 +00:00
2555971b76 [inductor] fix output_stride of cat (#91233)
When the inputs to the ConcatKernel come from both ExternKernel and Loops, the output format of Loops might still be a FlexibleLayout (with contiguous strides). When deciding the output stride of the ConcatKernel, the Loops output was wrongly assumed to be contiguous, so the output format of the ConcatKernel was set to be contiguous.

In this PR, we propose the below heuristics to decide the output of the ConcatKernel:
If any of the inputs to ConcatKernel is a FixedLayout and is in the channels last format, we set the output of the ConcatKernel to the channels last format as well.
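
A hedged repro sketch of the scenario (output shapes taken from the kernels below; the conv input channels and 1x1 kernel here are assumptions for illustration):

```python
import torch

# Concatenate a channels-last conv output (1, 5, 16, 16) with a channels-last
# tensor (1, 64, 16, 16). With this PR, the inductor-generated ConcatKernel
# keeps channels-last output strides instead of assuming a contiguous layout.
def f(x, w, b, y):
    conv = torch.nn.functional.conv2d(x, w, b)
    return torch.cat([conv, y], dim=1)

x = torch.randn(1, 3, 16, 16).to(memory_format=torch.channels_last)
w = torch.randn(5, 3, 1, 1).to(memory_format=torch.channels_last)
b = torch.randn(5)
y = torch.randn(1, 64, 16, 16).to(memory_format=torch.channels_last)

out = torch.compile(f)(x, w, b, y)
print(out.shape, out.stride())
```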

### Before
```python
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_chunyuan/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1)
{
    #pragma omp parallel num_threads(56)
    {
        #pragma omp for  collapse(2)
        for(long i0=0; i0<5; i0+=1)
        {
            for(long i1=0; i1<256; i1+=1)
            {
                {
                    {
                        auto tmp0 = in_ptr0[i0 + (5*i1)];
                        out_ptr0[i1 + (256*i0)] = tmp0;
                    }
                }
            }
        }
        #pragma omp for  collapse(2)
        for(long i0=0; i0<64; i0+=1)
        {
            for(long i1=0; i1<16; i1+=1)
            {
                #pragma GCC ivdep
                for(long i2=0; i2<16; i2+=1)
                {
                    {
                        {
                            auto tmp0 = in_ptr1[i0 + (128*i2) + (4096*i1)];
                            auto tmp1 = in_ptr1[64 + i0 + (128*i2) + (4096*i1)];
                            auto tmp3 = in_ptr1[2048 + i0 + (128*i2) + (4096*i1)];
                            auto tmp5 = in_ptr1[2112 + i0 + (128*i2) + (4096*i1)];
                            auto tmp2 = (tmp0 != tmp0) ? tmp0 : std::max(tmp1, tmp0);
                            auto tmp4 = (tmp2 != tmp2) ? tmp2 : std::max(tmp3, tmp2);
                            auto tmp6 = (tmp4 != tmp4) ? tmp4 : std::max(tmp5, tmp4);
                            out_ptr1[i2 + (16*i1) + (256*i0)] = tmp6;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    primals_1, primals_2, primals_3, primals_4 = args
    args.clear()
    buf0 = aten.convolution(primals_3, primals_1, primals_2, (1, 1), (0, 0), (1, 1), False, (0, 0), 1)
    assert_size_stride(buf0, (1, 5, 16, 16), (1280, 1, 80, 5))
    del primals_2
    buf3 = empty_strided((1, 69, 16, 16), (17664, 256, 16, 1), device='cpu', dtype=torch.float32)
    buf1 = as_strided(buf3, (1, 5, 16, 16), (17664, 256, 16, 1))  # alias
    buf2 = as_strided(buf3, (1, 64, 16, 16), (17664, 256, 16, 1), 1280)  # alias
    kernel_cpp_0(c_void_p(buf0.data_ptr()), c_void_p(primals_4.data_ptr()), c_void_p(buf1.data_ptr()), c_void_p(buf2.data_ptr()))
    del buf0
    del primals_4
    return (buf3, primals_1, primals_3, )
```

### After
```python
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_chunyuan/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1)
{
    #pragma omp parallel num_threads(56)
    {
        #pragma omp for
        for(long i0=0; i0<256; i0+=1)
        {
            #pragma GCC ivdep
            for(long i1=0; i1<5; i1+=1)
            {
                {
                    {
                        auto tmp0 = in_ptr0[i1 + (5*i0)];
                        out_ptr0[i1 + (69*i0)] = tmp0;
                    }
                }
            }
        }
        #pragma omp for  collapse(2)
        for(long i0=0; i0<16; i0+=1)
        {
            for(long i1=0; i1<16; i1+=1)
            {
                for(long i2=0; i2<4; i2+=1)
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + (16*i2) + (128*i1) + (4096*i0));
                    auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 64 + (16*i2) + (128*i1) + (4096*i0));
                    auto tmp3 = at::vec::Vectorized<float>::loadu(in_ptr1 + 2048 + (16*i2) + (128*i1) + (4096*i0));
                    auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr1 + 2112 + (16*i2) + (128*i1) + (4096*i0));
                    auto tmp2 = at::vec::maximum(tmp1, tmp0);
                    auto tmp4 = at::vec::maximum(tmp3, tmp2);
                    auto tmp6 = at::vec::maximum(tmp5, tmp4);
                    tmp6.store(out_ptr1 + (16*i2) + (69*i1) + (1104*i0));
                }
                #pragma omp simd simdlen(8)
                for(long i2=64; i2<64; i2+=1)
                {
                    auto tmp0 = in_ptr1[i2 + (128*i1) + (4096*i0)];
                    auto tmp1 = in_ptr1[64 + i2 + (128*i1) + (4096*i0)];
                    auto tmp3 = in_ptr1[2048 + i2 + (128*i1) + (4096*i0)];
                    auto tmp5 = in_ptr1[2112 + i2 + (128*i1) + (4096*i0)];
                    auto tmp2 = (tmp0 != tmp0) ? tmp0 : std::max(tmp1, tmp0);
                    auto tmp4 = (tmp2 != tmp2) ? tmp2 : std::max(tmp3, tmp2);
                    auto tmp6 = (tmp4 != tmp4) ? tmp4 : std::max(tmp5, tmp4);
                    out_ptr1[i2 + (69*i1) + (1104*i0)] = tmp6;
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    primals_1, primals_2, primals_3, primals_4 = args
    args.clear()
    buf0 = aten.convolution(primals_3, primals_1, primals_2, (1, 1), (0, 0), (1, 1), False, (0, 0), 1)
    assert_size_stride(buf0, (1, 5, 16, 16), (1280, 1, 80, 5))
    del primals_2
    buf3 = empty_strided((1, 69, 16, 16), (17664, 1, 1104, 69), device='cpu', dtype=torch.float32)
    buf1 = as_strided(buf3, (1, 5, 16, 16), (17664, 1, 1104, 69))  # alias
    buf2 = as_strided(buf3, (1, 64, 16, 16), (17664, 1, 1104, 69), 5)  # alias
    kernel_cpp_0(c_void_p(buf0.data_ptr()), c_void_p(primals_4.data_ptr()), c_void_p(buf1.data_ptr()), c_void_p(buf2.data_ptr()))
    del buf0
    del primals_4
    return (buf3, primals_1, primals_3, )

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91233
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-01-06 10:44:09 +00:00
c99a2a43ad [inductor] decompose tanh in CPP backend (#91687)
## Description
The decomposition of `tanh` has been removed in https://github.com/pytorch/pytorch/pull/90889.
```python
@register_decomposition([aten.tanh])
def tanh(x):
    return 2.0 / (1.0 + torch.exp(-2.0 * x)) - 1.0
```
We've observed a performance regression on CPU for `lennard_jones` in the TorchBench suite.
This PR decomposes `tanh` in the CPP backend to fix the regression.

### Performance

- Model: lennard_jones
- Machine: IceLake (32 cores per socket)
- Configuration: single instance, 32 cores per instance
- jemalloc and iomp enabled

```bash
python benchmarks/dynamo/torchbench.py  --inductor-settings --inductor --performance --float32 -dcpu -n500  --no-skip --dashboard --only=lennard_jones --quiet
```

Time before regression | Time after regression | Time with this PR
-- | -- | --
0.000262036 | 0.0003618 | 0.000267888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91687
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-01-06 10:05:36 +00:00
ad70a70171 Revert "[functorch] test: try using reference_inputs in vmap tests (#91355)"
This reverts commit a51090d4b14610d72a8e22209a7d69b5a90bf45d.

Reverted https://github.com/pytorch/pytorch/pull/91355 on behalf of https://github.com/kshitij12345 due to Broke trunk
2023-01-06 09:57:21 +00:00
a51090d4b1 [functorch] test: try using reference_inputs in vmap tests (#91355)
Ref https://github.com/pytorch/functorch/issues/1090

Timings:

`test_vmap_exhaustive`

After PR
```
== 1168 passed, 55 skipped, 2353 deselected, 153 xfailed in 195.07s (0:03:15) ==
```

Before PR
```
== 1134 passed, 55 skipped, 2316 deselected, 150 xfailed in 77.18s (0:01:17) ==
```

`test_op_has_batch_rule`

After PR
```
== 988 passed, 57 skipped, 2353 deselected, 331 xfailed in 144.70s (0:02:24) ==
```

Before PR
```
== 969 passed, 57 skipped, 2316 deselected, 313 xfailed in 65.86s (0:01:05) ==
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91355
Approved by: https://github.com/zou3519
2023-01-06 08:16:11 +00:00
d0a4e2e782 Don't remove files across the whole OS on clean (#91503)
setup.py clean now won't remove paths matching .gitignore patterns across the entire OS. Instead, now only files from the repository will be removed.

`/build_*` had to be removed from .gitignore because with the wildcard fixed, build_variables.bzl file was deleted on cleanup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91503
Approved by: https://github.com/soumith
2023-01-06 05:13:51 +00:00
e3bd38d224 [DTensor] fix test_device_mesh failure on GPU (#91783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91783
Approved by: https://github.com/wanchaol
2023-01-06 04:22:09 +00:00
66745831d7 [ONNX] Support constant 'aten::__contains__' (#91660)
#84624 introduces an update to the `torch.norm` [dispatch logic](eaa43d9f25/torch/functional.py (L1489)), which now depends on `layout`, resulting in regressions when exporting related operators from TorchScript.

This PR resolves the regression by partially supporting a subset use case of the `prim::layout` (only `torch.strided`) and `aten::__contains__` (only constants) operators. Properly supporting other layouts, e.g. `torch.sparse_coo`, would require much more effort: extending JIT types and supporting the related family of ops like `aten::to_sparse`. This is out of the scope of this PR.

Fixes #83661
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91660
Approved by: https://github.com/justinchuby, https://github.com/kit1980
2023-01-06 01:39:32 +00:00
2f0e4839ee [MPS] Fix correctness issues with Pooling ops (#91519)
- Workaround for MaxPool when ceilMode=true
- Workaround for ChannelsLast memory format
- Workaround for divisor_override in AvgPool ops
- Enabled count_include_pad parameter for AvgPool
- Refactoring and clean up of duplicate code
- Enable MaxPool tests in TestConsistency
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91519
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-01-06 01:35:46 +00:00
33547bb587 inductor: Move graph.lint() in Intel's FX Passes to the End of Loop to Reduce Compile Time(part 2) (#91677)
As https://github.com/pytorch/pytorch/pull/91179 to Reduce Compile Time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91677
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-01-06 01:26:16 +00:00
25ff10caa7 inductor:enable conv+unary fusion for torch unary function (#91609)
This PR enables conv+unary fusion when the unary op is a torch function; it improves timm model performance significantly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91609
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-01-06 01:23:35 +00:00
2175c9414e [cpu] implement erf based on oneDNN algorithm for aten::Vec (#91613)
ATen's `erf` implementation invokes an `MKL` function that performs better than TorchInductor's current `erf` implementation, which calls a `sleef` function through `aten::Vec`. The benefit comes from the algorithm: `sleef` uses a Taylor expansion that is more precise than `MKL`'s approximation and therefore takes longer. As the implementations of `erf` in `oneDNN` and `MKL` are similar, we implement the algorithm of `erf` in `aten::Vec` based on the `oneDNN` algorithm.

Performance data for eager vs. inductor:
`gelu` also benefits from this modification since it uses `erf`.

suite | op_name | improved_ratio_speedup0.2 | improved_ratio_speedup0.5 | improved_ratio_speedup0.8 | speedup_old_0.2 | RSD(3) | speedup_old_0.5 | RSD(3) | speedup_old_0.8 | RSD(3) | speedup_new_0.2 | RSD(3) | speedup_new_0.5 | RSD(3) | speedup_new_0.8 | RSD(3)
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
torchbench | aten.erf.default | 138.54% | 138.54% | 138.54% | 0.402057897 | 13.54% | 0.402057897 | 13.54% | 0.402057897 | 13.54% | 0.959050302 | 4.21% | 0.959050302 | 4.21% | 0.959050302 | 4.21%
torchbench | aten.gelu.default | 196.94% | 16.28% | 3.28% | 0.303611506 | 0.88% | 0.865411422 | 0.23% | 0.984732108 | 0.15% | 0.901534389 | 1.04% | 1.006314977 | 0.10% | 1.017019831 | 0.37%
huggingface | aten.gelu.default | 178.90% | 153.93% | 22.70% | 0.324031619 | 8.16% | 0.40085369 | 1.67% | 0.839170801 | 1.30% | 0.90371451 | 2.25% | 1.017872459 | 0.47% | 1.029638829 | 0.49%
timm | aten.gelu.default | 12.76% | 3.01% | 1.98% | 0.892005539 | 0.22% | 0.979783341 | 0.16% | 0.998917466 | 0.08% | 1.005821648 | 0.11% | 1.009227094 | 0.07% | 1.018701655 | 0.30%
torchbench | aten.gelu_backward.default | 124.25% | 53.19% | 5.96% | 0.437150835 | 6.11% | 0.664341696 | 0.24% | 0.983091818 | 2.49% | 0.980304388 | 1.86% | 1.017688734 | 0.33% | 1.041684409 | 0.74%
huggingface | aten.gelu_backward.default | 126.26% | 32.55% | 11.61% | 0.446699743 | 0.34% | 0.781550075 | 0.73% | 0.989682073 | 0.28% | 1.010687581 | 1.31% | 1.035929929 | 1.11% | 1.104549968 | 2.68%
timm | aten.gelu_backward.default | 5.65% | 1.79% | 2.58% | 0.955116562 | 0.40% | 0.99782989 | 0.18% | 1.002408412 | 0.13% | 1.00905163 | 0.07% | 1.015649447 | 0.26% | 1.028238613 | 0.24%
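
As a hedged sanity check (a minimal sketch, not the PR's benchmark harness), the accuracy of a vectorized `erf` path like this can be compared against a double-precision reference:

```python
import math
import torch

# Compare float32 torch.erf against a float64 reference computed with math.erf.
x = torch.linspace(-4, 4, steps=10001)
ref = torch.tensor([math.erf(v) for v in x.tolist()], dtype=torch.float64)
err = (torch.erf(x).double() - ref).abs().max().item()
print(f"max abs error: {err:.2e}")
```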

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91613
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/EikanWang, https://github.com/desertfire
2023-01-06 01:20:49 +00:00
745dc3a13c [inductor] optimize lowering for empty-related operators (#91350)
In micro-benchmarks, `new_empty_strided` and `new_empty` perform poorly with inductor compared to eager. The main reason is that inductor initializes the new tensor with 0 during lowering, which generates a useless cpp kernel: the initialization is not required by the operator semantics but costs additional time. The same problem exists in the lowerings of `empty_strided` and `empty`. This PR removes the useless initialization kernel by generating a NopKernelSchedulerNode instead of a SchedulerNode. The lowering functions of the following operators are optimized:

- `torch.empty`
- `aten.empty`
- `aten.new_empty`
- `aten.empty_strided`
- `aten.new_empty_strided`

We take output code of `new_empty_strided` as example.

_Before change_
```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(float* __restrict__ out_ptr0)
{
    #pragma omp parallel num_threads(28)
    {
        #pragma omp for
        for(long i0=0; i0<57600; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>(static_cast<float>(0));
            tmp0.store(out_ptr0 + 16*i0);
        }
        #pragma omp for simd simdlen(8)
        for(long i0=921600; i0<921600; i0+=1)
        {
            auto tmp0 = static_cast<float>(0);
            out_ptr0[i0] = tmp0;
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf0 = empty_strided((60, 60, 256), (15360, 256, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(buf0.data_ptr()))
    return (buf0, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((60, 60, 256), (60, 1, 3600), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1]))
```
_After change_
```
async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    buf0 = empty_strided((60, 60, 256), (15360, 256, 1), device='cpu', dtype=torch.float32)
    return (buf0, )

if __name__ == "__main__":
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((60, 60, 256), (60, 1, 3600), device='cpu', dtype=torch.float32)
    print_performance(lambda: call([arg0_1]))
```

Performance data for eager vs. inductor:
suite | op_name | improved_ratio_speedup0.2 | improved_ratio_speedup0.5 | improved_ratio_speedup0.8 | speedup_old_0.2 | RSD(3) | speedup_old_0.5 | RSD(3) | speedup_old_0.8 | RSD(3) | speedup_new_0.2 | RSD(3) | speedup_new_0.5 | RSD(3) | speedup_new_0.8 | RSD(3)
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
torchbench | aten.new_empty_strided.default | 235.94% | 100.94% | 50.23% | 0.325947 | 2.96% | 0.550267 | 2.03% | 0.747997 | 2.93% | 1.094985 | 0.81% | 1.105722 | 0.55% | 1.12372 | 0.68%
huggingface | aten.new_empty_strided.default | 120.58% | 81.16% | 87.41% | 0.503116 | 28.27% | 0.668831 | 5.85% | 0.705637 | 2.76% | 1.109785 | 1.70% | 1.211641 | 0.74% | 1.322434 | 0.82%
timm | aten.new_empty_strided.default | 129.24% | 72.75% | 47.91% | 0.490658 | 15.87% | 0.76711 | 13.11% | 0.904033 | 4.44% | 1.124806 | 1.19% | 1.325182 | 0.65% | 1.337114 | 1.01%
torchbench | aten.new_empty.default | 69.41% | 1.60% | 0.90% | 0.732117 | 5.24% | 1.228356 | 1.18% | 1.241341 | 0.81% | 1.24031 | 1.96% | 1.248061 | 1.70% | 1.252525 | 1.84%
huggingface | aten.new_empty.default | 150.01% | 79.29% | 39.91% | 0.49547 | 12.67% | 0.692498 | 22.11% | 0.889526 | 27.37% | 1.238706 | 1.58% | 1.241606 | 1.49% | 1.244506 | 1.41%
timm | aten.new_empty.default | 11.61% | 11.13% | 11.07% | 1.115127 | 0.65% | 1.124302 | 0.80% | 1.132986 | 1.38% | 1.244582 | 1.12% | 1.249459 | 1.31% | 1.258416 | 1.14%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91350
Approved by: https://github.com/EikanWang, https://github.com/anijain2305, https://github.com/jgong5, https://github.com/desertfire
2023-01-06 01:20:17 +00:00
e1a2b0d34f Fix test_math_ops for python-3.11 (#91774)
From [math.pow](https://docs.python.org/3/library/math.html#math.pow) documentation:
> Changed in version 3.11: The special cases `pow(0.0, -inf)` and `pow(-0.0, -inf)` were changed to return `inf` instead of raising [`ValueError`](https://docs.python.org/3/library/exceptions.html#ValueError), for consistency with IEEE 754.
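
A minimal illustration of the version-dependent behavior (hedged sketch):

```python
import math
import sys

# On Python >= 3.11 these special cases follow IEEE 754 and return inf;
# on earlier versions they raise ValueError.
if sys.version_info >= (3, 11):
    print(math.pow(0.0, float("-inf")))   # inf
    print(math.pow(-0.0, float("-inf")))  # inf
else:
    try:
        math.pow(0.0, float("-inf"))
    except ValueError as exc:
        print("older Python raises:", exc)
```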

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91774
Approved by: https://github.com/ngimel
2023-01-06 00:56:43 +00:00
de9c82f41a [Meta] Register aten.pixel_shuffle.default for meta (#91605)
**Summary**
Fixes #91551
`aten.pixel_shuffle.default` is not registered for meta and it always generates contiguous (channels-first) layout of outputs. It can be reproduced by `torch.compile` (as described in the issue #91551) and running in FakeTensorMode.

**Test plan**
python test/inductor/test_torchinductor.py -k test_pixel_shuffle_channels_last
python test/test_proxy_tensor.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91605
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/anijain2305
2023-01-06 00:45:14 +00:00
b2c68c1dea [Quant] Update IDeep to support oneDNN conv add fusion (#90605)
**Summary**
This PR updates IDeep to support oneDNN conv add fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90605
Approved by: https://github.com/jgong5
2023-01-05 23:58:59 +00:00
aab55d6d0d [Quant] Remove all the dequant nodes when the ref module has multi input args (#90157)
**Summary**:
When converting a ref module into a quant module, the `_lower_static_weighted_ref_module` pass assumes the `ref_node` has only 1 input node and removes only the first `dequant` node. We add a check in this PR to ensure this is the case for the `_lower_static_weighted_ref_module` pass.

**Test Plan**:
We only add a check in this PR, there is no new added test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90157
Approved by: https://github.com/Xia-Weiwen, https://github.com/jgong5, https://github.com/jerryzh168
2023-01-05 23:58:45 +00:00
ae0c4c4c29 Update version numbers in torch.{stft,istft} deprecations (#91761)
Since there won't be a 1.14 release, these need to be updated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91761
Approved by: https://github.com/lezcano
2023-01-05 22:17:37 +00:00
2a64365a29 Fix rendering of std/var docs (#91730)
Due to the indentation, "versionchanged" is being rendered as if it was an
argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91730
Approved by: https://github.com/albanD, https://github.com/lezcano
2023-01-05 22:17:37 +00:00
f571ae4fdb Revert "Make torch.device usable as a context manager (#91525)"
This reverts commit 619d52a5d296bc236ac98f40c7f7de54ab7c9d37.

Reverted https://github.com/pytorch/pytorch/pull/91525 on behalf of https://github.com/mehtanirav due to Internal breakages
2023-01-05 21:34:50 +00:00
c73147f741 Revert "[decomp] Use new squeeze.dims overload in decompositions (#91602)"
This reverts commit 9262ffc692a1d2cd49597ae7f0a7e4394feca022.

Reverted https://github.com/pytorch/pytorch/pull/91602 on behalf of https://github.com/clee2000 due to stacked pr was reverted, this is dependent
2023-01-05 20:39:52 +00:00
0100293a7b feat: adding greater_equal Scalar variant (#91324)
Fixes https://github.com/pytorch/functorch/issues/1080

```py
import torch
from functorch import vmap

def f(x):
    return torch.greater_equal(torch.cumsum(x, dim=0), .5 * 10)

x = torch.randn([10,10])
vmap(f)(x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91324
Approved by: https://github.com/zou3519
2023-01-05 20:25:38 +00:00
3b4e4d2b62 Make requirements-ci.txt reading cwd independent (#91771)
Discovered while running `test_typing.py` locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91771
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2023-01-05 20:08:23 +00:00
a5f32f8978 training support for dynamo+torchxla integration (#88449)
We've already shown some promising perf results by integrating dynamo with torchxla for inference. To provide a consistent UX for training and inference, in this PR we try to enable training for dynamo/torchxla.

Training is trickier than inference and we may not expect much perf gain since
1. in the training case, torchxla only generates a single combined graph for fwd/bwd/optimizer, while with the `torchxla_trace_once` bridge we added in dynamo, due to how AOT_Autograd works, we generate 3 graphs: one for the forward, one for the backward, and one for the optimizer. XLA favors larger graphs where it can do more optimizations.
2. in the training case, tracing overhead can be overlapped with computation. Tracing overhead is not as big a deal for training as for inference. After all, training cares more about throughput while inference cares more about latency.
3. in the training case, people can increase the batch size to 'mitigate' the tracing overhead. Increasing the batch size does not change the tracing overhead, so the per-example tracing overhead effectively shrinks.

But we still want to add training support to dynamo/torchxla to make the work complete.

We added an '--iterations-per-run' argument to control how many iterations we do per measure/device sync. This is to understand the impact of item 2 above.

Results:

With '--iterations-per-run' equals to 1, here are the perf numbers:
```
+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |   XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18                |             0.91   |                0.959    |
+-------------------------+--------------------+-------------------------+
| resnet50                |             0.917  |                0.932    |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |             0.912  |                0.905    |
+-------------------------+--------------------+-------------------------+
| alexnet                 |             1.038  |                0.974    |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |             0.881  |                0.835    |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |             0.903  |                0.931    |
+-------------------------+--------------------+-------------------------+
| vgg16                   |             0.914  |                0.967    |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |             1.359  |                0.84     |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |             1.288  |                0.893    |
+-------------------------+--------------------+-------------------------+
| geomean                 |             1.0006 |                0.913794 |
+-------------------------+--------------------+-------------------------+
```

Overall it looks like graph breaks indeed cause a perf loss, but for BERT_pytorch and timm_vision_transformer we still see a perf gain. We need to do more experiments with a larger '--iterations-per-run'.

NOTE:
In torchbench.py I added the following code to do a few workaround:
```
from myscripts import workaround # TODO will remove this line before landing
```

Here are the content of workaround.py:
```
import torch
from torch import nn
import os

# override max_pool2d with avg_pool2d
if os.environ.get("REPLACE_MAXPOOL", "0") == "1":
    torch.nn.MaxPool2d = torch.nn.AvgPool2d

```

It works around a few issues we found:
1. MaxPool2d does not work for training in dynamo/torchxla: https://github.com/pytorch/torchdynamo/issues/1837 . WIP fix from Brian in https://github.com/pytorch/pytorch/pull/90226 , https://github.com/pytorch/xla/pull/4276/files (WIP)
2. a recent change (this PR https://github.com/pytorch/pytorch/pull/88697 ) in op decomposition causes batch_norm ops to fall back in torchxla. Fix from Jack in https://github.com/pytorch/xla/pull/4282#event-7969608134 . (confirmed the fix after adding a Deduper to handle duplicated returns from the fx graph generated by AOTAutograd)
3. we have an issue handling dropout because of a random-seed out-of-sync issue. Here is the fix: https://github.com/pytorch/xla/pull/4293 (confirmed the fix)

Example command:
```
REPLACE_MAXPOOL=1 USE_FAKE_TENSOR=0 GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=aot_torchxla_trace_once --only vgg16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88449
Approved by: https://github.com/wconstab, https://github.com/qihqi, https://github.com/malfet
2023-01-05 19:59:34 +00:00
df4b3b13bc Revert "squeeze: allow squeezing multiple dimensions at once (#89017)"
This reverts commit e26cb06681f4ae92ba28c802cbea263f9a97c2ff.

Reverted https://github.com/pytorch/pytorch/pull/89017 on behalf of https://github.com/mehtanirav due to Internal breakages
2023-01-05 19:25:08 +00:00
f11dc26ed5 [ROCm] tools/stats/monitor.py support (#91732)
Initial support for rocm-smi monitoring of GPU utilization.  Works around difficulties of using the rocm-smi python bindings without having an explicit package.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91732
Approved by: https://github.com/huydhn, https://github.com/pruthvistony
2023-01-05 18:34:11 +00:00
9262ffc692 [decomp] Use new squeeze.dims overload in decompositions (#91602)
This removes the now-redundant `_squeeze_multiple` helpers and instead decomposes into a single call to `aten::squeeze.dims` which also has the effect of reducing the lowered graph size in inductor.
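
A minimal sketch of the call the decomposition now targets, assuming the `squeeze.dims` overload from the companion change (#89017) is available:

```python
import torch

x = torch.randn(1, 3, 1, 4)
# One call squeezes both size-1 dims at once, instead of one squeeze per dim.
y = torch.ops.aten.squeeze.dims(x, [0, 2])
print(y.shape)  # torch.Size([3, 4])
```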
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91602
Approved by: https://github.com/ngimel
2023-01-05 17:59:32 +00:00
3bb63aa387 Revert "Symintify pytorch slicing logic (#91340)"
This reverts commit 8c172fa98a52e95675e9425ac4b23f190f53f9ed.

Reverted https://github.com/pytorch/pytorch/pull/91340 on behalf of https://github.com/clee2000 due to breaking mac builds 8c172fa98a https://github.com/pytorch/pytorch/actions/runs/3845932024/jobs/6550654339, marking this as weird because it was merged via codev?
2023-01-05 17:14:49 +00:00
9ca37d6527 [MPS] Improve the performance of torch.linear() (#91114)
* Clean up redundant headers and namespaces from Linear.mm
* This should improve the Bert sample in #77799  by ~3x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91114
Approved by: https://github.com/DenisVieriu97, https://github.com/malfet, https://github.com/kulinseth
2023-01-05 16:30:27 +00:00
c775eb2879 [CI][ROCm] always stop all docker containers (#91740)
We observed multiple running docker containers on several ROCm self-hosted runners. This commit ensures all containers are stopped prior to starting the tests. This commit also fixes setup/teardown differences between various ROCm workflows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91740
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-01-05 16:28:16 +00:00
1a0738f599 [MPS] Add support for torch.linalg.cross (#91642)
* Add support for torch.linalg.cross
* Make use of `metal::cross` for float and half. For the other dtypes implement cross manually

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91642
Approved by: https://github.com/razarmehr, https://github.com/malfet
2023-01-05 14:48:34 +00:00
8c172fa98a Symintify pytorch slicing logic (#91340)
Differential Revision: [D42223260](https://our.internmc.facebook.com/intern/diff/D42223260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91340
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-01-05 10:33:37 +00:00
18b37bbff9 Clang-Tidy: Improve tensorexpr headers with additional std::moves (#91572)
Splitting #91559 into smaller pieces

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91572
Approved by: https://github.com/ezyang
2023-01-05 09:57:54 +00:00
3d1772857e Apply clang-tidy perf improvements to aten and torch/jit/passes/onnx (#91726)
Applies some minor performance fixups to pytorch regarding an implicit promotion and unnecessary copies (when const ref would have worked just as well).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91726
Approved by: https://github.com/ezyang
2023-01-05 06:48:59 +00:00
bac33ea8b6 [CUDA] Drop CUDA 10 support (#89582)
CC @ptrblck @ngimel @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89582
Approved by: https://github.com/malfet, https://github.com/ngimel
2023-01-05 05:11:53 +00:00
13b3d862dd [vulkan] Move Tensor.* from ops/ folder to api/ folder (#91033)
Moves `Tensor.h` and `Tensor.cpp` from the `ops/` folder to the `api/` folder.

Differential Revision: [D42106179](https://our.internmc.facebook.com/intern/diff/D42106179/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91033
Approved by: https://github.com/kirklandsign
2023-01-05 02:46:49 +00:00
aa562f94b3 [vulkan] Remove dependencies from op/ in vTensor and move it to higher level namespace (#91023)
Small refactor to move any code used by vTensor under the `op/` folder to appropriate locations in the `api/` folder. Also remove vTensor from the `ops` namespace; it now resides in the higher-level `at::native::vulkan` namespace, which will also be used for the Graph data structures in the future.

This is the last step required for vTensor to be able to be moved to the `api/` folder.

Differential Revision: [D42052680](https://our.internmc.facebook.com/intern/diff/D42052680/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91023
Approved by: https://github.com/salilsdesai
2023-01-05 02:30:19 +00:00
c7f32613ec Find other temp directory for code cache if no /tmp (#91701)
Fixes https://github.com/pytorch/torchdynamo/issues/2004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91701
Approved by: https://github.com/anijain2305, https://github.com/wconstab
2023-01-05 02:29:52 +00:00
229f12bf6a [MPS] Implement nan_to_num() for MPS backend (#91110)
Added a test case, and also enabled it in TestConsistency
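
A hedged usage sketch (requires a machine where the MPS backend is available):

```python
import torch

if torch.backends.mps.is_available():
    x = torch.tensor([float("nan"), float("inf"), float("-inf"), 1.0], device="mps")
    # Replace NaN and +/-inf with finite values on the MPS backend.
    print(torch.nan_to_num(x, nan=0.0, posinf=1e4, neginf=-1e4))
```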

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91110
Approved by: https://github.com/malfet, https://github.com/kulinseth
2023-01-05 02:17:48 +00:00
197e57ee68 Use indexing instead of reshape for broadcasting (#91722)
This is needed for MLIR rewrite
This replaces
```
xindex = xoffset + tl.reshape(tl.arange(0, XBLOCK), [XBLOCK, 1])
```
with
```
xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
```
so code is a bit more readable, and compiles with master triton (which doesn't currently support first construct).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91722
Approved by: https://github.com/desertfire
2023-01-05 02:05:31 +00:00
ca62ed9067 [vulkan] Remove ATen dependencies in vTensor class (#91022)
This diff removes all dependencies on ATen from the vTensor class, in preparation for moving the class to the `api/` folder so that it can be part of the core library (i.e. part of the `torch_vulkan_api` target introduced in the diff below, which should have no dependencies on ATen).

Most notably, the constructor of `vTensor` is changed to

```
  vTensor(
      api::Context* context,
      IntArrayRef sizes,
      const c10::ScalarType dtype = c10::kFloat,
      const api::StorageType storage_type = api::StorageType::TEXTURE_3D,
      const c10::MemoryFormat memory_format = c10::MemoryFormat::Contiguous);
```

The constructor no longer accepts a `TensorOptions` argument, since `TensorOptions` is part of ATen. The majority of changes in this diff are due to updating vTensor construction to use the new constructor.

Differential Revision: [D42049862](https://our.internmc.facebook.com/intern/diff/D42049862/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91022
Approved by: https://github.com/kimishpatel
2023-01-05 01:51:02 +00:00
f630294f59 Optimize GELU BFloat16 Impl in CPU path (#79378)
### Description
For the slow path (with non-contiguous inputs) with the `none` or `tanh` approximation, the current bfloat16 impl in ATen is not performance friendly. This PR uses float32 as an intermediate type in order to reduce the heavy cost of bf16-to-fp32 conversion.

### Test
IceLake 2S 32C (Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz)

**single socket (32 cores):**
approximate is `none`:
|input shapes  | forward ( base) (ms) | backward (base) (ms) | forward (optimized) (ms) | backward (optimized) (ms)
|--|------| --| --| --|
|[16, 32, 32] | 0.361 | 1.055 | 0.348 | 0.672
|[32, 32, 64] | 0.084 | 2.003 | 0.076 | 1.426
|[32, 64, 128] | 0.237 | 2.007 | 0.22 | 1.454
|[64, 128, 128] | 2.23 | 6.348 | 1.943 | 4.103

approximate is `tanh`:
|input shapes  | forward ( base) (ms) | backward (base) (ms) | forward (optimized) (ms) | backward (optimized) (ms)
|--|------| --| --| --|
[16, 32, 32] | 0.203 | 1.209 | 0.138 | 0.474
[32, 32, 64] | 0.063 | 2.497 | 0.043 | 0.985
[32, 64, 128] | 0.201 | 2.707 | 0.141 | 1.205
[64, 128, 128] | 1.549 | 8.749 | 1.065 | 3.635

**single core:**
approximate is `none`:
|input shapes  | forward ( base) (ms) | backward (base) (ms) | forward (optimized) (ms) | backward (optimized) (ms)
|--|------| --| --| --|
[16, 32, 32] | 0.359 | 1.055 | 0.267 | 0.592
[32, 32, 64] | 1.11 | 3.483 | 1.063 | 2.373
[32, 64, 128] | 4.478 | 13.866 | 4.27 | 9.426
[64, 128, 128] | 17.675 | 55.231 | 16.805 | 37.509

approximate is `tanh`:
|input shapes  | forward ( base) (ms) | backward (base) (ms) | forward (optimized) (ms) | backward (optimized) (ms)
|--|------| --| --| --|
[16, 32, 32] | 0.202 | 1.212 | 0.138 | 0.473
[32, 32, 64] | 0.776 | 4.843 | 0.531 | 1.872
[32, 64, 128] | 3.203 | 19.267 | 2.16 | 7.243
[64, 128, 128] | 12.33 | 76.834 | 8.286 | 29.553
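
A hedged sketch of the slow (non-contiguous input) path being measured above; the shapes here are illustrative assumptions:

```python
import torch

# Non-contiguous bfloat16 input hits the slow GELU path; per this PR, the
# computation uses float32 internally as the intermediate type.
x = torch.randn(128, 128, dtype=torch.bfloat16).t()  # transposed -> non-contiguous
y_none = torch.nn.functional.gelu(x)                  # approximate="none"
y_tanh = torch.nn.functional.gelu(x, approximate="tanh")
print(y_none.dtype, y_tanh.dtype, y_tanh.shape)
```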

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79378
Approved by: https://github.com/mingfeima
2023-01-05 01:43:17 +00:00
ad7aefb608 Fix Meta tests for FFT functions (#91628)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91628
Approved by: https://github.com/kit1980
2023-01-05 00:58:26 +00:00
b44d46702a [MPS] Fix correctness issues with Upsample 1D and 2D (#91669)
- Implemented the following new ops: upsample_nearest1d_backward, upsample_nearest_exact1d, upsample_nearest_exact1d_backward
- Moved Upsample code from Shape.mm to Upsample.mm
- Fallback to CPU for nearest mode on Monterey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91669
Approved by: https://github.com/malfet
2023-01-05 00:48:54 +00:00
7ff97d2e95 update .circleci/docker/common/install_cmake.sh for centos (#91647)
Otherwise .circleci/docker/common/install_cmake.sh fails for centos due to use of apt-get instead of yum.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91647
Approved by: https://github.com/malfet
2023-01-05 00:43:10 +00:00
64a3738fcd [vulkan] Remove external dependencies in core API and introduce torch_vulkan_api target (#91021)
This diff isolates the core components of the Pytorch Vulkan backend into its own target (`//xplat/caffe2:torch_vulkan_api`). The main motivation for this is to create a library that does not have a dependency on the ATen library which can then be used to build a graph mode runtime for Vulkan for Executorch.

In addition to introducing the new target, this diff also removes some references to external dependencies in the `api/` folder so that files in that folder are completely self contained.

Differential Revision: [D42038817](https://our.internmc.facebook.com/intern/diff/D42038817/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D42038817/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91021
Approved by: https://github.com/kirklandsign
2023-01-05 00:41:23 +00:00
700399e3f1 Make sure the ends of linspace are correct regardless of the precision (#91625)
This operation is usually called with small sizes, so the fact that this
adds a couple of operations should be alright. Moreover, given the
structure of the data, the branching in the `where` is pretty much free.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91625
Approved by: https://github.com/peterbell10, https://github.com/ngimel
2023-01-05 00:23:19 +00:00
223d1aa692 Improve linspace decomposition and remove its lowering (#91621)
The code produced by the lowering and the decomposition is now the same
modulo a casting to `float32`. This casting is necessary as otherwise
the tests do not pass due to accuracy errors. We prefer accuracy over
speed here, given that this is an associative scan, and thus it's prone
to numerical errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91621
Approved by: https://github.com/ngimel
2023-01-05 00:23:19 +00:00
6790a558dd Simplify macOS build instruction (#91561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91561
Approved by: https://github.com/malfet
2023-01-05 00:10:16 +00:00
d6bd67f2eb vmap support for torch.trace (#91679)
Fixes #91404

As expected

```python
import torch
from functorch import vmap
x = torch.randn(32, 3, 3, 3)
y = vmap(torch.trace)(x)
print(y)
```

Now gives the exact same runtime error as eager mode

```
(sourcetorch) ubuntu@ip-172-31-39-26:~/test$ python functorch_test_pos.py
Traceback (most recent call last):
  File "functorch_test_pos.py", line 4, in <module>
    y = vmap(torch.trace)(x)
  File "/home/ubuntu/pytorch/torch/_functorch/vmap.py", line 420, in wrapped
    return _flat_vmap(
  File "/home/ubuntu/pytorch/torch/_functorch/vmap.py", line 39, in fn
    return f(*args, **kwargs)
  File "/home/ubuntu/pytorch/torch/_functorch/vmap.py", line 605, in _flat_vmap
    batched_outputs = func(*batched_inputs, **kwargs)
RuntimeError: trace: expected a matrix, but got tensor with dim 3
```

Equivalent eager code

```python
import torch
x = torch.randn(32, 3, 3, 3)
results = []
for xi in x:
  y = torch.trace(xi)
  results.append(y)
```

```
(sourcetorch) ubuntu@ip-172-31-39-26:~/test$ python functorch_test_neg.py
Traceback (most recent call last):
  File "functorch_test_neg.py", line 5, in <module>
    y = torch.trace(xi)
RuntimeError: trace: expected a matrix, but got tensor with dim 3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91679
Approved by: https://github.com/zou3519
2023-01-04 23:45:49 +00:00
56db21aec1 [Checkpoint][Test] Add test for optimizer state_dict and resharding to 2d checkpoint test (#91092)
This PR updates the 2d checkpoint model state test to include:
1. optimizer state dict test
2. simple resharding test  (pg change)
3. rename test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91092
Approved by: https://github.com/fduwjj
2023-01-04 23:26:30 +00:00
7dd28e9e83 [MPS] Fix data type and shape issues in Scatter and Gather ops (#91514)
- Clean up redundant code and headers
- Move scatter/gather ops from block list to allow list in TestConsistency
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91514
Approved by: https://github.com/kulinseth
2023-01-04 23:20:01 +00:00
fc59664ef4 [MPS] Add Unique and unique_consecutive ops. (#88532)
Add check for macos 13.0

Fixes #88487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88532
Approved by: https://github.com/malfet
2023-01-04 22:15:13 +00:00
13de5a0150 [MPS] Fix the right padding bug in Monterey (#91522)
- Workaround for the bool type bug in padding (needed for both Monterey and Ventura)
- Move the recently fixed padding tests of TestConsistency to AllowList

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91522
Approved by: https://github.com/DenisVieriu97, https://github.com/kulinseth, https://github.com/malfet
2023-01-04 22:00:37 +00:00
1effabe257 Support per-parameter test decoration (#91658)
Continuation of #79979.

Fixes #79161

This PR does the following:
* Expands the `parametrize_fn()` signature from returning a 3-tuple of `(test, test_name, param_kwargs)` to returning a 4-tuple of `(test, test_name, param_kwargs, decorator_fn)`. Expected signature for the addition is `decorator_fn(param_kwargs) -> List[decorator]` i.e. given the full set of test params, return a list of decorators to apply.
    * `modules`, `ops`, and `parametrize` now fit the new signature, returning `decorator_fn`s instead of applying decorators themselves.
    * `instantiate_parametrized_tests()` and `instantiate_device_type_tests()` now call the returned `decorator_fn`, passing in the full set of `param_kwargs` (after composition + `device` / `dtype` additions) and applying the returned decorators.
    * Composing multiple `parametrize_fn`s also composes the corresponding `decorator_fn`s; the composed `decorator_fn` simply concatenates the decorator lists returned by the constituents.
* Expands `DecorateInfo.is_active` to support callables:
```python
DecorateInfo(
    unittest.expectedFailure, "TestOps", "test_python_ref_executor",
    device_type='cuda', active_if=lambda params: params['executor'] == 'nvfuser'
),
```
* Adds several tests to `test/test_testing.py` ensuring proper decoration using `@parametrize`, `@modules`, and `@ops`.
* (minor) Fixes a couple `ModuleInfo` naming oddities uncovered during testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91658
Approved by: https://github.com/malfet
2023-01-04 21:08:32 +00:00
0e60bef516 [Lint] Update clang-tidy to 11.1.0 (#91709)
Also, add option to download to distinguish between universal/i386 only
and separate i386 and arm binaries for MacOS

Follow up for https://github.com/pytorch/test-infra/pull/1354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91709
Approved by: https://github.com/huydhn
2023-01-04 20:04:07 +00:00
d4713b4c7d [dynamo] Fix bug in tensor.item fake tensor propagation (#91668)
When we run the node with a fake value for tensor.item, it previously errored because the utility method doesn't know how to handle a placeholder node. The tensor we call item() on can be a user input, which appears as a placeholder node in the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91668
Approved by: https://github.com/voznesenskym
2023-01-04 19:51:19 +00:00
4bad40f559 Revert "inductor: add conv+hardsigmoid fusion for cpu path (#91433)"
This reverts commit 1d2bfea33e59d2e6fbff57755cd92d9942488a23.

Reverted https://github.com/pytorch/pytorch/pull/91433 on behalf of https://github.com/mehtanirav due to Internal breakages due to different ideep version
2023-01-04 19:44:26 +00:00
c18e8c68d8 [ROCm] fix parallel test runners and device visibility (#91137)
Fixes #90940.  This PR revamps how tests are run in parallel as well as device visibility at the docker container and within the run_test.py test runner.

First, running multiple test modules concurrently on the same GPU was causing instability for ROCm runners manifesting as timeouts.  ROCm runners have at least 1 GPU each, but often 2 or more.  This PR allows NUM_PROCS to be set equal to the number of devices available, but also takes care to set HIP_VISIBLE_DEVICES to avoid oversubscribing any GPU.

Second, we had introduced env vars `-e ROCR_VISIBLE_DEVICES` (#91031) to prepare for two GHA runners per CI node, to split up the GPU visibility at the docker level between the two runners.  This effort wasn't fully realized; to date, we haven't had more than one runner per CI host.  We abandon this effort in favor of all GPUs being visible to a single runner and managing GPU resources as stated above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91137
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/pruthvistony
2023-01-04 19:40:05 +00:00
5a6019033f [bazel] change visibility for //c10:headers (#91422)
At Cruise we actively depend on the c10 headers; I'm not certain what the reason is for hiding them at the pkg level.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91422
Approved by: https://github.com/malfet
2023-01-04 19:04:35 +00:00
17bc40c19d add __hash__ to FunctionSchema (#90730)
This PR adds `__hash__` to the FunctionSchema pybind binding, so that
it can be used for things like dict indexing.
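
A hedged usage sketch of what this enables (the `_schema` attribute access is an assumption based on the OpOverload API):

```python
import torch

# With __hash__ defined, a FunctionSchema can key a dict, e.g. to attach
# metadata per operator overload.
schema = torch.ops.aten.add.Tensor._schema
notes = {schema: "pointwise add overload"}
print(notes[schema])
```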
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90730
Approved by: https://github.com/ezyang
2023-01-04 18:59:22 +00:00
a7749ae177 [reland] rename DisableTorchFunction to DisableTorchFunctionSubclass (#88218) (#89221)
Summary: First half of #87990. This doesn't change any of the behavior and is just a rename

#88218 got reverted for internal breakages. This is the reland of started from internal

Differential Revision:
D41268423

LaMa Project: L1098534

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89221
Approved by: https://github.com/meliy-meyada, https://github.com/zou3519
2023-01-04 18:32:49 +00:00
a5e2309f5e [bazel] Add @pytorch in tools/bazel.bzl (#91424)
This is a follow-up from #89660
There is another place that needs to be updated.

I think this time I covered all of them...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91424
Approved by: https://github.com/malfet
2023-01-04 18:28:19 +00:00
1e725c9747 Avoid device casting for all singleton tensors in optimizer states (#91454)
Fixes #75224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91454
Approved by: https://github.com/janeyx99
2023-01-04 17:55:00 +00:00
979255067d [MPS] Fix the crash in max_out() caused by cached key conflict (#91520)
The shape of input and indices tensors were missing in the cached key
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91520
Approved by: https://github.com/DenisVieriu97, https://github.com/kulinseth, https://github.com/malfet
2023-01-04 17:53:19 +00:00
ce9963e6ba Fix typo in _lobpcg.py (#91641)
represenation -> representation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91641
Approved by: https://github.com/zou3519
2023-01-04 15:19:05 +00:00
66b3325304 Adds more nvidia pypi dependencies (#89944)
This PR adds more nvidia pypi dependencies for cuda 11.7 wheel. Additionally, it pins cufft version to 10.9.0.58 to resolve https://github.com/pytorch/pytorch/issues/88038

Depends on: https://github.com/pytorch/builder/pull/1196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89944
Approved by: https://github.com/atalman
2023-01-04 15:08:08 +00:00
e26cb06681 squeeze: allow squeezing multiple dimensions at once (#89017)
Ref #70924

This addresses part 1 of the issue, allowing `torch.squeeze` to be
passed a tuple of dimensions. e.g.
```python
x.squeeze(0).squeeze(0)
```
can now be written
```python
x.squeeze((0, 1))
```
(assuming x has at least 2 dimensions)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89017
Approved by: https://github.com/albanD
2023-01-04 14:40:56 +00:00
3120054c15 Vectorize norm(double, p=2) on cpu (#91502)
This gives a speed up of 100x on my machine:

```
[------------------ Master -------------------]
                                |  (200000, 3)
32 threads: ----------------------------------
      torch linalg_norm         |     10000
      torch linalg_vector_norm  |     10000
      torch custom              |       397
      numpy norm                |      3123
      numpy custom_np           |      3119

Times are in microseconds (us).

[------------------- PR -------------------]
                                |  (200000, 3)
32 threads: ----------------------------------
      torch linalg_norm         |       107
      torch linalg_vector_norm  |       100
      torch custom              |       400
      numpy norm                |      3170
      numpy custom_np           |      3162

Times are in microseconds (us).
```

Fixes https://github.com/pytorch/pytorch/issues/91373
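
A hedged usage sketch matching the benchmarked shape:

```python
import torch

# Row-wise 2-norms of a float64 (200000, 3) tensor, which exercises the
# vectorized CPU path this PR adds.
x = torch.randn(200000, 3, dtype=torch.float64)
norms = torch.linalg.vector_norm(x, ord=2, dim=-1)
print(norms.shape)  # torch.Size([200000])
```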

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91502
Approved by: https://github.com/mingfeima, https://github.com/ngimel
2023-01-04 08:03:38 +00:00
2004df9097 Remove python ddp (#91663)
As it is not used by anyone and is not maintained by PyTorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91663
Approved by: https://github.com/rohan-varma
2023-01-04 05:22:30 +00:00
ebb7f20afc quant: make various configs printable (#91419)
Summary:

Makes various quantization configs print out human readable values instead
of just the class name. This is useful when printing these configs out when
debugging.

Test plan:

test script
```
conf_1 = torch.ao.quantization.backend_config.backend_config.DTypeConfig()
print(conf_1)

conf_2 = torch.ao.quantization.backend_config.backend_config.BackendConfig()
print(conf_2)

conf_3 = torch.ao.quantization.backend_config.backend_config.BackendPatternConfig()
print(conf_3)

conf_4 = torch.ao.quantization.fx.custom_config.PrepareCustomConfig()\
    .set_input_quantized_indexes([0])
print(conf_4)

conf_5 = torch.ao.quantization.fx.custom_config.ConvertCustomConfig()\
    .set_preserved_attributes(['foo'])
print(conf_5)

conf_6 = torch.ao.quantization.fx.custom_config.FuseCustomConfig()\
    .set_preserved_attributes(['foo'])
print(conf_6)
```

test script output
```
DTypeConfig(input_dtype_with_constraints=DTypeWithConstraints(dtype=None, quant_min_lower_bound=None, quant_max_
upper_bound=None, scale_min_lower_bound=None, scale_max_upper_bound=None, scale_exact_match=None, zero_point_exa
ct_match=None), output_dtype_with_constraints=DTypeWithConstraints(dtype=None, quant_min_lower_bound=None, quant
_max_upper_bound=None, scale_min_lower_bound=None, scale_max_upper_bound=None, scale_exact_match=None, zero_poin
t_exact_match=None), weight_dtype_with_constraints=DTypeWithConstraints(dtype=None, quant_min_lower_bound=None,
quant_max_upper_bound=None, scale_min_lower_bound=None, scale_max_upper_bound=None, scale_exact_match=None, zero
_point_exact_match=None), bias_dtype=None, is_dynamic=None)
BackendConfig({'name': '', '_pattern_complex_format_to_config': {}})
BackendPatternConfig({'observation_type': <ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT: 0>})
PrepareCustomConfig({'input_quantized_indexes': [0]})
ConvertCustomConfig({'preserved_attributes': ['foo']})
FuseCustomConfig({'preserved_attributes': ['foo']})
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91419
Approved by: https://github.com/andrewor14
2023-01-04 04:52:20 +00:00
316ba9e6fc Run jit legacy tests sequentially (#91518)
Fixes https://github.com/pytorch/pytorch/issues/91457.  I have been re-running the 2 tests `test_jit_legacy` and `test_jit_fuser_legacy` in the `jit_legacy` shard multiple times (100+) without finding any flaky issues.  I suspect that we might have test parallelization flakiness here, so this PR runs these 2 tests serially.  They take less than 5 minutes to finish, so running them sequentially won't be an issue (https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=jit_legacy)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91518
Approved by: https://github.com/clee2000
2023-01-04 04:13:01 +00:00
80394bb734 [MPS] Register norm_dtype_out_mps and cdist (#91643)
Add support for `norm_dtype_out` and `cdist` ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91643
Approved by: https://github.com/razarmehr
2023-01-04 02:20:53 +00:00
619d52a5d2 Make torch.device usable as a context manager (#91525)
Fixes https://github.com/pytorch/pytorch/issues/82296
Fixes https://github.com/pytorch/pytorch/issues/27878
Fixes https://github.com/pytorch/pytorch/issues/260
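
A minimal sketch of the behavior this enables (hedged; shown with the CPU device):

```python
import torch

# torch.device can now be used as a context manager that sets the default
# device for factory functions called inside the block.
with torch.device("cpu"):
    x = torch.empty(3)
print(x.device)  # device(type='cpu')
```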

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91525
Approved by: https://github.com/albanD
2023-01-04 01:32:00 +00:00
aa0ca994ca [Inductor] add missing ops for cpp vectorization overrides (#90750)
In micro-benchmarks, aten.elu.default and aten.elu_backward.default perform poorly with inductor compared to eager. The main reason is the lack of vectorization. By adding the missing ops to the cpp vectorization overrides, vectorization can be applied successfully.

Performance data for eager vs. inductor:
op | speedup_old | RSD (3) | speedup_new | RSD (3) | increased_performance
-- | -- | -- | -- | -- | --
aten.elu.default | 0.205947276 | 1.73% | 0.995302802 | 4.76% | 383.28%
aten.elu_backward.default | 0.336280639 | 0.58% | 1.69473642 | 1.96% | 403.96%


The newly supported ops for cpp vectorization overrides:
- eq
- ne
- lt
- gt
- le
- ge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90750
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2023-01-04 01:31:43 +00:00
1d2bfea33e inductor: add conv+hardsigmoid fusion for cpu path (#91433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91433
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-01-04 01:22:07 +00:00
6f9a4ae5c9 Revert "Populate the eviction_policy field for load/store properly (#91316)"
This reverts commit 3f4e87beaf67ec44d609605777d9da9e65cfbdd9.

Reverted https://github.com/pytorch/pytorch/pull/91316 on behalf of https://github.com/ngimel due to regresses performance
2023-01-04 00:47:37 +00:00
e116f1a3ff Add an env variable to disable addmm_cuda_lt kernel (#91436)
addmm_cuda_lt fails for some corner cases. So far we cannot reproduce these corner cases in the unit tests; the failures seem to depend on more than the matrices' shapes and strides. For now, add an environment variable that allows users to disable this kernel for such corner cases.
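A rough sketch of how such an escape hatch is typically used; the variable name `DISABLE_ADDMM_CUDA_LT` below is an assumption for illustration, not something stated in this log:

```python
import os

# Assumed variable name; set it before CUDA work so addmm falls back to the
# regular cuBLAS path instead of cublasLtMatmul.
os.environ["DISABLE_ADDMM_CUDA_LT"] = "1"

import torch

bias = torch.randn(80, device="cuda")
mat1 = torch.randn(1024, 160, device="cuda")
mat2 = torch.randn(160, 80, device="cuda")
out = torch.addmm(bias, mat1, mat2)  # shape (1024, 80)
```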

**See the case one with more error logs:**

RuntimeError: 0CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 80 n 1024 k 160 mat1_ld 160 mat2_ld 160 result_ld 80 abcType 14 computeType 68 scaleType 0 result_shape 1024 80  result_stride 80 1  self_shape 80  self_stride 1  mat1_shape 1024 160  mat1_stride 160 1  mat2_shape 160 80  mat2_stride 1 160
Exception raised from gemm_and_bias at fbcode/caffe2/aten/src/ATen/cuda/CUDABlas.cpp:1071 (most recent call first):

**another case with more error logs:**

RuntimeError: 0CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 16 n 16384 k 48 mat1_ld 48 mat2_ld 48 result_ld 16 abcType 14 computeType 68 scaleType 0 result_shape 16384 16  result_stride 16 1  self_shape 16  self_stride 1  mat1_shape 16384 48  mat1_stride 48 1  mat2_shape 48 16  mat2_stride 1 48
Exception raised from gemm_and_bias at fbcode/caffe2/aten/src/ATen/cuda/CUDABlas.cpp:1071 (most recent call first):

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91436
Approved by: https://github.com/ngimel
2023-01-04 00:46:19 +00:00
162474d7fd [functorch] add new ensembling api, demonstrate in example (#88850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88850
Approved by: https://github.com/zou3519
2023-01-04 00:33:14 +00:00
c5e5916fff [functorch] add functorch functional_call, update tests to test this (#89213)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89213
Approved by: https://github.com/zou3519
2023-01-04 00:33:14 +00:00
264f5ed516 [autograd.Function] Add docs on the functorch interaction (#91452)
This PR:
- Updates autograd.Function.forward docs to reflect how you either
  define a forward with ctx or a separate forward and setup_context
- Updates the "Extending Autograd" docs to suggest the usage of
  autograd.Function with separate forward and setup_context. This should
  be the default because there is a low barrier to go from this to
  an autograd.Function that is fully supported by functorch transforms.
- Adds a new "Extending torch.func with autograd.Function" doc that
  explains how to use autograd.Function with torch.func. It also
  explains how to use generate_vmap_rule and how to manually write a
  vmap staticmethod.
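A small sketch (not copied from the docs) of the separate forward/setup_context style described above:

```python
import torch

class Scale(torch.autograd.Function):
    @staticmethod
    def forward(x, scale):
        # In this style, forward does not receive ctx.
        return x * scale

    @staticmethod
    def setup_context(ctx, inputs, output):
        _, scale = inputs
        ctx.scale = scale

    @staticmethod
    def backward(ctx, grad_out):
        # scale is a plain Python number here, so it gets no gradient.
        return grad_out * ctx.scale, None

x = torch.randn(3, requires_grad=True)
Scale.apply(x, 2.0).sum().backward()
print(x.grad)  # tensor([2., 2., 2.])
```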

While writing this, I noticed that the implementation of
setup_context staticmethod/generate_vmap_rule/vmap staticmethod are a
bit inconsistent with the other method/attributes on autograd.Function:
- https://github.com/pytorch/pytorch/issues/91451
- I'm happy to fix those if we think it is a problem, either in this PR
  or a followup (this PR is getting long, I want some initial docs
  out that I can point early adopters at, and fixing the problems in the
  future isn't really BC-breaking).

Test Plan:
- view docs preview
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91452
Approved by: https://github.com/soulitzer
2023-01-04 00:28:19 +00:00
31a699934b Remove CircleCI ios PR jobs (#91638)
We added this because we wanted to burn our extra CircleCI credits, but now that it's the next year, those should be gone.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91638
Approved by: https://github.com/huydhn, https://github.com/seemethere, https://github.com/malfet
2023-01-04 00:27:49 +00:00
38de981e16 [MPS] Add nonzero mps support (#91616)
Adds nonzero support for mps:

  **Pseudocode**:
  ```
  //
  // inputTensor   = [1,  0,  0,  3]
  // inputNonZero  = [1,  0,  0,  1] (input != 0)
  // scan          = [1,  1,  1,  2] (prefix sum)
  // maskedIndices = [0, -1, -1,  1] (select)
  // coordinates   = [0,  1,  2,  3] (coordinateAlongAxis)
  // scatterResult = [0,  3]         (scatter)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91616
Approved by: https://github.com/razarmehr
2023-01-04 00:02:24 +00:00
eqy
97ff20d722 [cuBLAS] (re-open) Fix default cuBLAS workspace size and parsing for multiple workspaces (#91564)
re-open of #89027
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91564
Approved by: https://github.com/ngimel
2023-01-03 23:48:15 +00:00
0a6053e9b5 Revert "Avoid copies in matmul (#76828)"
This reverts commit 8c2e82b48790afb7df8d77ffd9ced74083a3f5b7.

Reverted https://github.com/pytorch/pytorch/pull/76828 on behalf of https://github.com/mehtanirav due to Internal breakages
2023-01-03 23:36:58 +00:00
6bf0e3b697 [inductor] Check for BackendCompilerFailed on CI (#91634)
Summary: https://github.com/pytorch/pytorch/pull/91283/ skips certain random triton failures on CI, but we need to check against the BackendCompilerFailed exception type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91634
Approved by: https://github.com/ngimel
2023-01-03 22:38:29 +00:00
3a60debe9d implement ordering (#91362)
# Summary

In some cases, depending on the input, flash-attention is not the fastest fused kernel and memory-efficient attention is better. This implements a simple heuristic function for deciding the ordering of the kernel functions. It is based on the xformers dispatch function found here: 15bff4986c/xformers/ops/fmha/dispatch.py (L13)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91362
Approved by: https://github.com/cpuhrsch
2023-01-03 22:33:14 +00:00
743c385543 refactor show_traces in memory_tracker (#90145)
Refactor show_traces in memory_tracker so that it plots multiple figures and can also load serialized stats and then plot figures from them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90145
Approved by: https://github.com/rohan-varma
2023-01-03 22:10:15 +00:00
b6bb726cc3 Revert "Dispatch the auxiliary frobenius_norm and nuclear_norm to better implementations and deprecate them (#81763)"
This reverts commit 122245985a544d9d74d7b5037493541f5e525498.

Reverted https://github.com/pytorch/pytorch/pull/81763 on behalf of https://github.com/mehtanirav due to Internal breakages
2023-01-03 21:54:25 +00:00
57b7f33ba8 [Inductor] Move graph.lint() in Intel's FX Passes to the End of Loop to Reduce Compile Time (#91179)
Summary: Move `graph.lint()` in Intel's FX passes to the end of the loop to reduce compile time, as there is no need to place `graph.lint()` within the loop.

Test Plan: CI

Reviewed By: jansel

Differential Revision: D41964322

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91179
Approved by: https://github.com/XiaobingSuper, https://github.com/jansel
2023-01-03 21:26:31 +00:00
818079dc4e disabled flaky c2 test (#91640)
Summary: disables flaky test, T93236537

Test Plan: Existing tests

Differential Revision: D42314944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91640
Approved by: https://github.com/malfet
2023-01-03 21:26:21 +00:00
7ef7c57ae7 CSC/BSC -> COO coalesce fix (#91440)
Fixes https://github.com/pytorch/pytorch/issues/91010.

CSC and BSC sparse formats are not inherently `coalesced`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91440
Approved by: https://github.com/pearu, https://github.com/amjames, https://github.com/cpuhrsch
2023-01-03 18:42:39 +00:00
4709523722 Revert D42051833: Multisect successfully blamed D42051833 for test or build failures (#91458)
Summary:
This diff is reverting D42051833
D42051833 has been identified to be causing the following test or build failures:

Tests affected:
- [//xplat/pytorch_models/build/MultitaskPeopleSegmentation/v7020:MultitaskPeopleSegmentation7020_testAndroid-64bit - runAllTests (com.facebook.xplat.XplatTestRunner)](https://www.internalfb.com/intern/test/281475056077477/)
- [//xplat/pytorch_models/build/MultitaskPeopleSegmentation/v4020:PYTORCH_MODEL_testAndroid-64bit - runAllTests (com.facebook.xplat.XplatTestRunner)](https://www.internalfb.com/intern/test/844425007913475/)

Here's the Multisect link:
https://www.internalfb.com/intern/testinfra/multisect/1478566
Here are the tasks that are relevant to this breakage:
T93205881: 15 tests started failing for oncall ai_infra_mobile_platform in the last 2 weeks
We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.

Test Plan: NA

Differential Revision: D42090396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91458
Approved by: https://github.com/kit1980
2023-01-03 18:17:35 +00:00
2965d7e11a [CI] Disable rocm distributed tests (#91632)
As they have been broken since Dec 16th.
See https://github.com/pytorch/pytorch/issues/91630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91632
Approved by: https://github.com/atalman, https://github.com/albanD
2023-01-03 17:14:19 +00:00
688e351970 [MPS] Implement MPSGenerator to enable manual random seeding (#91348)
This patch adds support for creating torch.Generator for MPS device, and enables its functions such as manual_seed, get_state, and set_state.
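A short sketch of the kind of usage this enables (assuming an MPS-capable machine):

```python
import torch

g = torch.Generator(device="mps")
g.manual_seed(42)

a = torch.randn(3, device="mps", generator=g)
state = g.get_state()                          # capture the generator state
b = torch.randn(3, device="mps", generator=g)

g.set_state(state)                             # rewind
c = torch.randn(3, device="mps", generator=g)  # should match b
```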
Fixes #84288 and #84516
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91348
Approved by: https://github.com/malfet, https://github.com/albanD
2023-01-03 16:01:19 +00:00
dfb651452a inductor: meta registration for mkldnn ops (#91299)
Fix https://github.com/pytorch/torchdynamo/issues/198 by supporting Meta tensors for conv/linear fused ops to reduce the compilation time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91299
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-01-03 14:24:36 +00:00
8c2e82b487 Avoid copies in matmul (#76828)
With this PR, matmul folds a bmm into an mm or mv if and only if it can do so without copying. We add tests to make sure that our algorithm for detecting this is accurate.

For the cases where it was copying before see https://github.com/pytorch/pytorch/pull/75197#discussion_r843413208 https://github.com/pytorch/pytorch/pull/75197#discussion_r863489479 https://github.com/pytorch/pytorch/pull/75197#discussion_r863489805

Fixes https://github.com/pytorch/pytorch/issues/76702
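For illustration, a sketch of the kind of case being discussed: a batched operand times a plain matrix, which can be folded into a single mm without copying when the batched operand's layout allows it.

```python
import torch

a = torch.randn(8, 16, 32)   # batched left operand
b = torch.randn(32, 64)      # plain matrix

out = torch.matmul(a, b)     # foldable into a.reshape(128, 32) @ b, reshaped back
assert out.shape == (8, 16, 64)
```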

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76828
Approved by: https://github.com/ngimel
2023-01-03 14:18:38 +00:00
db2a237763 Revert "Avoid copies in matmul (#76828)"
This reverts commit 0c3659586d26a762426805af5d4536e0dd01a0c6.

Reverted https://github.com/pytorch/pytorch/pull/76828 on behalf of https://github.com/lezcano due to Makes functorch tests fail
2023-01-03 12:26:29 +00:00
2b0abd4ce3 symbolic shapes: add parenthesis around FloorDiv expression (#91554)
Before it would print the guard expression like:
`2*3//2`
and now:
`2*(3//2)`

```python
print(2*3//2)   # 3
print(2*(3//2)) # 2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91554
Approved by: https://github.com/ezyang
2023-01-03 11:12:08 +00:00
f7939b21e1 [MPS] Add bincount support for mps (#91267)
Add support for bincount on MPS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91267
Approved by: https://github.com/razarmehr
2023-01-03 06:01:07 +00:00
cb3204823e adding test to audit CompositeImplicitAutograd ops that do not have a batching rule (#91367)
Fixes https://github.com/pytorch/functorch/issues/1087

It looks like there are `306` rules that should be looked into
```
test/functorch/test_vmap_registrations.py .x.....xxxxxxx.x.x.x.x.x.x.x.x........xx.x.x..x.x.xxx...xxxx.x.x.x........x.........xxxxx..x..x.....xx...xx.....xxx.xxxxxxxxxxxxxxxxx.. [ 24%]
.........x.x......x.xxxxxx..x..xx.x.xxx.x.......x.xxx.xx..xxx.xxx...xxxxx.x....xxxxxxxxxxxxxxx....xx.xxx.xx.x...xx...xx...xxxxxx...xxxxx..x...xxxxxxxxxxxx..xx..xx.xx.x..xxxx..xx [ 56%]
.xx..x.x....xxxxxx.x.xx...xxxxx.xx...x..x.x.xx...xx.xxxxxx.xxxxxx..x........xxxxxxxx..xxxxxxxx..xx.xxxxxxxxxxxxxxxxxxxxxxx..........xxxx.xxxx.........xxxxxxxx..xxx..xxx.x.x.x.xx [ 88%]
xx.xxx.x......xxx.x.xxxxxxxx....x......xxxxxxxxx.xx.x.x.x.......xx                                                                                                                [100%]

=================================================================== 249 passed, 1185 deselected, 306 xfailed in 3.17s ===================================================================

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91367
Approved by: https://github.com/zou3519
2023-01-03 04:21:39 +00:00
6e236553f5 implemented test and Changed assert to TORCH_CHECK #88808 (#91273)
Fixes #88808
Replaced `AT_ASSERT(dims < MAX_TENSORINFO_DIMS)` in aten/src/ATen/cuda/detail/TensorInfo.cuh with

```
  data = p;
  dims = dim;
  TORCH_CHECK(dims < MAX_TENSORINFO_DIMS, "CUDA Tensors cannot have more than 25 dimensions");
    }

```

In torch/testing/_internal/common_methods_invocations.py:

```
def error_inputs_median(op_info, device, **kwargs):
    x = torch.tensor([[[[[[[[[[[[[[[[[[[[[[[[[nan],
                               [nan]]]]]]]]]]]]]]]]]]]]]]]]], device=device)
    if device=='cuda':
        yield ErrorInput(SampleInput(x, kwargs=dict(dim=(-1))),
                        error_type=RuntimeError,
                        error_regex='CUDA Tensors cannot have more than 25 dimensions')
    else:
        return

```
And

```
    OpInfo('median',
           ...
           error_inputs_func=error_inputs_median,
           ...

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91273
Approved by: https://github.com/ngimel
2023-01-03 03:06:33 +00:00
cce577b391 Revert D42257039: Multisect successfully blamed D42257039 for test or build failures (#91548)
Summary:
This diff is reverting D42257039
D42257039 has been identified to be causing the following test or build failures:

Tests affected:
- [assistant/neural_dm/rl/modules/tests:action_mask_classifier_test - main](https://www.internalfb.com/intern/test/281475048940766/)

Here's the Multisect link:
https://www.internalfb.com/intern/testinfra/multisect/1493969
Here are the tasks that are relevant to this breakage:
T93770103: 1 test started failing for oncall assistant_multimodal in the last 2 weeks
We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.

Test Plan: NA

Reviewed By: weiwangmeta

Differential Revision: D42272391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91548
Approved by: https://github.com/kit1980
2023-01-02 21:08:30 +00:00
fae821c2f1 fix inductor linspace when steps=1 (#91578)
Fixes #91506
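A minimal sketch of the edge case named in the title, comparing eager and compiled results (the exact failure mode is described in the linked issue):

```python
import torch

def fn(x):
    return torch.linspace(0, 10, steps=1) + x  # steps=1 yields just the start value

x = torch.zeros(1)
print(fn(x))                 # tensor([0.])
print(torch.compile(fn)(x))  # should now match eager
```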

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91578
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-01-02 20:30:39 +00:00
0c3659586d Avoid copies in matmul (#76828)
With this PR, matmul folds a bmm into an mm or mv if and only if it can do so without copying. We add tests to make sure that our algorithm for detecting this is accurate.

For the cases where it was copying before see https://github.com/pytorch/pytorch/pull/75197#discussion_r843413208 https://github.com/pytorch/pytorch/pull/75197#discussion_r863489479 https://github.com/pytorch/pytorch/pull/75197#discussion_r863489805

Fixes https://github.com/pytorch/pytorch/issues/76702
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76828
Approved by: https://github.com/ngimel
2023-01-02 20:07:38 +00:00
122245985a Dispatch the auxiliary frobenius_norm and nuclear_norm to better implementations and deprecate them (#81763)
These functions will be legacy functions. We deprecate them, but we also
take this chance to dispatch to a more efficient and consistent implementation.
Doing so should help writing a conversion rule for these to be able to
remove them once and for all
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81763
Approved by: https://github.com/ngimel
2023-01-02 18:32:39 +00:00
b797a24259 Support indices contiguity per batch and non-contiguous values in sparse compressed tensors (#91243)
Fixes https://github.com/pytorch/pytorch/issues/91062

With this PR, all reported failures in https://github.com/pytorch/pytorch/pull/90849 are resolved (modulo test_bmm that uses an unorthodox way to construct a batch CSR tensor).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91243
Approved by: https://github.com/nikitaved, https://github.com/amjames, https://github.com/lezcano
2023-01-02 18:08:46 +00:00
dbf96164be [MPS] Add suport for casting updatesTensor directly in scatter (#91197)
Fixes copies into slices where the input data type is different than the output dtype.

This change removes the cast done before scatter, so we don't have to allocate additional memory to perform the casting. Scatter handles the casting directly now.

device = "mps"
shape = (4, 4)
tensor = torch.randint(10, shape, device=device)
tensor_before = tensor.clone()
res = torch.empty(shape[0], shape[1] * 2, device=device)[:, ::2].copy_(tensor)
torch.testing.assert_close(tensor, tensor_before)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91197
Approved by: https://github.com/razarmehr
2023-01-02 16:31:27 +00:00
34f2d3e6ae Deduplicate c10 error and PyTorchError hierarchy (#87855)
Fixes #53370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87855
Approved by: https://github.com/albanD
2023-01-02 15:53:36 +00:00
2b52db9c95 [xla hash update] update the pinned xla hash (#91087)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91087
Approved by: https://github.com/malfet
2023-01-02 11:12:19 +00:00
39d49dbe45 Revert "[cuBLAS] Fix default cuBLAS workspace size and parsing for multiple workspaces (#89027)"
This reverts commit b407d98dbe1dda696d993150a89e4e46aa658168.

Reverted https://github.com/pytorch/pytorch/pull/89027 on behalf of https://github.com/kit1980 due to Fails test_cublas_workspace_explicit_allocation on ROCm
2022-12-31 23:04:57 +00:00
77c2a8a11f Clang-Tidy: Improve ctors by removing unnecessary copies and initializations (#91538)
Apply clang-tidy fixups to prefer member initializers and modernize-pass-by-value. This is mostly a noop, but it should make a few ctors slightly more readable and more efficient. Also drops in some missing moves that prevent a lot of unnecessary copying.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91538
Approved by: https://github.com/ezyang
2022-12-31 07:19:30 +00:00
eqy
b407d98dbe [cuBLAS] Fix default cuBLAS workspace size and parsing for multiple workspaces (#89027)
Follow-up of #86167 ; The number of pools was mistakenly ignored and the default workspace size appears to be too small to match selected cuBLAS kernels before the explicit allocation change.

CC @ptrblck @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89027
Approved by: https://github.com/ngimel
2022-12-31 06:58:04 +00:00
f613633124 Remove _ignored_param_names (#91530)
'_ignored_param_names' is only used in 'param_hook' during state_dict() post-hook processing to check whether a parameter key needs to be cloned or not. But it is not needed, as the state_dict() post hook only passes FSDP-managed parameter keys to 'param_hook', see https://github.com/pytorch/pytorch/blob/master/torch/distributed/fsdp/_state_dict_utils.py#L203. That means the passed parameter keys are never part of '_ignored_param_names'.

So we should be able to safely remove '_ignored_param_names' and related code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91530
Approved by: https://github.com/rohan-varma
2022-12-31 03:28:22 +00:00
6cef59487a [BE] Move internal only non-globbed lists to OSS (#91513)
Summary:
Should prevent internal only fixes that were required for https://github.com/pytorch/pytorch/pull/91104
Just moves the list to `build_variables.bzl` and makes it a sublist of aten_cpu_source_non_codegen_list

Test Plan: CI

Differential Revision: D42281502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91513
Approved by: https://github.com/kit1980, https://github.com/atalman
2022-12-31 00:02:43 +00:00
73436af43f [cuDNN][cuDNN V8 API] Improve hot path heuristics performance in V8 (#90811)
Small optimization for the hot path when thrashing the cache with dynamic shapes; in most cases we don't need the fallback generator so we can omit it unless needed later.

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90811
Approved by: https://github.com/ngimel
2022-12-30 23:39:49 +00:00
bc92444b34 Rename torchtriton (#91539)
to `pytorch-triton`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91539
Approved by: https://github.com/seemethere, https://github.com/soumith
2022-12-30 22:49:17 +00:00
62713636d8 Bump protobuf from 3.20.1 to 3.20.2 in /.github/requirements (#91540)
Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 3.20.1 to 3.20.2.
Release notes (sourced from [protobuf's releases](https://github.com/protocolbuffers/protobuf/releases)) for Protocol Buffers v3.20.2, C++:
- Reduce memory consumption of MessageSet parsing
- This release addresses a [Security Advisory for C++ and Python users](https://github.com/protocolbuffers/protobuf/security/advisories/GHSA-8gq9-2x98-w8hf)

Commits:
- a20c65f2cd Updating changelog
- c49fe79af9 Updating version.json and repo version numbers to: 20.2
- 806d7e4ce6 Merge pull request #10544 from deannagarcia/3.20.x
- ae718b3902 Add missing includes
- b4c395aaed Apply patch
- 6439c5c013 Merge pull request #10531 from protocolbuffers/deannagarcia-patch-7
- 22c79e6e4c Update version.json
- c1a2d2ec29 Fix python release on macos (#10512)
- a826282e15 Merge pull request #10505 from deannagarcia/3.20.x
- 7639a710e1 Add version file
- Additional commits viewable in the [compare view](https://github.com/protocolbuffers/protobuf/compare/v3.20.1...v3.20.2)

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=protobuf&package-manager=pip&previous-version=3.20.1&new-version=3.20.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91540
Approved by: https://github.com/huydhn
2022-12-30 20:31:27 +00:00
fdbbd20f32 Cache conda and pip for IOS CI (#91359)
Fixes T137630520

Add caching for conda and pip dependencies in the iOS CI workflow.

- Conda and pip dependencies have been moved from [_ios-build-test.yml](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_ios-build-test.yml) to dedicated requirements files
- Miniconda shell installation has been replaced by `setup-miniconda@main` which supports caching
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91359
Approved by: https://github.com/malfet, https://github.com/huydhn
2022-12-30 17:52:20 +00:00
af589b3d1f switch causal mask for is_causal flag (#91171)
Summary: switch causal mask for is_causal flag

Test Plan: sandcastle & github

Differential Revision: D42089340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91171
Approved by: https://github.com/wushirong, https://github.com/drisspg
2022-12-30 17:24:58 +00:00
cyy
9710ac6531 Some CMake and CUDA cleanup given recent update to C++17 (#90599)
The main changes are:
1. Remove outdated checks for old compiler versions because they can't support C++17.
2. Remove outdated CMake checks because it now requires 3.18.
3. Remove outdated CUDA checks because we are moving to CUDA 11.

Almost all changes are in CMake files for easy auditing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90599
Approved by: https://github.com/soumith
2022-12-30 11:19:26 +00:00
d5163f5206 Fix NumPy broadcasting in lstsq_backward (#91460)
Fixes https://github.com/pytorch/pytorch/issues/77225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91460
Approved by: https://github.com/albanD
2022-12-30 10:49:20 +00:00
051d16a2f7 Fix NumPy-compat broadcasting in the derivative of linalg.solve (#91456)
Fixes https://github.com/pytorch/pytorch/issues/89761

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91456
Approved by: https://github.com/albanD
2022-12-30 10:49:20 +00:00
484dd40022 Implement PReLU in a compositional way (#91238)
The PReLU implementation was all over the place. This led to a number
of bugs like https://github.com/pytorch/pytorch/issues/68760.  We fix it by:
- Keeping the weird broadcasting logic it has as a CompositeImplicit kernel that calls into a second kernel
- This second kernel is just a good-ol' pointwise kernel.
- We implement the derivative for the pointwise kernel via TI as well for speed.
- We implement the second derivative for the pointwise kernel and the forward AD derivatives compositionally

This fixes a number of issues:
- We don't perform copies any more when the inputs are not contiguous
- The derivatives are now correct
- We fix vmap and many other functorch-related issues.
- CPU and CUDA now share the relevant broadcasting logic
- The implementation is about 1/3 the length.

Fixes https://github.com/pytorch/pytorch/issues/68760
Fixes https://github.com/pytorch/pytorch/issues/89895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91238
Approved by: https://github.com/kshitij12345, https://github.com/jbschlosser, https://github.com/albanD
2022-12-30 10:42:30 +00:00
0e8565d1d5 [FSDP][optim_state_dict][8/N] Enable fully_shard optim state_dict save and load (#91234)
**What does this PR do?**
This PR refactor `_optim_utils.py` to use `_FSDPState` instead of `FullyShardedDataParallel` class. This change enables the support of optim state_dict for `fully_shard`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91234
Approved by: https://github.com/rohan-varma
2022-12-30 06:56:44 +00:00
f8740db410 Properly resolve source_ref when constructing shape guards (#91058)
Whenever you guard on something, you're supposed to tell GuardBuilder about it, so GuardBuilder knows that it has to actually bind it in scope when it creates the guard function. But shape env guards bypass that mechanism completely. Well, now they don't.

For the most part, this didn't matter in practice, because we usually had a `TENSOR_MATCH` guard floating around that made sure that the guard stayed live. But if we ever eliminate those guards (e.g., because we build it into the shape guard directly; something we'll probably want to do when https://github.com/pytorch/pytorch/pull/89707 goes online) then this will indeed matter.

One complication: some of the shape env guards are on globals. You have to make sure to shunt the usage to the correct guard builder in that case. Maybe it would be better if we refactored things so there is only one GuardBuilder. Not sure.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91058
Approved by: https://github.com/voznesenskym
2022-12-30 05:56:56 +00:00
bcf15cd93b Store source, not sname, in Symbol (#91057)
I'm going to need this in the follow up PR. Instead of storing only Source.name() in Symbol, I now store a full on Source. Lots of replumbing reoccurs. In particular:

- Move Source to torch._guards to break cycles
- I have to add TensorPropertySource and NegateSource to handle x.size()[0] and -x codegen that I was doing with string manipulation previously
- I tighten up invariants so that I never pass source=None; instead I pass ConstantSource (these are constant sources right) and test for that rather than source being missing. I think this is more parsimonious
- Some mypy wobbles from new imports

I didn't move LocalSource and friends to torch._guards, but I ended up needing to access them in a few places. The main annoyance with moving these is that then I also need to move the bytecode codegen stuff, and that's not so easy to move without bringing in the kitchen sink.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91057
Approved by: https://github.com/albanD, https://github.com/voznesenskym, https://github.com/zou3519
2022-12-30 05:56:56 +00:00
2edf589e66 [Profiler] Fix SOFT_ASSERT test to not raise on debug builds (#91464)
Summary: There was a patch to not raise SOFT_ASSERT in debug builds. Update this test to match it.

Test Plan: This test passes after this patch.

Differential Revision: D42270123

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91464
Approved by: https://github.com/robieta
2022-12-30 05:31:03 +00:00
eqy
946e57704e Drop compute capability < 5.0 in CUDA 12 (#91213)
CC @ptrblck @crcrpar

#91122
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91213
Approved by: https://github.com/ngimel
2022-12-30 04:53:05 +00:00
31e66ca4ef [torch.func] Add docs (#91319)
Docs copy-pasted from functorch docs with minor adjustments. We are
keeping the functorch docs for BC, though that's up for debate -- we
could also just say "see .. in torch.func" for some, but not all doc
pages (we still want to keep around any examples that use
make_functional so that users can tell what the difference between that
and the new functional_call is).
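For readers unfamiliar with the distinction, a tiny illustrative sketch of the new-style call (not copied from the docs):

```python
import torch
from torch.func import functional_call

model = torch.nn.Linear(3, 1)
params = dict(model.named_parameters())

x = torch.randn(2, 3)
# Run the module with an explicitly supplied parameter dict, the replacement
# for the older make_functional workflow mentioned above.
out = functional_call(model, params, (x,))
```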

Test Plan:
- docs preview
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91319
Approved by: https://github.com/samdow
2022-12-30 02:51:18 +00:00
6f034dc0b0 (non-batch) BSR/BSC to COO performance improvement. (#91389)
This PR improves the aforementioned conversions by reducing memory footprint and the number of kernels run, and also by removing the sync imposed by `at::where(condition)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91389
Approved by: https://github.com/pearu, https://github.com/kit1980
2022-12-30 00:04:50 +00:00
b1bdec83c9 Clang-Tidy: Prevent implicit promotion in math functions (#91450)
Ensure that accidental type promotions in math functions do not occur.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91450
Approved by: https://github.com/ezyang
2022-12-29 23:44:17 +00:00
1c3bb2fdb0 Chore: fix clang warning - mismatched tags (#91455)
Fixes a clang warning about mismatched tags when building PyTorch. Seems like an easy fix / oversight.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91455
Approved by: https://github.com/ezyang
2022-12-29 23:43:50 +00:00
a34a9c3471 Perf: Apply more clang-tidy fixups to torch headers (#91445)
Applies some more fixes to headers that may have been missed before for performance optimization. cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @EikanWang @ezyang since this is more in the series of the clang-tidy fixups.

This PR fixes 3 main issues:
1. Use emplacement more in headers
2. Avoid unnecessary copies and use const ref when possible
3. Default any special functions when possible to make them potentially trivial and more readable.

There is also one change in this PR that tries to prevent unnecessary math promotion; the rest of those changes are in another PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91445
Approved by: https://github.com/ezyang
2022-12-29 23:43:45 +00:00
553b592824 Clang-Tidy: use modern for each loops and transparent functors (#91449)
This applies some more clang-tidy fixups. Particularly, this applies the modernize loops and modernize-use-transparent-functors checks. Transparent functors are less error prone since you don't have to worry about accidentally specifying the wrong type and are newly available as of C++17.

Modern foreach loops tend to be more readable and can be more efficient to iterate over since the loop condition is removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91449
Approved by: https://github.com/ezyang
2022-12-29 23:37:51 +00:00
b8ba4802fe Add an option to skip loading of debug traces (#91430)
Summary:
Debug traces consume lots of memory, especially for small models.

Test Plan:
Unit test


Pull Request resolved: https://github.com/pytorch/pytorch/pull/91430
Approved by: https://github.com/davidberard98
2022-12-29 22:53:17 +00:00
6ec3d65b0c Automated submodule update: FBGEMM (#90489)
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 81ba6c51ec

Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90489
Approved by: https://github.com/malfet
2022-12-29 21:25:25 +00:00
3ac6106523 Add out of bounds checks inside irparser.cpp and unpickler.cpp (#91401)
Hi!

I've been fuzzing different pytorch modules, and found a few crashes.

Inside unpickler.cpp/irparser.cpp there are a few places where `.at()` and `.pop_back()` are called before checking the target container size. The lack of these checks results in an attempt to access elements out of bounds (in the case of `.at()`), and an actual out-of-bounds access while calling `.pop_back()`/`.pop()` on the `stack_` variable.

Crash-files:

1. Crash location: `unpickler.cpp:439` (Call to `.at(idx)` with idx that exceeds `memo_table_` size).
    - Reproduce the crash: `/message_deserialize_fuzz /homedir/crash-5695ad5b2921127775d4137ee02e23834a0bedc4`
    - Crash file: [crash-5695ad5b2921127775d4137ee02e23834a0bedc4.zip](https://github.com/pytorch/pytorch/files/10308463/crash-5695ad5b2921127775d4137ee02e23834a0bedc4.zip)
    - ASAN report: [asan-report-crash-5695ad5b2921127775d4137ee02e23834a0bedc4.log](https://github.com/pytorch/pytorch/files/10308612/asan-report-crash-5695ad5b2921127775d4137ee02e23834a0bedc4.log)

2. Crash location: `irparser.cpp:504` (Call to `.at(idx)` with idx that exceeds `schema->returns()` size).
    - Reproduce the crash: `/irparser_fuzz /homedir/crash-779ecab3d637c8c87de21e23dddb9def82a26792`
    - Crash file: [crash-779ecab3d637c8c87de21e23dddb9def82a26792.zip](https://github.com/pytorch/pytorch/files/10308475/crash-779ecab3d637c8c87de21e23dddb9def82a26792.zip)
    - ASAN report: [asan-report-crash-779ecab3d637c8c87de21e23dddb9def82a26792.log](https://github.com/pytorch/pytorch/files/10308611/asan-report-crash-779ecab3d637c8c87de21e23dddb9def82a26792.log)

3. Crash location: `unpickler.cpp:451` (Call to `.pop_back()` with empty `stack_`).
    - Reproduce the crash: `/message_deserialize_fuzz /homedir/crash-735acc19c9f39b9bbb5667878af995c9167da37f`
    - Crash file: [crash-735acc19c9f39b9bbb5667878af995c9167da37f.zip](https://github.com/pytorch/pytorch/files/10308565/crash-735acc19c9f39b9bbb5667878af995c9167da37f.zip)
    - ASAN report: [asan-report-crash-735acc19c9f39b9bbb5667878af995c9167da37f.log](https://github.com/pytorch/pytorch/files/10308558/asan-report-crash-735acc19c9f39b9bbb5667878af995c9167da37f.log)

4. Crash location: `unpickler.cpp:469` (Call to `.pop()` with empty `stack_`).
    - Reproduce the crash: `/message_deserialize_fuzz /homedir/crash-b552f1a2bbba5eab0f6aeba58475175b18e5b1b9`
    - Crash file: [crash-b552f1a2bbba5eab0f6aeba58475175b18e5b1b9.zip](https://github.com/pytorch/pytorch/files/10308568/crash-b552f1a2bbba5eab0f6aeba58475175b18e5b1b9.zip)
    - ASAN report: [asan-report-crash-b552f1a2bbba5eab0f6aeba58475175b18e5b1b9.log](https://github.com/pytorch/pytorch/files/10308555/asan-report-crash-b552f1a2bbba5eab0f6aeba58475175b18e5b1b9.log)

The provided patch adds missing size checks.

### How to reproduce

1. To reproduce the crashes, use the provided docker image: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/blob/master/projects/pytorch/Dockerfile)

2. Build the container: `docker build -t oss-sydr-fuzz-pytorch-reproduce .`

3. Copy the crash file to the current directory

4. Run the container: ``docker run --privileged --network host -v `pwd`:/homedir --rm -it oss-sydr-fuzz-pytorch-reproduce /bin/bash``

5. Execute the fuzz targets with the given arguments

After execution completes you will see ASAN reports.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91401
Approved by: https://github.com/davidberard98
2022-12-29 19:58:29 +00:00
0417da2288 Set a timeout value when testing multiprocess DataLoader (#91476)
Set a timeout value when testing multiprocess DataLoader to prevent ASAN jobs from timing out after 4 hours.

We are seeing multiple timeout issues running ASAN tests on HUD https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=asan, for example:

* Without mem leak check enabled https://github.com/pytorch/pytorch/actions/runs/3794216079/jobs/6455118197
* With mem leak check https://github.com/pytorch/pytorch/actions/runs/3792743994/jobs/6449356306

Looking a bit closer into the test, the hang happens when a multiprocess DataLoader is used in `test_utils`.  Here is a snapshot of those processes after logging into the hung runner:

```
UID        PID  PPID  C STIME TTY          TIME CMD
jenkins      1     0  0 Dec28 pts/0    00:00:00 bash
jenkins      8     0  0 Dec28 pts/1    00:00:00 sh -c pip install dist/torch-2.0.0a0+git97db9fd-cp37-cp37m-linux_x86_64.whl[opt-einsum] && .jenkins/pytorch/test.sh
jenkins     20     8  0 Dec28 pts/1    00:00:00 /bin/bash .jenkins/pytorch/test.sh
jenkins    764    20  0 Dec28 pts/1    00:00:07 python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard 5 5 --verbose
jenkins    788   764  0 Dec28 pts/1    00:00:00 /opt/conda/bin/python -c from multiprocessing.semaphore_tracker import main;main(6)
jenkins   3743   764  0 Dec28 pts/1    00:00:05 /opt/conda/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=7, pipe_handle=11) --multiprocessing-fork
jenkins   3766  3743  0 Dec28 pts/1    00:00:06 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins   3878  3766  0 Dec28 pts/1    00:00:06 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins   3879  3766  0 Dec28 pts/1    00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins   3880  3766  0 Dec28 pts/1    00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins   3881  3766  0 Dec28 pts/1    00:00:00 /opt/conda/bin/python -bb test_utils.py -v --import-slow-tests --import-disabled-tests
jenkins   3893     0  0 01:45 pts/2    00:00:00 /bin/bash
jenkins   3904  3893  0 01:46 pts/2    00:00:00 ps -ef
```

The specific hanging test was `test_random_seed` which spawned 4 subprocesses to load data.  After I killed one of them, the test could continue and printed the following stacktrace:

```
    test_random_seed (__main__.TestDataLoaderUtils) ... [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  ERROR (9345.840s)
    test_random_seed (__main__.TestDataLoaderUtils) ...     test_random_seed errored - num_retries_left: 3
  Traceback (most recent call last):
    File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
      data = self._data_queue.get(timeout=timeout)
    File "/opt/conda/lib/python3.7/multiprocessing/queues.py", line 104, in get
      if not self._poll(timeout):
    File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 257, in poll
      return self._poll(timeout)
    File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
      r = wait([self], timeout)
    File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 921, in wait
      ready = selector.select(timeout)
    File "/opt/conda/lib/python3.7/selectors.py", line 415, in select
      fd_event_list = self._selector.poll(timeout)
    File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
      _error_if_any_worker_fails()
  RuntimeError: DataLoader worker (pid 3878) is killed by signal: Terminated.
  The above exception was the direct cause of the following exception:
  Traceback (most recent call last):
    File "test_utils.py", line 469, in test_random_seed
      x2 = run()
    File "test_utils.py", line 464, in run
      return next(iter(dataloader))
    File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 635, in __next__
      data = self._next_data()
    File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1330, in _next_data
      idx, data = self._get_data()
    File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1296, in _get_data
      success, data = self._try_get_data()
    File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1147, in _try_get_data
      raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
  RuntimeError: DataLoader worker (pid(s) 3878) exited unexpectedly
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  [W ParallelNative.cpp:230] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
  ok (0.137s)
```

This doesn't fix the underlying issue, which I'll need to follow up on to see why the workers hang.  However, it should allow the test to terminate gracefully and report errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91476
Approved by: https://github.com/kit1980
2022-12-29 17:50:37 +00:00
bc764f453d Fix sharded_tensor test_sharded_tensor_to_cpu (#91453)
Fixes https://github.com/pytorch/pytorch/issues/91381

The assert needs to be updated in the test. Run `ciflow/periodic` to run the multigpu tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91453
Approved by: https://github.com/clee2000
2022-12-29 13:21:30 +00:00
5030929c5d add channels last with mixed data type support for GroupNorm backward (#89485)
### Motivation
1. Add channels last support for GroupNorm backward to make sure GroupNorm fully supports channels last.
2. Same as #88663, mixed data type support is also needed for channels last implementation of GroupNorm backward.

### Testing
Single socket (28cores):

* Contiguous:

shape | forward, fp32 / s | forward, mixed fp32 bf16 / s | backward, fp32 / s | backward, mixed fp32 bf16 / s
-- | -- | -- | -- | --
[10, 128, 20, 20] | 3.20E-05 | 3.60E-05 | 8.31E-05 | 8.13E-05
[10, 128, 50, 50] | 0.000126 | 0.000115 | 0.000356 | 0.000257

* Channels Last:

shape | forward, fp32 / s | forward, mixed fp32 bf16 / s | backward, fp32 / s | backward, mixed fp32 bf16 / s
-- | -- | -- | -- | --
[10, 128, 20, 20] | 4.11E-05 | 4.12E-05 | 9.74E-05 | 9.66E-05
[10, 128, 50, 50] | 0.000179 | 0.000178 | 0.000393 | 0.000317

Single core:

* Contiguous:

shape | forward, fp32 / s | forward, mixed fp32 bf16 / s | backward, fp32 / s | backward, mixed fp32 bf16 / s
-- | -- | -- | -- | --
[10, 128, 20, 20] | 2.47E-04 | 2.53E-04 | 5.92E-04 | 4.50E-04
[10, 128, 50, 50] | 0.001559 | 0.001384 | 0.004343 | 0.002436

* Channels Last:

shape | forward, fp32 / s | forward, mixed fp32 bf16 / s | backward, fp32 / s | backward, mixed fp32 bf16 / s
-- | -- | -- | -- | --
[10, 128, 20, 20] | 2.27E-04 | 3.24E-04 | 0.0006224 | 0.000459
[10, 128, 50, 50] | 0.00167 | 0.001278 | 0.0041858 | 0.003027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89485
Approved by: https://github.com/jgong5, https://github.com/malfet
2022-12-29 07:19:39 +00:00
ad782ff7df Enable xdoctest runner in CI for real this time (#83816)
Builds on #83317 and enables running the doctests. Just need to figure out what is causing the failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83816
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-29 05:32:42 +00:00
eqy
fb4fc0dabe [CUDA] Bump version requirement for CUDA Graphs debug dump function (#91429)
#91417

CC @ptrblck @vors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91429
Approved by: https://github.com/ngimel
2022-12-29 03:44:42 +00:00
9b144ddbe4 Make input casting in root module only in default (#91365)
Make input casting happen only in the root module by default, while allowing different mixed precision settings for different submodules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91365
Approved by: https://github.com/awgu
2022-12-29 03:20:32 +00:00
3d8834bdbf SymIntify F.interpolate() with recompute_scale_factor=True (#91318)
This PR makes the minor changes necessary to get `F.interpolate()` working with symbolic shapes when `recompute_scale_factor=True` + adds `OpInfo` samples to test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91318
Approved by: https://github.com/ezyang
2022-12-29 01:42:56 +00:00
dbd0d76515 Disable test_fs family for dynamo (#91459)
This should help address https://github.com/pytorch/pytorch/issues/67002.  At the end of these tests, any temp files `/dev/shm/torch_*` are cleaned up, but somehow this might take longer than 0.5s, causing the test to fail.  So the PR increases the max waiting time to 5s while polling for the result every 0.5s as before.

### Testing
`pytest test_multiprocessing.py -k test_fs --verbose --flake-finder` to run `test_fs`, `test_fs_is_shared`, `test_fs_pool`, `test_fs_preserve_sharing`, and `test_fs_sharing` 50 times on a dynamo shard.  All passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91459
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/atalman
2022-12-29 00:26:57 +00:00
f012d0ea5b [autograd.Function] enable the extended Function feature flag by default (#91441)
The autograd.Function <> functorch interaction is in a mostly completed
state now. There are some minor action items remaining
(https://github.com/pytorch/pytorch/issues/90224), but I want to enable
the feature by default so that PyTorch CI / other parties / etc can
begin testing to see if there is any impact on the original
autograd.Function API (there shouldn't be).

The longer-term plan for the feature flag is:
- keep it around until at least the next release (so that people can
turn off the feature if it breaks something in existing code)
- delete the flag then (either before or after the release, I haven't
decided yet)

Test Plan:
- new test
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91441
Approved by: https://github.com/albanD, https://github.com/soulitzer
2022-12-28 21:00:27 +00:00
ae52750d91 Reduce hook registration code duplication (#91418)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91418
Approved by: https://github.com/albanD
2022-12-28 20:52:04 +00:00
8191c49f82 Update links in writing_batching_rules.md (#91354)
Update links to reflect the code migration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91354
Approved by: https://github.com/zou3519
2022-12-28 19:50:34 +00:00
08a47549af Rename Tensor._storage to Tensor.untyped_storage and update docs (#91414)
Fixes #89224
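A quick sketch of the renamed accessor:

```python
import torch

t = torch.arange(4, dtype=torch.float32)
s = t.untyped_storage()   # previously the private Tensor._storage()
print(s.nbytes())         # 16 (4 float32 elements)
```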

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91414
Approved by: https://github.com/ezyang
2022-12-28 19:21:34 +00:00
5b223c43ec Avoid calling allclose in the backward if there are tensor subclasses (#91444)
`allclose` is data-dependent (returns a bool), so it does not play well
with functorch. We are skipping that check in the context of subclasses
to avoid hard errors.

Partially fixes https://github.com/pytorch/pytorch/issues/90499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91444
Approved by: https://github.com/albanD
2022-12-28 19:12:50 +00:00
4444138fae Add backward for complex numbers for diagonal_scatter (#91443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91443
Approved by: https://github.com/soulitzer
2022-12-28 19:12:50 +00:00
f969834f68 [functorch] vmap: nansum & nanmean (#91372)
Fixes #91174
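A small illustrative sketch of the newly supported batching:

```python
import torch
from torch.func import vmap

x = torch.tensor([[1.0, float("nan"), 2.0],
                  [float("nan"), 3.0, 4.0]])

print(vmap(torch.nansum)(x))   # tensor([3., 7.])
print(vmap(torch.nanmean)(x))  # tensor([1.5000, 3.5000])
```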

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91372
Approved by: https://github.com/zou3519
2022-12-28 18:49:49 +00:00
d7674e70f4 Fix for tryrebase after PR was merged (#91337)
Rebasing certain merged PRs results in the rebased branch pointing at the target branch because git believes the PR has already been included in the branch.  Git does not replay the changes onto the target branch because the change is already there.

This usually affects PRs with only 1 commit (more commits -> trymerge squashes them when merged -> git believes that the change is not in the target branch because the squashed commit is different from the individual changes).

It might also affect ghstack changes because behind the scenes the ghstack PRs are all contained within one commit on the orig branch, but I'm not sure about this.

helps w/ https://github.com/pytorch/test-infra/issues/836
looks like https://github.com/clee2000/random-testing/pull/44#issuecomment-1363439534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91337
Approved by: https://github.com/ZainRizvi
2022-12-28 18:44:08 +00:00
cc11edb084 [aot_autograd] symintify logsumexp (#91442)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91442
Approved by: https://github.com/albanD
2022-12-28 18:06:26 +00:00
f5e20d6060 Make the state dict of CyclicLR scheduler pickleable (#91400)
Fixes #90414

This PR drops the unpicklable `weakref.WeakMethod` object from the CyclicLR scheduler's state dict and re-initializes it once the state dict gets loaded. This makes the state picklable so you can include it in your checkpoint. Also fixes https://github.com/Lightning-AI/lightning/issues/15901

A simple test was added that calls `pickle.dumps` on the state.
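A minimal sketch of what this enables, assuming typical CyclicLR arguments (illustrative values, not from the PR):

```python
import pickle
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.CyclicLR(opt, base_lr=0.01, max_lr=0.1)

# The state dict no longer carries the weakref-based scale function,
# so it can be pickled as part of a checkpoint.
blob = pickle.dumps(sched.state_dict())

# Loading re-initializes the dropped weakref internally.
restored = torch.optim.lr_scheduler.CyclicLR(opt, base_lr=0.01, max_lr=0.1)
restored.load_state_dict(pickle.loads(blob))
```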

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91400
Approved by: https://github.com/albanD
2022-12-28 18:05:24 +00:00
896aa72359 check_forward_backward_compatibility C10D APIs (#91409)
Remove APIs from check since they aren't being updated anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91409
Approved by: https://github.com/awgu
2022-12-28 17:37:12 +00:00
8b55b86dbd Move sym_int and sym_float alongside SymInt / SymFloat in base torch package (#91317)
This PR moves the definitions for:
* `sym_int`
* `sym_ceil` (used only for `sym_int`)
* `sym_floor` (used only for `sym_int`)
* `sym_float`

from `torch/fx/experimental/symbolic_shapes.py` to `torch/__init__.py`, where `SymInt` and `SymFloat` are already defined.

This removes the need for several in-line imports, and enables proper JIT script gating for #91318. I'm very open to doing this in a better way!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91317
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2022-12-28 16:08:16 +00:00
1c40ec46ff Decomps and meta registrations for upsample_nearest 1D / 2D / 3D (#91260)
Adds decompositions and meta registrations for the 1D, 2D, and 3D implementations of `upsample_nearest`. All related OpInfo-based tests for AOTAutograd now pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91260
Approved by: https://github.com/ezyang
2022-12-28 16:03:25 +00:00
f1d8fef4d4 Softmax added to tensor, torch and docs (#91292)
Fixes #91107

Added `softmax` docs in

- `pytorch/torch/_tensor_docs.py`
- `pytorch/torch/_torch_docs.py `
- `pytorch/docs/XXX.rst` files. Here XXX represents all those files where I made the change

Although I have added `softmax` to the `docs` directory, I was not sure which files/folders required the edits, so there could be issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91292
Approved by: https://github.com/lezcano
2022-12-28 15:06:24 +00:00
af7132302a Revert "Softmax added to tensor, torch and docs (#91292)"
This reverts commit f8b28799f8432ab8de6c960eef4d530f45af1a5b.

Reverted https://github.com/pytorch/pytorch/pull/91292 on behalf of https://github.com/weiwangmeta due to breaking internal distributed testing builds
2022-12-28 14:30:46 +00:00
3066edbc60 [Inductor] fix undefined MockHandler use (#91434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91434
Approved by: https://github.com/soumith
2022-12-28 14:10:23 +00:00
9f91e94080 Workaround for NumPy builds that ship with a broken Dlpack deleter (#89759)
NumPy versions 1.22 and 1.23 (and their respective bugfix releases included) have a buggy implementation of the Dlpack deleter that doesn't account for no-GIL contexts. Since we now release the GIL when deallocating tensors in `THPVariable_clear`, this leads to a failure of internal consistency checks when freeing a Dlpack-backed tensor from NumPy.

This PR adds a check for the buggy NumPy versions and overrides the `DLManagedTensor` deleter to reacquire the GIL before deallocation.

### Rationale for this implementation
The version check was added to `tensor_numpy.h/cpp` as it seemed like a more logical location for it than creating a new translation unit. The overriding of the deleter was originally attempted by directly modifying `at::fromDlpack`, but the lack of a build dependency on the Python C API in ATen prevented that. So, I extended the ATen DLPack API instead to additionally accept a custom deleter functor.

Fixes #88082
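A hedged sketch of the failure mode this guards against, assuming a NumPy build new enough to expose `__dlpack__` (names and sizes are illustrative):

```python
import numpy as np
import torch

a = np.arange(1024.0)
t = torch.from_dlpack(a)  # tensor backed by NumPy's buffer via DLPack

# On NumPy 1.22/1.23 the DLPack deleter assumed the GIL was held; with the GIL
# released during tensor deallocation, the patched deleter reacquires it first.
del t
```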

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89759
Approved by: https://github.com/albanD
2022-12-28 13:23:29 +00:00
41a0318f2d Remove overload at::frobenius_norm(const Tensor&) (#81762)
This function is an auxiliary function for `torch.norm`. This particular
overload was not even used or tested. I hope it's not used internally
either. If it is, we can simply drop this PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81762
Approved by: https://github.com/ngimel
2022-12-28 13:12:01 +00:00
274d3b24c3 use scatter_add for index_add when dim is the most inner dim (#88729)
### Motivation
When dim is -1 and the slice of source or result is noncontiguous, the original `index_add` is slow because it uses `add` on the sliced tensor, which is serial over the index and parallel over the sliced tensor to avoid write conflicts. Parallelizing over the sliced tensor is not optimal because the sliced tensor may not be big enough to parallelize, and it also incurs multiple parallel regions.

`scatter_add` is used to speed up this case because `scatter_add` parallelizes over the outer dimensions of the input and is serial over the inner dimension to avoid write conflicts. `scatter_add` needs only one parallel region, and the outer dimensions are larger, so parallelization pays off.
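For illustration, a hedged sketch of the call pattern that now takes the scatter_add-based path on CPU (shapes chosen to match the benchmark below, not part of the original PR):

```python
import torch

x = torch.zeros(10, 128, 50, 50)
index = torch.randint(0, 50, (50,))
src = torch.randn(10, 128, 50, 50)

# dim=-1 is the innermost dimension: previously each indexed slice was added
# serially; the new path uses a single scatter_add-style parallel region.
out = x.index_add(-1, index, src)
```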

### Testing

- Single core:

Before:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 2.82E-03 | 2.11E-03
[10, 128, 50, 50] | 0.023604 | 0.023794

After:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 9.30E-04 | 1.66E-03
[10, 128, 50, 50] | 0.005995 | 0.010003

- Single socket (28 cores):

Before:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 2.96E-03 | 2.52E-03
[10, 128, 50, 50] | 0.012208 | 0.012568

After:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 7.44E-05 | 1.33E-04
[10, 128, 50, 50] | 0.000333 | 0.000469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88729
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2022-12-28 12:04:17 +00:00
700941f683 Fixup c10 headers with clang-tidy (#91407)
Clang-tidy was not applied properly to headers in c10, as documented in #91406. These are the easy automated fixes that came out of applying clang-tidy to the c10 part of the code base. cc @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91407
Approved by: https://github.com/ezyang
2022-12-28 11:12:22 +00:00
c470ad4f4a Add missing overload for ivalue toSym(Int|Float) (#91405)
Noticed that the toSymFloat / toSymInt overloads always copied the internal pointer of an IValue even if it was an rvalue, unlike other overloads (like toTensor). This fixes that issue by adding the appropriate methods needed to facilitate that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91405
Approved by: https://github.com/ezyang
2022-12-28 11:07:37 +00:00
b416d50502 [inductor] Fix "RuntimeError: Tried to erase Node permute but it still had 3 users in the graph" (#91327)
Summary: Fix "RuntimeError: Tried to erase Node permute but it still had 3 users in the graph" to unblock internal models

Differential Revision: D42213859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91327
Approved by: https://github.com/ngimel, https://github.com/anijain2305, https://github.com/jianyuh
2022-12-28 10:21:49 +00:00
22a718b40b [LTC] Restore LazyTensor() = delete (#91426)
Summary:
XLA's LTC migration is completed. Let's restore some hacks.

Test Plan:
CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91426
Approved by: https://github.com/JackCaoG
2022-12-28 09:21:55 +00:00
3fdbf824ae [functorch] jacrev: chunk_size=1 without vmap (#91326)
As discussed at https://github.com/pytorch/pytorch/pull/91157#discussion_r1053679272
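A hedged usage sketch of the `chunk_size=1` path (the function and shapes are made up):

```python
import torch
from functorch import jacrev

def f(x):
    return x.sin().sum(dim=-1)

x = torch.randn(8, 3)
# chunk_size=1 computes the Jacobian one row at a time without vmap,
# trading speed for lower peak memory.
jac = jacrev(f, chunk_size=1)(x)
```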

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91326
Approved by: https://github.com/zou3519
2022-12-28 04:56:25 +00:00
878719a2db initialise the members boolean_ and integer_ of at::indexing::TensorIndex (#91399)
Initialise the members `boolean_` and `integer_` of `at::indexing::TensorIndex` to `false` and `0` respectively, because the compiler-generated copy constructor reads them, which is undefined behaviour. This resolves a compile-time warning, a runtime error from UBSan + GCC, and a runtime error from MSVC when compiling in debug mode.

Fixes #90951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91399
Approved by: https://github.com/bdhirsh
2022-12-28 04:23:32 +00:00
1b2ee4d0e1 Update functorch supported autograd.Function to allow mark_dirty (#91222)
Fixes https://github.com/pytorch/pytorch/issues/90225
Uses what was originally in 32a57bcdb6

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91222
Approved by: https://github.com/zou3519
2022-12-28 03:53:47 +00:00
ca39c5b04e Fix conda install on distributions with strict POSIX sh (#91371)
See also https://github.com/conda/conda/issues/10431

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91371
Approved by: https://github.com/albanD
2022-12-28 00:25:03 +00:00
2e79d46708 Revise error reporting when TorchInductor cannot access /tmp folder (#91385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91385
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/anijain2305
2022-12-28 00:23:44 +00:00
0b709b4816 [FSDP][Easy] Fix context manager syntax (#91410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91410
Approved by: https://github.com/kit1980
2022-12-28 00:17:55 +00:00
e8393131ee [generate_vmap_rule] support for jvp (#91211)
Support for jvp is very similar to support for backward():
- We need to vmap over a version of the original autograd.Function's jvp
method that does not take ctx as input.
- On the output, we need to reductify to ensure the output tangent has
the same shape as the output. This reductify does not have the
extra reduction semantics, because PyTorch forward-mode AD requires the
output tangent to have the same exact shape as the output.
- setup_context needs to tell us the bdims of the saved_tensors
(necessary for vmap over jvp_no_context), as well
as the output shapes (necessary for reductify).

Test Plan:
- Added jvp support to the *GenVmapAutogradFunction
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91211
Approved by: https://github.com/soulitzer
2022-12-27 23:25:59 +00:00
48e63bf69f [functorch] composition of three transform tests with jvp (#91206)
This PR adds the following tests. They will be useful as test cases for
generate_vmap_rule=True and jvp (to come soon)
- test_jvpvmap
- test_jvpvmapvmap
- test_vmapjvpvmap
- test_jvpjvpvmap
- test_jvpvjpvmap
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91206
Approved by: https://github.com/soulitzer
2022-12-27 23:25:59 +00:00
1768a28a20 COO @ COO: fix to always produce coalesced outputs. (#91094)
Fixes [#90516](https://github.com/pytorch/pytorch/issues/90516)
Fixes [#90538](https://github.com/pytorch/pytorch/issues/90538)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91094
Approved by: https://github.com/pearu
2022-12-27 21:32:14 +00:00
67c53d50e5 Revert "Fix conda install on distributions with strict POSIX sh (#91371)"
This reverts commit 57dcd93c4103c6db043f341a0242596a42188081.

Reverted https://github.com/pytorch/pytorch/pull/91371 on behalf of https://github.com/kit1980 due to trunk / cuda11.6-py3.10-gcc7-sm86 / test (slow, 1, 2, linux.g5.4xlarge.nvidia.gpu) started to fail after this PR with mypy error
2022-12-27 19:51:59 +00:00
81b3df4fb0 Fix dtype mismatch for unallocated storage deserialization (#91285)
Fixes #90497

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91285
Approved by: https://github.com/ezyang
2022-12-27 19:31:09 +00:00
93a810b045 Add dim checks for internal embedding_bag functions (#85433)
Fixes #85213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85433
Approved by: https://github.com/malfet
2022-12-27 19:27:33 +00:00
467d269ad1 Minor fix in package exporter (#90306)
Summary:
As title.
Saw this while working on another diff.
`storage` won't be defined in the `else` case. But this causes pyre to freak out.

Test Plan: Unit tests.

Differential Revision: D41751229

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90306
Approved by: https://github.com/PaliC
2022-12-27 18:01:59 +00:00
06bdd491fb [vmap] fix reduction boxed batching rules (#91109)
Fixes https://github.com/pytorch/pytorch/issues/91041

There's a bug in our boxed reduction batching rules for a very specific
case: vmap over a Tensor of shape [1] for an operation where the
output rank is supposed to be less than the input rank, e.g.

```
x = torch.tensor([10.], device=device)
y = vmap(lambda x: x.sum(0))(x)
```

The boxed reduction batching rule handles three types of "reduction"
operations:
- reduction operations with an optional keepdim argument, which
specifies if the output should have the same or smaller rank than the
input
- reduction operations without a keepdim arg that morally have keepdim=True (like cumsum --
which never actually modifies the rank of the tensor but is still a
"reduction" since it sums a bunch of things together)
- reduction operations without a keepdim arg that morally have
keepdim=False. (just torch.count_nonzero).

Furthermore, PyTorch has special handling for scalar tensors (e.g.
tensors of shape []). It is valid to do
`torch.sum(torch.tensor(10.), dim=0)`.

This PR updates the `boxed_reduction_batch_rule` to handle the
interaction between the three kinds of reduction and the scalar tensor
cases correctly. Concretely, it:
- introduces additional templates to `boxed_reduction_batch_rule` for
what type of "keepdim" reduction this is.
- splits the old REDUCTION_BOXED macro (which was a good default) into
REDUCTION_NO_KEEPDIM_ARG and REDUCTION_WITH_KEEPDIM_ARG (which are also
opinionated defaults) and uses them.

Test Plan:
- Given an input of shape [], our vmap OpInfo test suite only produces
a Tensor of shape [B] with B = 2. At first glance this doesn't look
sufficient to test this case (vmap over Tensor[1]), but the claim is
that it is because the boxed_reduction_batch_rule is agnostic to the shape
of the dimension being vmapped over. Previously it was not, due to
the semantics of `squeeze`; this PR adds internal asserts to make it agnostic.
- there is a light test for vmap over the Tensor of shape [1] for
torch.sum as a sanity check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91109
Approved by: https://github.com/samdow
2022-12-27 14:40:15 +00:00
255d14947d Fix resource consumption in reductions (#89144)
Reductions along a (large enough) contiguous dimension vectorise the
loading of the inputs. This vectorisation was not taken into account
when computing the necessary resources for the kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89144
Approved by: https://github.com/zasdfgbnm, https://github.com/ngimel
2022-12-27 12:02:14 +00:00
1c681f4bd8 Fix distutils.LooseVersion DeprecationWarning (#88524)
Fixes #84712
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88524
Approved by: https://github.com/MaKaNu, https://github.com/milutter, https://github.com/soumith
2022-12-27 11:46:00 +00:00
97db9fde69 Fix header-filter for clang-tidy c10 and apply some fixes to c10 and … (#91178)
…c10d

Fixes a broken header filter from #90699 and applies a few more relevant clang-tidy fixes to c10 and c10d. The header filter pattern was actually broken and the clang-tidy include pattern was redundant. Also fixed a few bugs in torch/distributed/c10d.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91178
Approved by: https://github.com/ezyang
2022-12-27 07:34:12 +00:00
bb24185ff4 Fix _check_no_differentiable_outputs for forward ad (#91391)
This `is_forward_ad` isn't propagated, which leads to this line creating a
slow-gradcheck failure on master:
```
    if not is_forward_ad and any(o.is_complex() for o in outputs):
        raise ValueError("Expected output to be non-complex. get_numerical_jacobian no "
                         "longer supports functions that return complex outputs.")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91391
Approved by: https://github.com/albanD
2022-12-27 03:52:05 +00:00
a061f139dc [optim] Adam defaults to fused when CUDA + differentiable=False (#90865)
Step 1 in faster default optimizers.

Preliminary benchmarks show gaps in improvement on CUDA for BERT_pytorch and resnet18:
![image](https://user-images.githubusercontent.com/31798555/207707118-14221802-77ce-4ee0-96e3-04638c07924c.png)
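A hedged sketch of what the new default means for a typical CUDA setup (the explicit `fused=True` form is unchanged; sizes are illustrative):

```python
import torch

model = torch.nn.Linear(1024, 1024, device="cuda")

# With this change, CUDA parameters + differentiable=False (the default) are
# expected to pick the fused Adam implementation automatically.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Forcing the fused path explicitly still works as before.
opt_fused = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)
```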

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90865
Approved by: https://github.com/albanD
2022-12-27 01:28:47 +00:00
0b255b3f80 Better __repr__ for ModuleList (#90452)
## Problem
When models have a lot of complex repeated layers, `print(module)` output becomes infeasible to work with. For example, the current output of `__repr__` for `t5-small` is `715` lines long.

## Solution
Using the better `__repr__` it becomes `135`. For `t5-large`, the current `__repr__` prints `1411` lines; the better `__repr__` prints `135`, the same number as for t5-small, because most of the layers are just repeated. For `EleutherAI/gpt-j-6B` the number of lines drops from `483` to just `24`.

Here's how it works: when ModuleList items have exactly the same `__repr__`, instead of printing each of them it prints `N x {repr(item)}`. The current code supports cases where the same ModuleList has multiple repeating groups, which is especially useful when the first/last layer of a block differs from the rest of them.

The better `__repr__` should make model printouts smaller, more readable, and significantly more useful by highlighting the differences between repeated blocks instead of losing them in a wall of text.

## Motivating real-life example.

You can try it out in this [colab notebook](https://colab.research.google.com/drive/1PscpX_K1UemIDotl2raC4QMy_pTqDq7p?usp=sharing).

The current `__repr__` output of gpt-j-6b is too big to add to this PR description:
```
GPTJModel(
  (wte): Embedding(50400, 4096)
  (drop): Dropout(p=0.0, inplace=False)
  (h): ModuleList(
    (0): GPTJBlock(
      (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
      (attn): GPTJAttention(
        (attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_dropout): Dropout(p=0.0, inplace=False)
        (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
      )
      (mlp): GPTJMLP(
        (fc_in): Linear(in_features=4096, out_features=16384, bias=True)
        (fc_out): Linear(in_features=16384, out_features=4096, bias=True)
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (1): GPTJBlock(
      (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
      (attn): GPTJAttention(
        (attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_dropout): Dropout(p=0.0, inplace=False)
        (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
      )
      (mlp): GPTJMLP(
        (fc_in): Linear(in_features=4096, out_features=16384, bias=True)
        (fc_out): Linear(in_features=16384, out_features=4096, bias=True)
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (2): GPTJBlock(
...
```

Better `__repr__` output looks like this:
```
GPTJModel(
  (wte): Embedding(50400, 4096)
  (drop): Dropout(p=0.0, inplace=False)
  (h): ModuleList(
    (0-27): 28 x GPTJBlock(
      (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
      (attn): GPTJAttention(
        (attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_dropout): Dropout(p=0.0, inplace=False)
        (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
        (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
      )
      (mlp): GPTJMLP(
        (fc_in): Linear(in_features=4096, out_features=16384, bias=True)
        (fc_out): Linear(in_features=16384, out_features=4096, bias=True)
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.0, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
)
```
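For reference, a small hedged example of the kind of module that triggers the collapsed printout (module sizes are made up):

```python
import torch.nn as nn

layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(12)]
)

# Identical consecutive children are now collapsed into a single
# "(0-11): 12 x Sequential(...)" entry instead of being printed 12 times.
print(layers)
```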

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90452
Approved by: https://github.com/albanD
2022-12-26 17:05:14 +00:00
57dcd93c41 Fix conda install on distributions with strict POSIX sh (#91371)
See also https://github.com/conda/conda/issues/10431

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91371
Approved by: https://github.com/albanD
2022-12-26 02:39:08 +00:00
3f4e87beaf Populate the eviction_policy field for load/store properly (#91316)
This helps with kernels that make use of caching like mid-range softmax
which reads the data three times.

Selecting `eviction_policy=evict_first` in the last loop of the softmax
operation seems to give a 7-10% speed-up vs. selecting `evict_last` which
was the previous option. I'll put up some benchmarks soon™.
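A hedged Triton sketch of the kind of load the generated kernels now emit; the kernel itself is illustrative and not taken from inductor's codegen:

```python
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    # evict_first hints that these values will not be reused after this pass,
    # freeing cache lines for data that is still needed (e.g. the softmax sums).
    x = tl.load(x_ptr + offs, mask=mask, eviction_policy="evict_first")
    tl.store(out_ptr + offs, x * 2.0, mask=mask)
```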

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91316
Approved by: https://github.com/ngimel
2022-12-26 00:50:05 +00:00
772684c9ce Do not generate default value when it's zero (#91315)
This is more of a cosmetic change than anything really.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91315
Approved by: https://github.com/ngimel
2022-12-26 00:50:05 +00:00
f8b28799f8 Softmax added to tensor, torch and docs (#91292)
Fixes #91107

Added `softmax` docs in

- `pytorch/torch/_tensor_docs.py`
- `pytorch/torch/_torch_docs.py `
- `pytorch/docs/XXX.rst` files. Here XXX represents all those files where I made the change

Although I have added `softmax` to the `docs` directory, I was not sure which files/folders required the edits, so there could be issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91292
Approved by: https://github.com/lezcano
2022-12-25 12:59:45 +00:00
789b1437e9 Fix meta registration for aten._cudnn_rnn (#91333)
Found this issue from [weekly running 7k github models](https://github.com/pytorch/torchdynamo/issues/1884). It caused a regression in the pass rate; 25 models failed due to this issue.
The reason is that the argument `cx` of `aten._cudnn_rnn` can be `None`, but this case is not handled in the meta registration, so it throws the following error:
```
Traceback (most recent call last):
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 1059, in run_node
    return nnmodule(*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/nn/modules/module.py", line 1482, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/nn/modules/rnn.py", line 477, in forward
    result = _VF.rnn_tanh(input, hx, self._flat_weights, self.bias, self.num_layers,
  File "/scratch/ybliang/work/repos/pytorch/torch/_subclasses/fake_tensor.py", line 916, in __torch_dispatch__
    r = func(*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_ops.py", line 284, in __call__
    return self._op(*args, **kwargs or {})
  File "/scratch/ybliang/work/repos/pytorch/torch/_meta_registrations.py", line 2108, in _cudnn_rnn
    cy = cx.new_empty(0 if cx is None else cell_shape)
AttributeError: 'NoneType' object has no attribute 'new_empty'
```
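A hedged sketch of the guard the meta registration needs; `make_cy` and `like` are hypothetical names used only for illustration, not the code from the PR:

```python
import torch

def make_cy(cx, cell_shape, like):
    # Only derive cy's shape from cx when cx is actually provided; otherwise
    # fall back to an empty tensor allocated like another known tensor.
    if cx is None:
        return like.new_empty(0)
    return cx.new_empty(cell_shape)
```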

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91333
Approved by: https://github.com/ezyang
2022-12-23 22:59:31 +00:00
df46ba4026 Use python 3.9 for iOS build and test (#91366)
Since yesterday, Miniconda3-latest-MacOSX-x86_64.sh has changed to python 3.10 as the default, and it breaks iOS workflow:

* Breaking with python 3.10 https://github.com/pytorch/pytorch/actions/runs/3763269382/jobs/6396697341
* Working with python 3.9 https://github.com/pytorch/pytorch/actions/runs/3761903011/jobs/6394085845

Fun fact, both examples above come from the same commit f471770fd4 (one was in periodic, the other was in trunk)

Miniconda3-py39_4.12.0-MacOSX-x86_64.sh is the same miniconda installation that we use in https://github.com/pytorch/test-infra/tree/main/.github/actions/setup-miniconda

Note: @remidomingues is trying to add cache support for iOS in https://github.com/pytorch/pytorch/pull/91359.  The PR is still under review, but once it is merged, this issue won't happen again.  So this is a temporary fix to keep trunk green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91366
Approved by: https://github.com/atalman
2022-12-23 22:08:25 +00:00
a188e6ddc0 Fix typo in troubleshooting.rst (#91301)
enviornment -> environment

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91301
Approved by: https://github.com/msaroufim
2022-12-23 21:39:38 +00:00
5725a44080 Remove Windows compilation dependencies installation from CI/CD scripts (#89909)
They should already be installed in the runner VM.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89909
Approved by: https://github.com/huydhn
2022-12-23 17:40:19 +00:00
bdbf188c80 [MPS] Exclude int64 dtype from reduction ops (#91272)
Reduction ops don't support the int64 data type. This PR adds an assert when int64 is used for min/max reduction ops.
All other integer dtypes are cast to int32.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91272
Approved by: https://github.com/razarmehr, https://github.com/malfet
2022-12-23 17:30:42 +00:00
0745242ca5 Fix wrong committer when rebase and merge (#91330)
When used in the context of the merge workflow, the committer's name and email have already been set as part of the workflow, e.g. https://github.com/pytorch/pytorch/actions/runs/3754075933/jobs/6377965897:

```
git config --global user.email "pytorchmergebot@users.noreply.github.com"
git config --global user.name "PyTorch MergeBot"
```

Trying to overwrite this in tryrebase's ghstack logic would lead to the wrong committer showing up.  The fix checks whether the email and name have already been set so that the code doesn't overwrite them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91330
Approved by: https://github.com/kit1980, https://github.com/clee2000, https://github.com/malfet
2022-12-23 17:22:49 +00:00
69cca4f3ae Update xla base tag v06 (#90939)
We have installed the new `sympy` package requirement in the XLA CI base image. Bumping up the version tag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90939
Approved by: https://github.com/seemethere, https://github.com/malfet
2022-12-23 17:17:43 +00:00
6485d2609a [MPS] Fix data type issues in Binary Ops (#91151)
- Cast to unsigned type when comparing signed vs. unsigned integers
- Refactor and cleanup logaddexp() ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91151
Approved by: https://github.com/malfet
2022-12-23 17:11:55 +00:00
d08e3d2304 [Composable API] Apply ufmt to _composable and the corresponding test folders (#91255)
This PR applies ufmt to format the `_composable`-related code. This is a request from https://github.com/pytorch/pytorch/pull/91234 to separate the formatting changes into a new PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91255
Approved by: https://github.com/awgu
2022-12-23 16:08:27 +00:00
99aec69f58 [BE] remove Backend.TCP (#91314)
Remove Backend.TCP which is unused. Fixes a task in https://github.com/pytorch/pytorch/issues/90544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91314
Approved by: https://github.com/awgu
2022-12-23 15:48:29 +00:00
f62a3cabfc [ROCm] enable CI after host upgrades to ROCm 5.3 and ubuntu 22.04 (#91339)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91339
Approved by: https://github.com/kit1980
2022-12-23 05:41:43 +00:00
f471770fd4 Add bad status management for get_workflow_job_id (#91145)
To help resolve issues like:
```
++ python3 .github/scripts/get_workflow_job_id.py 3736406815 i-08b8fd3e605729ed9
+ GHA_WORKFLOW_JOB_ID=
Warning: Attempt 2 failed. Reason: Child_process exited with error code 1
```

This should only happen when github actions is experiencing degraded service

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91145
Approved by: https://github.com/malfet
2022-12-22 23:33:43 +00:00
94a6d72032 Update doc of clip grad (#91312)
Replaces #85772 that has a broken internal state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91312
Approved by: https://github.com/soulitzer
2022-12-22 22:34:32 +00:00
76a3869fc6 Support functionalization on torch.cond (#89966)
This PR adds a functionalization path for torch.cond. As this is a first pass, we only functionalize very restrictive use cases; we explicitly disallow the following (a minimal usage sketch follows the list):

- Output of each branch aliasing input
- In-place mutation on inputs given to each branch
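A minimal hedged usage sketch of the supported (alias-free, mutation-free) case, assuming the experimental functorch control-flow entry point available at the time:

```python
import torch
from functorch.experimental.control_flow import cond

def true_fn(x):
    return x.cos()

def false_fn(x):
    return x.sin()

x = torch.randn(4)
# Both branches are pure out-of-place functions: no aliasing of the input in
# the outputs and no in-place mutation, so functionalization can handle them.
y = cond(x.sum() > 0, true_fn, false_fn, [x])
```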

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89966
Approved by: https://github.com/zou3519
2022-12-22 22:01:47 +00:00
d1123c94a7 [pytorch] Update troubleshooting_url (#91298)
Summary: Update to the new troubleshooting_url. The old one does not exist.

Test Plan: None

Differential Revision: D42205626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91298
Approved by: https://github.com/jianyuh
2022-12-22 21:29:54 +00:00
4477a5b691 [MPS] Register unfold key for MPS (#91266)
Register the unfold key for MPS (uses the generic implementation that already exists).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91266
Approved by: https://github.com/razarmehr
2022-12-22 21:21:04 +00:00
e8e3980e65 [Checkpoint] Update DCP init to include DefaultSavePlanner/DefaultLoadPlanner (#91269)
Adding the two APIs to dcp package `__init__.py`, as users are recommended to extend DefaultSavePlanner/DefaultLoadPlanner instead of the planner interface directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91269
Approved by: https://github.com/fduwjj
2022-12-22 21:05:11 +00:00
0149467677 [Checkpoint] Update docstring for DCP `save_state_dict and load_state_dict` (#91209)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91209
Approved by: https://github.com/fduwjj
2022-12-22 20:49:42 +00:00
8b3d31cfc5 Add A ValueRange Analysis Pass to convert int64 indexing to int32 (#91028)
Builds up sympy expressions computing the lower and upper bounds of ranges, and then finds `op.to_dtype(x, torch.int64)` nodes whose dominated values can all be computed in a lower precision. I haven't gotten this all the way to working with dynamic shapes, but it should be a fairly small change. There's still additional work to get torchinductor to work with large tensors (see https://github.com/pytorch/torchdynamo/issues/1819) because we would need to add explicit int64 dtype annotations, which we're not doing right now.

Fix for https://github.com/pytorch/torchdynamo/issues/1293.
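A toy hedged sketch of the kind of bound reasoning involved (this is not inductor's implementation; the bounds and expression are made up):

```python
import sympy

# Given integer index variables with known ranges, bound the indexing
# expression and check whether every value fits in int32.
x = sympy.Symbol("x", integer=True, nonnegative=True)
expr = 512 * x + 37

lo = expr.subs(x, 0)       # lower bound of the range of x
hi = expr.subs(x, 4095)    # assumed upper bound of x
fits_int32 = -(2**31) <= int(lo) and int(hi) <= 2**31 - 1
```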

OpBench performance for aten.upsample_bilinear2d.vec float32 (25th / 50th / 75th percentile):
Before:
[0.7521964636710751, 0.8645357996607477, 2.8746003906598494]
After:
[0.9511363478204263, 1.0295566597806718, 3.2662165264101755]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91028
Approved by: https://github.com/jansel
2022-12-22 20:04:26 +00:00
b95e1d76a8 [CUDA12] Conditionally set device in autograd engine (#91191)
CUDA 12 introduces behavioral changes in `cudaSetDevice`. In the old version it would just set the device to be used for kernel launches and memory allocations without creating a CUDA context. Now, in CUDA 12, the first time `cudaSetDevice` is called for a given device it creates a CUDA context. See issue #91122.

The autograd engine iterates over all devices and sets them:
f8b348c1fc/torch/csrc/autograd/engine.cpp (L1399-L1402)
f8b348c1fc/torch/csrc/autograd/engine.cpp (L349)

Which causes pollution of CUDA contexts on sibling devices.
This PR introduces a workaround for this issue by conditionally setting the device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91191
Approved by: https://github.com/ngimel
2022-12-22 19:54:45 +00:00
4437d0d161 [functorch] vmap: chunk_size support (#91157)
Ref: https://github.com/pytorch/functorch/issues/680

We introduce a kwarg `chunk_size` in vmap.

Also, we leverage most of the code from `chunk_vmap` (except for chunking the input based on `chunk_size`)

Benchmarks from https://github.com/pytorch/functorch/pull/774 apply.
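A hedged usage sketch of the new kwarg (the function and sizes are made up):

```python
import torch
from functorch import vmap

def f(x):
    return (x ** 2).sum(-1)

x = torch.randn(10_000, 128)
# chunk_size bounds how many batch elements are vmapped at once,
# lowering peak memory at the cost of running the kernel in several chunks.
out = vmap(f, chunk_size=1024)(x)
```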
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91157
Approved by: https://github.com/zou3519
2022-12-22 19:45:45 +00:00
c47bdd7522 *_scatter ops should preserve input stride/storage_offset (#91029)
It turns out that we *do* need to update *_scatter ops to return the exact same strides as their inputs. I added a test to `test/test_functionalization.py`, which now trips thanks to Ed's functionalization stride debugging check. It only actually ends up causing a silent correctness problem if you try to .backward() on that function.
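For illustration, a hedged example of the op family in question with a non-contiguous base (shapes are made up; only the shape is asserted here):

```python
import torch

base = torch.randn(4, 6).t()   # transposed, so non-default strides
src = torch.randn(6, 2)
out = torch.slice_scatter(base, src, dim=1, start=0, end=2)

# With this change, `out` is expected to carry base's strides/storage_offset
# rather than fresh contiguous ones.
assert out.shape == base.shape
```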

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91029
Approved by: https://github.com/ezyang
2022-12-22 19:41:53 +00:00
a32916190d buck-related minifier work (#91215)
Summary: Extending the minifier to generate buck target

Test Plan: N/A

Differential Revision: D42173893

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91215
Approved by: https://github.com/bertmaher, https://github.com/ngimel
2022-12-22 19:33:50 +00:00
d397f414bd [BE] Reformat ReduceOps (#91221)
- Use curly braces even after single-line `if`
- Use whitespace between `if` and the condition
- Use `c10::irange`
- Use `c10::multiply_integers` instead of an explicit for loop over the elements of `IntArrayRef`
- Do not pass `num_input_dims` to `set_apparent_shapes`, as it is always equal to the length of the `input_shape` array

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91221
Approved by: https://github.com/kit1980, https://github.com/huydhn
2022-12-22 19:29:05 +00:00
c7302075f3 Fix passing frame to callback (#91170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91170
Approved by: https://github.com/ezyang
2022-12-22 19:05:18 +00:00
eadd557266 Revert "use scatter_add for index_add when dim is the most inner dim (#88729)"
This reverts commit 68e9da68cbeb1288b904022d237c32e88e0372fd.

Reverted https://github.com/pytorch/pytorch/pull/88729 on behalf of https://github.com/atalman due to Break internal build
2022-12-22 18:06:45 +00:00
bacd2ced4f [CUDA12] Clean up deprecated APIs (#91050)
See #91122
Summary:
Some APIs are deprecated in newer version of CUDA.
* cudaGraphInstantiate:
From:
```
cudaGraphInstantiate ( cudaGraphExec_t* pGraphExec, cudaGraph_t graph, cudaGraphNode_t* pErrorNode, char* pLogBuffer, size_t bufferSize )
```
To
```
__host__​cudaError_t cudaGraphInstantiate ( cudaGraphExec_t* pGraphExec, cudaGraph_t graph, unsigned long long flags = 0 )
```
* cudaProfilerInitialize: deprecated in cuda 11 and removed in cuda 12

Test Plan: GH CI

Differential Revision: D41469051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91050
Approved by: https://github.com/jianyuh
2022-12-22 17:51:01 +00:00
e40e4d36c9 Fix test_profiler_seq_nr flakiness (on macos) (#91019)
Fixes https://github.com/pytorch/pytorch/issues/66893

On MacOS, two `aten::sum` calls are reported sometimes where there should be only one.  This can be easily reproduced by running `pytest test_autograd.py -k test_profiler_seq_nr --verbose  --flake-finder` to see the flakiness.  The profile result when the test fails is as follows (sorted by CPU):

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                            aten::randn        16.67%       3.000us        27.78%       5.000us       2.500us             2
                                              aten::sum        16.67%       3.000us        27.78%       5.000us       2.500us             2
                                          aten::normal_        11.11%       2.000us        11.11%       2.000us       1.000us             2
                                              aten::add        11.11%       2.000us        11.11%       2.000us       2.000us             1
autograd::engine::evaluate_function: torch::autograd...        11.11%       2.000us        27.78%       5.000us       2.500us             2
                        torch::autograd::AccumulateGrad        11.11%       2.000us        16.67%       3.000us       1.500us             2
                                        aten::ones_like         5.56%       1.000us         5.56%       1.000us       1.000us             1
      autograd::engine::evaluate_function: SumBackward0         5.56%       1.000us        11.11%       2.000us       2.000us             1
                                           aten::expand         5.56%       1.000us         5.56%       1.000us       1.000us             1
                                            aten::copy_         5.56%       1.000us         5.56%       1.000us       0.500us             2
                                            aten::empty         0.00%       0.000us         0.00%       0.000us       0.000us             2
                                       aten::as_strided         0.00%       0.000us         0.00%       0.000us       0.000us             2
                                            aten::fill_         0.00%       0.000us         0.00%       0.000us       0.000us             2
                                       aten::empty_like         0.00%       0.000us         0.00%       0.000us       0.000us             1
                                    aten::empty_strided         0.00%       0.000us         0.00%       0.000us       0.000us             3
                                           SumBackward0         0.00%       0.000us         5.56%       1.000us       1.000us             1
      autograd::engine::evaluate_function: AddBackward0         0.00%       0.000us         0.00%       0.000us       0.000us             1
                                           AddBackward0         0.00%       0.000us         0.00%       0.000us       0.000us             1
                                aten::new_empty_strided         0.00%       0.000us         0.00%       0.000us       0.000us             2
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 18.000us
```

When it happens, the two `aten::sum` calls have different inputs:

```
                                              aten::sum         4.35%       1.000us        13.04%       3.000us       3.000us             1                          [[10, 10], []]
                                              aten::sum         8.70%       2.000us         8.70%       2.000us       2.000us             1                  [[10, 10], [], [], []]
```

I'm not sure what the internal difference is between `z.sum()` and `z.sum(dim=None)` here on macOS; I thought they were the same.

### Testing

`pytest test_autograd.py -k test_profiler_seq_nr --verbose  --flake-finder` to run the test 50 times, all pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91019
Approved by: https://github.com/malfet
2022-12-22 17:37:45 +00:00
07c61685c8 [inductor] CI improvments (#91283)
Summary:
1) Setting torch.backends.cudnn.deterministic to True helps to
eliminate the eager_variance failures seen on CI
2) Skip Triton failure instead of retry
3) Some minor script cleanup is also included in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91283
Approved by: https://github.com/anijain2305
2022-12-22 15:37:43 +00:00
55749b9c41 [dynamo] Write full code of how to enable output_code (#91230)
Ref https://github.com/pytorch/pytorch/pull/91223
Since it was trickier than I've expected

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91230
Approved by: https://github.com/soumith
2022-12-22 14:09:06 +00:00
4c5928e387 Fix for mul(compressed, wrapped scalar) (#91239)
Fixes https://github.com/pytorch/pytorch/issues/90819.

The path with `Scalar` should have been picked up by the dispatcher, but still the path with a 0-dim wrapped scalar was broken.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91239
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2022-12-22 13:11:13 +00:00
bc843682dd [inductor] New approach for computing triton load/store masks (#91241)
This PR is a new version of #89566, fixing a test failure.

Couldn't get ghstack to cooperate on updating that PR after re-opening, so I started a new one.

This changes the way masks for loads/stores are computed in triton backend of inductor.

New approach is to iterate over all variables used in indexing expression and add the corresponding mask variables to the set that will be used. For indexing variables like `x0`, `y1` and  `r3` it adds `xmask`, `ymask` and `rmask` respectively.
For indexing variables like `tmp5` (i.e., indirect indexing), it uses the new `mask_vars` attribute of the corresponding `TritonCSEVariable` object, which is populated when variable is created.

I started working on this with the aim of fixing https://github.com/pytorch/torchdynamo/issues/1654, which meanwhile was fixed by #89524 with a different approach, making this change less necessary. However note that #89524 fixes the issue by broadcasting the indices that are being loaded to a larger size, while this approach fixes it by making the mask have only the necessary terms.

Relative to #89566, the only change is to not include the mask variables
of arguments when the function being called is `tl.where`. The reason
being that `tl.where` is often used precisely to make sure the output
variable has valid values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91241
Approved by: https://github.com/ngimel
2022-12-22 11:54:48 +00:00
fd3a7264ae [MPS] Add group_norm[fwd+backward] and mean_var (take 2) (#91190)
Use Prims to implement group_norm, group_norm_backward and mean_var

Use `torch._ops.ops` instead of `torch.ops` in numerous subpackages in order to make them importable from `torch/backends/mps/__init__.py`, as the `torch.ops` alias is defined in
15af4b1cee/torch/__init__.py (L1095)
which is executed last during the init process.

Add `__all__` to `torch/backends/mps/__init__.py` as well as alias all imports as private

Add `TestNNMPS.test_group_norm_backward` that validates no NaNs are generated during the backward pass

Fixes https://github.com/pytorch/pytorch/issues/88331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91190
Approved by: https://github.com/albanD
2022-12-22 08:54:37 +00:00
9b42e4ef73 [Composable API] Make _StateKey as a str subclass (#91279)
The keys in object.__dict__ should be strings, so make _StateKey a str subclass.

Differential Revision: [D42200244](https://our.internmc.facebook.com/intern/diff/D42200244/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91279
Approved by: https://github.com/awgu, https://github.com/mrshenli
2022-12-22 06:01:06 +00:00
39306c1dfb Use @pytorch// in bazel build files (#89660)
This change aims to make the bazel build more embedding-friendly.
Namely, when PyTorch is included as an external repo in another project, it is usually included like this
```
        native.local_repository(
            name = "pytorch",
            path = ...,
            repo_mapping = repo_mapping,
        )
```
Or
```
        http_archive(
            name = "pytorch",
            urls = ...
            repo_mapping = repo_mapping,
        )
```
In this case, references to `@//` would resolve to the top-level WORKSPACE that includes PyTorch.
That makes upgrades harder because we need to carry around this patch.
Note that under some edge-case circumstances even `//` resolves to the top-level `WORKSPACE`.

This change makes the embedding of the bazel build easier without compromising anything for the main repo, since the `@pytorch//` still refers to the same thing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89660
Approved by: https://github.com/kit1980
2022-12-22 05:14:55 +00:00
6cea4f3d57 [FSDP][optim_state_dict][7/N] Make FSDP support NamedOptimizer (#91160)
**What does this PR do?**
This PR refactors the FSDP optimizer state_dict APIs to accept `NamedOptimizer` as the input optimizer. The key difference is that the state_dict returned by `NamedOptimizer` is already keyed by FQN. This PR mainly changes the internal mapping to allow the optimizer state_dict to be keyed by FQN.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91160
Approved by: https://github.com/fduwjj, https://github.com/rohan-varma
2022-12-22 04:35:26 +00:00
71318742f9 [vision hash update] update the pinned vision hash (#91284)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91284
Approved by: https://github.com/pytorchbot
2022-12-22 03:33:06 +00:00
a0554261a1 Restore RNG states for composable reentrant activation checkpointing (#91265)
This allows ops like randperm to behave the same during re-computation.

Differential Revision: [D42196758](https://our.internmc.facebook.com/intern/diff/D42196758/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91265
Approved by: https://github.com/awgu
2022-12-22 03:15:55 +00:00
8f16524598 Run test_spectral_ops serially to fix CUDA illegal memory access (#91264)
Fixes https://github.com/pytorch/pytorch/issues/88916

* Running this test sequentially is not flaky after 1000 reruns `pytest --verbose test_spectral_ops.py -k test_fft_round_trip_cuda_float32 --flake-finder --flake-runs=1000`
* On the other hand, the curious thing is that when I run this same command on an active runner with some testing processes running in the background, the reruns could fail with a CUDA illegal memory access error (hard to reproduce though) https://paste.sh/6sZdRn95#pve73riXC5XehCLqxlCbnjea.  This points to the fact that running the `test_spectral_ops` test in parallel with others might be the surface-level cause of flakiness

So this PR adds the test to the serial list instead.  This shouldn't cause any issue w.r.t TTS because the test takes only half a minute at most to finish.

```
+---------------------+-------------------------------------------------+-------------+---------------------+
| file                | base_name                                       | test_config | time                |
+---------------------+-------------------------------------------------+-------------+---------------------+
| "test_spectral_ops" | "cuda11.6-py3.10-gcc7-sm86"                     | "default"   | 5.991666666666661   |
| "test_spectral_ops" | "cuda11.6-py3.10-gcc7-sm86"                     | "slow"      | 0.18433333333333346 |
| "test_spectral_ops" | "linux-bionic-cuda11.6-py3-gcc7-slow-gradcheck" | "default"   | 9.866000000000003   |
| "test_spectral_ops" | "linux-bionic-cuda11.6-py3.10-gcc7"             | "default"   | 10.591333333333337  |
| "test_spectral_ops" | "linux-bionic-cuda11.6-py3.7-gcc7-debug"        | "default"   | 11.395000000000003  |
| "test_spectral_ops" | "linux-bionic-cuda11.7-py3.10-gcc7"             | "default"   | 9.424               |
| "test_spectral_ops" | "linux-bionic-cuda11.7-py3.7-gcc7-debug"        | "default"   | 8.889000000000003   |
| "test_spectral_ops" | "linux-bionic-py3.7-clang9"                     | "crossref"  | 6.280333333333329   |
| "test_spectral_ops" | "linux-bionic-py3.7-clang9"                     | "default"   | 12.182999999999998  |
| "test_spectral_ops" | "linux-bionic-py3.7-clang9"                     | "dynamo"    | 11.124999999999984  |
| "test_spectral_ops" | "linux-bionic-py3.7-clang9-slow"                | "slow"      | 0.1916666666666668  |
| "test_spectral_ops" | "linux-focal-py3.7-clang7-asan"                 | "default"   | 20.899666666666658  |
| "test_spectral_ops" | "linux-focal-py3.7-gcc7"                        | "default"   | 5.097999999999996   |
| "test_spectral_ops" | "linux-focal-rocm5.3-py3.8-slow"                | "slow"      | 0.23700000000000018 |
| "test_spectral_ops" | "macos-12-py3-arm64"                            | "default"   | 2.8396666666666626  |
| "test_spectral_ops" | "macos-12-py3-x86-64"                           | "default"   | 8.838999999999997   |
| "test_spectral_ops" | "parallelnative-linux-focal-py3.7-gcc7"         | "default"   | 5.016999999999998   |
| "test_spectral_ops" | "win-vs2019-cpu-py3"                            | "default"   | 8.351666666666665   |
| "test_spectral_ops" | "win-vs2019-cuda11.6-py3"                       | "default"   | 27.121666666666687  |
| "test_spectral_ops" | "win-vs2019-cuda11.7-py3"                       | "default"   | 24.567000000000025  |
+---------------------+-------------------------------------------------+-------------+---------------------+
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91264
Approved by: https://github.com/clee2000
2022-12-22 02:39:33 +00:00
365071c73c Fix non-existing parameters in docstrings in torch/distributed (#91116)
This is a continuation of https://github.com/pytorch/pytorch/pull/90505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91116
Approved by: https://github.com/huydhn
2022-12-22 02:37:31 +00:00
b50f379cec Remove inductor performance from ciflow/nightly as infra is not ready to handle these jobs… (#91271)
… yet.

https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly.yml currently shows several commits waiting for A100 runners, but the infra is not able to automatically respond to these dynamic requests. Therefore, this disables the ciflow/nightly tag and only uses scheduled and workflow_dispatch triggers.

Also remove postnightly filter as the [postnightly pull request ](https://github.com/pytorch/pytorch/pull/27167) is no longer running ci tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91271
Approved by: https://github.com/kit1980, https://github.com/izaitsevfb, https://github.com/malfet
2022-12-22 01:56:08 +00:00
68e9da68cb use scatter_add for index_add when dim is the most inner dim (#88729)
### Motivation
When dim is -1 and the slice of source or result is noncontiguous, the original `index_add` is slow because it uses `add` on the sliced tensor, which is serial over the index and parallel over the sliced tensor to avoid write conflicts. Parallelizing over the sliced tensor is not optimal because the sliced tensor may not be big enough to parallelize, and it also incurs multiple parallel regions.

`scatter_add` is used to speed up this case because `scatter_add` parallelizes over the outer dimensions of the input and is serial over the inner dimension to avoid write conflicts. `scatter_add` needs only one parallel region, and the outer dimensions are larger, so parallelization pays off.

### Testing

- Single core:

Before:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 2.82E-03 | 2.11E-03
[10, 128, 50, 50] | 0.023604 | 0.023794

After:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 9.30E-04 | 1.66E-03
[10, 128, 50, 50] | 0.005995 | 0.010003

- Single socket (28 cores):

Before:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 2.96E-03 | 2.52E-03
[10, 128, 50, 50] | 0.012208 | 0.012568

After:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 7.44E-05 | 1.33E-04
[10, 128, 50, 50] | 0.000333 | 0.000469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88729
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2022-12-22 01:13:35 +00:00
59a5be3b45 add mixed data type support for GroupNorm backward on CPU (#88663)
### Motivation
Amp provides convenience methods for mixed precision. If users use amp to run bfloat16 models, torch.autocast will keep module parameters in the accumulation dtype, which leaves gamma and beta in float while the input/output are in bfloat16. The same goes for backward: parameters are in float, and X, dX, and dY are in bfloat16.
Mixed data type support for GroupNorm backward is also needed for model training with GroupNorm.
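A hedged sketch of the mixed-dtype call pattern this targets on CPU: parameters stay in float while activations and gradients are bfloat16 (shapes match the benchmark below; illustrative only):

```python
import torch

gn = torch.nn.GroupNorm(4, 128)  # weight/bias (gamma/beta) remain float32
x = torch.randn(10, 128, 20, 20, dtype=torch.bfloat16, requires_grad=True)

y = gn(x)            # mixed fp32-parameter / bf16-activation forward
y.sum().backward()   # the backward path added by this PR
```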

### Testing

Single socket (28cores):
* Contiguous:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
  | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 3.08E-05 | 3.50E-05 | 8.06E-05 | 7.69E-05
[10, 128, 50, 50] | 0.000121 | 0.000114 | 0.000358 | 0.000203

* Channels Last (inputs and outputs will be converted to contiguous):

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
  | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 4.04E-05 | 4.41E-05 | 0.000226 | 0.000305
[10, 128, 50, 50] | 0.000169 | 0.000166 | 0.001628 | 0.001169

Single core:

* Contiguous:

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
  | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 2.38E-04 | 2.51E-04 | 5.94E-04 | 4.50E-04
[10, 128, 50, 50] | 0.00171 | 0.001395 | 0.0044455 | 0.00243

* Channels Last (inputs and outputs will be converted to contiguous):

shape | forward / s | forward / s | backward / s | backward / s
-- | -- | -- | -- | --
  | fp32 | mixed fp32 bf16 | fp32 | mixed fp32 bf16
[10, 128, 20, 20] | 2.28E-04 | 3.26E-04 | 0.0016528 | 0.003165
[10, 128, 50, 50] | 0.001788 | 0.001302 | 0.0276621 | 0.019447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88663
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/malfet
2022-12-22 01:12:42 +00:00
8e55d5831a add cu118 workflows for Windows (#91216)
CC @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91216
Approved by: https://github.com/atalman
2022-12-22 01:11:24 +00:00
014d7802c8 [MPS] Fix the error with high watermark value on x86 (#91268)
Fixes the error with high watermark value on x86 (`RuntimeError: invalid high watermark ratio 1.7`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91268
Approved by: https://github.com/razarmehr
2022-12-22 00:35:08 +00:00
85e393bade Fix for RNN/LSTM/GRU modules to work with stateless.functional_call() (#91111)
Fixes #90500

The change here checks for parameter changes at the beginning of each `forward()` call; if the parameters are found to be different tensors than last time, `self._flat_weights` is re-initialized with the new values. Thus, swapping parameter values using `stateless.functional_call()` will re-initialize `self._flat_weights` during the `forward()` call, and the provided parameters will be used for module computation as expected.

NB: There are still some changes needed for symbolic shapes to work with `nn.GRU` (will address in a follow-up PR).
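A hedged usage sketch of the pattern this fixes (module sizes and the parameter substitution are illustrative):

```python
import torch
from torch.nn.utils.stateless import functional_call

lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(2, 5, 8)

# Substitute externally provided parameters for a single call; with this fix
# the module rebuilds self._flat_weights from the substituted tensors.
params = {name: p.clone() for name, p in lstm.named_parameters()}
out, (h, c) = functional_call(lstm, params, (x,))
```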

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91111
Approved by: https://github.com/ezyang, https://github.com/albanD
2022-12-21 23:40:08 +00:00
300f777796 [Vulkan] Use EXPECT_EQ instead of ASSERT_TRUE in vulkan_api_test querypool_flushed_shader_log (#91259)
Summary:
After this change, if the querypool_flushed_shader_log test fails:
1) The test continues after the first failure and checks all three (Because ASSERT was changed to EXPECT)
2) The op names which are compared to vulkan.add, vulkan.sub, and vulkan.mul are shown (rather than not showing what the wrong op name was) (Because we use ..._EQ(a, b) instead of just checking ...(a == b))

This change makes it easier to debug future failures to querypool_flushed_shader_log (it helped me when one of my diffs broke the test)

Test Plan:
Vulkan API Test
- https://www.internalfb.com/intern/aibench/details/959371570734292

Reviewed By: SS-JIA

Differential Revision: D42186371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91259
Approved by: https://github.com/SS-JIA
2022-12-21 23:34:24 +00:00
2a23dfe8ed [quant] Support lowering for quantized embedding byte operator (#91159)
Summary: This PR adds lowering for the quantized embedding byte operator in the executorch flow

Test Plan: buck run executorch/exir/tests:quant_fusion_pass -- "executorch.exir.tests.test_quant_fusion_pass.TestQuantFusionPass.test_embedding_byte"

Reviewed By: qihqi

Differential Revision: D41673139

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91159
Approved by: https://github.com/vkuzo
2022-12-21 22:52:24 +00:00
b68fd7e319 Revert "Store source, not sname, in Symbol (#91057)"
This reverts commit 88c581be87ac59ea1251f35a57b610ae81b9362d.

Reverted https://github.com/pytorch/pytorch/pull/91057 on behalf of https://github.com/atalman due to causing internal build failures
2022-12-21 22:33:15 +00:00
6e0cd8b91e [Resubmit] Require inductor to match stride order (#91185)
Resubmitting https://github.com/pytorch/pytorch/pull/90563 because I had a commit in that stack which didn't use my CLA-approved git username.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91185
Approved by: https://github.com/desertfire, https://github.com/anijain2305
2022-12-21 21:49:37 +00:00
1ab6ac4682 [FSDP][optim_state_dict][6/N] Refactor the optim_state_dict APIs to support hooks (#90798)
**What does this PR do?**

This PR splits the FSDP optim_state_dict APIs into common implementation parts that are shared for different frontend APIs (we have many now and will consolidate them gradually). This PR also adds `_optim_state_dict_post_hook` and `_load_optim_state_dict_pre_hook` for the integration with `NamedOptimizer`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90798
Approved by: https://github.com/rohan-varma, https://github.com/awgu
2022-12-21 21:38:14 +00:00
d19988093d [autograd Function] Return input as-is if marked dirty even when requires_grad=False (#91214)
Fixes https://github.com/pytorch/pytorch/issues/90209

Somewhat related: https://github.com/pytorch/pytorch/issues/71119
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91214
Approved by: https://github.com/albanD
2022-12-21 21:20:56 +00:00
fb2e1878cb [torch.func] alias torch.func.vmap as torch.vmap (#91026)
This PR also redirects torch.vmap to torch.func.vmap instead of the old
vmap prototype.
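
A small usage sketch of the aliasing (illustrative only; it just exercises both entry points on the same function):

```python
import torch

x = torch.randn(3, 4)
y1 = torch.vmap(torch.sin)(x)        # now routed to the torch.func implementation
y2 = torch.func.vmap(torch.sin)(x)
assert torch.allclose(y1, y2)
```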

Test Plan:
- tests
- view docs preview
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91026
Approved by: https://github.com/albanD, https://github.com/samdow
2022-12-21 20:51:49 +00:00
e803d336eb Fix missing indentation in serialization.rst (#91253)
Fixes #ISSUE_NUMBER

In serialization.rst, fix class ControlFlowModule's forward(): missing indentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91253
Approved by: https://github.com/kit1980
2022-12-21 20:14:44 +00:00
48511eca82 [pruning][docs] Update README.md for structured pruning (#90403)
Summary:

I wrote a tutorial on how to use the structured pruning flow as part of BE week

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90403
Approved by: https://github.com/HDCharles
2022-12-21 20:07:06 +00:00
6a3ddd0171 Revert "Don't graph break on patched module methods or aliased methods (#91018)"
This reverts commit d6fc2d82ca616f87d9fef49e84e6d4ff6976292f.

Reverted https://github.com/pytorch/pytorch/pull/91018 on behalf of https://github.com/kit1980 due to After this PR, inductor / cuda11.6-py3.10-gcc7-sm86 / test fails every time with CUDA out of memory during OPTForCausalLM
2022-12-21 19:54:15 +00:00
81a9a0ac07 [MPS] Fix gather for uint8 dtype in index_select (#91047)
Use int8 instead of uint8 for MPS Gather/Scatter (uint8 is broken in macOS Monterey)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91047
Approved by: https://github.com/razarmehr
2022-12-21 19:48:46 +00:00
b285f1080f Fix small typo in comment (#91247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91247
Approved by: https://github.com/albanD
2022-12-21 19:45:39 +00:00
97f514f38e Fix two typos in torch.distributed.distributed_c10d.py::broadcast_object_list (#91237)
Fixes #91236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91237
Approved by: https://github.com/malfet, https://github.com/H-Huang
2022-12-21 19:45:08 +00:00
e3383d296f [optim][fix] test_fused_optimizers did not test fused before (#91228)
I realized test_fused_optimizers used a helper that was written for foreach, so we were not testing fused at all. This PR fixes that test so we actually test fused Adam.

Explicitly adding fused=False sets the stage for my later changes (but should be a no-op here).
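
A minimal sketch of the fused path the test now exercises (assumes a CUDA device, since the fused Adam implementation requires CUDA parameters):

```python
import torch

model = torch.nn.Linear(8, 8).cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)

loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()
opt.step()
opt.zero_grad()
```
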
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91228
Approved by: https://github.com/albanD, https://github.com/soulitzer
2022-12-21 19:42:24 +00:00
c7f1974cf1 Fix FastToLocals call by copy pasting (#91168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91168
Approved by: https://github.com/ezyang
2022-12-21 19:39:04 +00:00
5e77971a6e Fix all simple compilation issues in eval_frame.c (#91166)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91166
Approved by: https://github.com/ezyang
2022-12-21 19:39:04 +00:00
b7f48d71fe Upgrade lintrunner numpy to a version supported by 3.11 (#91164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91164
Approved by: https://github.com/ezyang
2022-12-21 19:39:04 +00:00
c0e7d8f84c Use python compat from python/pythoncapi_compat (#91163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91163
Approved by: https://github.com/ezyang
2022-12-21 19:39:04 +00:00
645eda0a00 Revert "[MPS] Add group_norm[fwd+backward] and mean_var (#91190)"
This reverts commit 371716eb36b7447003f1643f14ff1c5998a9302c.

Reverted https://github.com/pytorch/pytorch/pull/91190 on behalf of https://github.com/kit1980 due to Broke test_correct_module_names because of underscore _ops
2022-12-21 19:37:43 +00:00
8b617f813d [cuBLAS] Add an option to disable reduced precision reductions for BF16 GEMM (#89172)
Essentially the same change as #67946, except that the default is to disallow reduced precision reductions in `BFloat16` GEMMs (for now). If performance is severely regressed, we can change the default, but this option appears to be necessary to pass some `addmm` `BFloat16` tests on H100.
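
A sketch of toggling the option (the attribute name mirrors the existing fp16 knob; treat it as an assumption if it differs in your build):

```python
import torch

# Disallow reduced-precision reductions in BF16 GEMMs (the default per this PR).
torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False

a = torch.randn(64, 64, device="cuda", dtype=torch.bfloat16)
b = torch.randn(64, 64, device="cuda", dtype=torch.bfloat16)
c = a @ b   # the GEMM now accumulates in full precision internally
```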

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89172
Approved by: https://github.com/ngimel
2022-12-21 18:58:28 +00:00
1c7e81576a Temporarily disable ROCm periodic tests (#91256)
There is ongoing maintenance and everything fails. Prior PR #91217 did not also disable the periodic jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91256
Approved by: https://github.com/kit1980
2022-12-21 18:42:56 +00:00
eeb9154b27 [MPS] Add MPSHooks interface to enable accessing MPS functions globally (#91104)
This PR is a prerequisite to the upcoming MPSGenerator changes required for Random Ops.

Add `MPSHooksInterface.cpp` to `aten_cpu_source_non_codegen_list`

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91104
Approved by: https://github.com/kulinseth, https://github.com/malfet
2022-12-21 17:37:09 +00:00
371716eb36 [MPS] Add group_norm[fwd+backward] and mean_var (#91190)
Use Prims to implement group_norm, group_norm_backward and mean_var

Use `torch._ops.ops` instead of `torch.ops` in numerous subpackages in
order to make them importable from `torch/backend/mps/__init__.py`, as this alias, defined in
15af4b1cee/torch/__init__.py (L1095),
is executed last during the init process.

Depends on https://github.com/pytorch/pytorch/pull/91203

Fixes https://github.com/pytorch/pytorch/issues/88331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91190
Approved by: https://github.com/albanD
2022-12-21 17:33:27 +00:00
d6fc2d82ca Don't graph break on patched module methods or aliased methods (#91018)
See added tests for the cases that were fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91018
Approved by: https://github.com/Morgan77523, https://github.com/anijain2305
2022-12-21 16:29:15 +00:00
15af4b1cee Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars, each gated behind its own config, all off by default for now:
1. Dynamo: Macro level config for dynamo, AOT, inductor
2. FX: Progress bar for each pass, with their names
3. Inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos, https://github.com/malfet
2022-12-21 11:56:58 +00:00
bfdc0358dc Compile fix for Clang + libc++ (#91212)
Summary:
LLVM 15 has a compile issue with the deprecated __has_trivial_copy. Update the GCC ifdef logic to exclude Clang + libc++.

```
caffe2/c10/util/Optional.h:536:13: error: builtin __has_trivial_copy is deprecated; use __is_trivially_copyable instead [-Werror,-Wdeprecated-builtins]
            C10_IS_TRIVIALLY_COPYABLE(T) &&
            ^
caffe2/c10/macros/Macros.h:438:38: note: expanded from macro 'C10_IS_TRIVIALLY_COPYABLE'
#define C10_IS_TRIVIALLY_COPYABLE(T) __has_trivial_copy(T)
```

Test Plan: CI

Reviewed By: kit1980

Differential Revision: D42180203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91212
Approved by: https://github.com/kit1980, https://github.com/soumith
2022-12-21 11:19:58 +00:00
6d2b0cbb40 [Re-landing 86706] [JIT] Frozen Graph Linear-BatchNormNd Folding (#91020)
Re-landing #86706

This PR adds linear-batchnormNd folding for JIT frozen graphs.

**Performance benchmark**
A preliminary benchmark with a simple model of linear+bn1d tested on first socket, physical cores of skylake machine.

**FP32, JIT**
without linear-bn folding
![Screenshot (1368)](https://user-images.githubusercontent.com/93151422/195168944-cfc5b920-bc82-4be1-a221-d194c8fa6c18.png)

with linear-bn folding
![Screenshot (1367)](https://user-images.githubusercontent.com/93151422/195168926-267b0515-45a1-4f08-922d-c150845199ae.png)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91020
Approved by: https://github.com/davidberard98
2022-12-21 08:00:32 +00:00
e8bf7c21e4 Integrate apply_optim_in_backward with DDP (#89194)
Allow _apply_optim_in_backward to work with DDP.

Example:

```
dist.init_process_group("nccl", rank=rank, world_size=2)
    torch.cuda.set_device(rank)
    e = enc().cuda(rank)
    _apply_optimizer_in_backward(
        optimizer_class=torch.optim.SGD,
        params=e.parameters(),
        optimizer_kwargs={"lr": 0.03},
    )
    e = DDP(e, device_ids=[rank])
    inp = torch.randn(1, 10, device=rank)
    e(inp).sum().backward()
```

Constraints:

1. Custom communication hook not yet supported
2. _apply_optim_in_backward needs to be called _before_ wrapping model in DDP.
3. DDP will remove the gradient hooks _apply_optim_in_backward registers, so these gradient hooks will not be fired and cannot be used.
4. All DDP-managed parameters have their grads set to None by default once the optimizer is applied. There is no support for setting only some parameter grads to None; this must be done manually by the user (and DDP_OVERLAPPED_OPTIM_SET_GRADS_TO_NONE=0 needs to be set).

Differential Revision: [D41329694](https://our.internmc.facebook.com/intern/diff/D41329694/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41329694/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89194
Approved by: https://github.com/zhaojuanmao
2022-12-21 07:35:19 +00:00
8992eec781 [inductor] Update how REQUIRE_HIGHER_TOLERANCE is handled (#91227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91227
Approved by: https://github.com/kit1980
2022-12-21 05:43:39 +00:00
b7f35e4104 [MPS] Fix index_add with non-f32 inputs (#88542)
The `multiplicationWithPrimaryTensor` and/or `scatterWithDataTensor` APIs have issues handling two f16 tensor inputs, resulting in all-zero outputs. There are issues with int16 and int64 inputs as well.

This PR conditionally casts inputs to f32 if they're not and then casts the output back to the source's datatype.

Fixes #82645.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88542
Approved by: https://github.com/kulinseth
2022-12-21 05:31:03 +00:00
0476201482 Update debug option for torch._dynamo (#91223)
Seems outdated from https://www.youtube.com/watch?v=egZB5Uxki0I

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91223
Approved by: https://github.com/ngimel
2022-12-21 05:06:42 +00:00
b66862ba87 [autograd Function] Don't materialize forward grad for non-differentiable types (#91183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91183
Approved by: https://github.com/zou3519
2022-12-21 05:05:44 +00:00
88c581be87 Store source, not sname, in Symbol (#91057)
I'm going to need this in the follow-up PR. Instead of storing only Source.name() in Symbol, I now store a full-on Source. Lots of replumbing ensues. In particular:

- Move Source to torch._guards to break cycles
- I have to add TensorPropertySource and NegateSource to handle x.size()[0] and -x codegen that I was doing with string manipulation previously
- I tighten up invariants so that I never pass source=None; instead I pass ConstantSource (these are constant sources right) and test for that rather than source being missing. I think this is more parsimonious
- Some mypy wobbles from new imports

I didn't move LocalSource and friends to torch._guards, but I ended up needing to access them in a few places. The main annoyance with moving these is that then I also need to move the bytecode codegen stuff, and that's not so easy to move without bringing in the kitchen sink.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91057
Approved by: https://github.com/albanD, https://github.com/voznesenskym
2022-12-21 04:51:51 +00:00
5d37890b8e Update torchrun and TorchElastic to take optional local_addr param to allow skip local IP lookup if specified (#88922)
Summary:
Update dynamic rendezvous nodes to use the rendezvous hostname if provided.
For PR: https://github.com/pytorch/pytorch/issues/85300

Before:
For dynamic rendezvous, it always grabs the `fqdn` from the socket for each node, even if the user specified the address.
For example,
https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py#L248-L256
```
return _NodeDesc(socket.getfqdn(), os.getpid(), local_id)
```

Now:
If user specifies the hostname, each node will respect the given hostname.
For example, `socket.getfqdn(<hostname>) `

Test Plan: Unit tests.

Differential Revision: D41204028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88922
Approved by: https://github.com/d4l3k
2022-12-21 03:55:01 +00:00
57390116e0 Restructure ShapeEnv so it uses GuardBuilder.SHAPE_ENV directly (#91055)
The idea is to make ShapeEnv guards less of a one-off special snowflake, and integrate it more closely with the regular builder infrastructure. But it is not so easy: the shape env code has to live after tensor match code, because we need to know that the values in question are tensors before we start matching on them. So we introduce a new `shape_env_code` field to put the special shape env code, so we can add it to the final constructed code after tensor.

Everything else works the obvious way. There's a new ShapeEnvSource for constructing the singleton SHAPE_ENV guard that drives the shape env guard construction. I added some more docs and also made the printed code for guards include the enclosing lambda for more clarity.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91055
Approved by: https://github.com/albanD, https://github.com/voznesenskym
2022-12-21 03:50:47 +00:00
2f154f68ea [torchgen] Add CI job to make sure torchgen works for Executorch op registration (#89596)
## Job

Test running on most CI jobs.

## Test binary

* `test_main.cpp`: entry for gtest
* `test_operator_registration.cpp`: test cases for gtest

## Helper sources

* `operator_registry.h/cpp`: simple operator registry for testing purpose.
* `Evalue.h`: a boxed data type that wraps ATen types, for testing purpose.
* `selected_operators.yaml`: operators Executorch care about so far, we should cover all of them.

## Templates

* `NativeFunctions.h`: for generating headers for native functions. (not compiled in the test, since we will be using `libtorch`)
* `RegisterCodegenUnboxedKernels.cpp`: for registering boxed operators.
* `Functions.h`: for declaring operator C++ APIs. Generated `Functions.h` merely wraps `ATen/Functions.h`.

## Build files

* `CMakeLists.txt`: generate code to register ops.
* `build.sh`: driver file, to be called by CI job.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89596
Approved by: https://github.com/ezyang
2022-12-21 03:07:32 +00:00
37ea99cd25 [QNNPACK] Add more unaligned attributes (#91208)
Summary: Bypass "Runtime error: store to misaligned address [...] for type 'uint16_t' (aka 'unsigned short'), which requires 2 byte alignment" for q8conv.

Reviewed By: scramsby

Differential Revision: D42179009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91208
Approved by: https://github.com/kimishpatel
2022-12-21 03:01:11 +00:00
a274b5b99e [MPS] Fix data type issues in Unary ops (#91120)
Refactored sigmoid() and log1p()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91120
Approved by: https://github.com/DenisVieriu97, https://github.com/kulinseth
2022-12-21 02:42:59 +00:00
c8546c930f [BE] Use aten global in torch._refs (#91189)
Similar to pattern used in `torch._decomp`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91189
Approved by: https://github.com/ngimel
2022-12-21 02:28:51 +00:00
46f64117db [BE] Use aten global var (#91188)
s/torch.ops.aten/aten/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91188
Approved by: https://github.com/ngimel
2022-12-21 02:28:51 +00:00
dd735b96df [MPS] Fix torch.std/torch.var default/correction handling (#91203)
If `torch.std` or `torch.var` is invoked without any arguments, it should be assumed that `unbiased` is `True`.

Also, if the `correction` parameter is specified, it should be used in the correction computation.

Tested by adding `std` and `var` to the consistency tests
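
For reference, a small sketch of the intended semantics (values computed with the CPU implementation; the fix brings MPS in line with these defaults):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0])

torch.var(x)                  # unbiased by default (correction=1) -> 1.6667
torch.var(x, correction=0)    # population variance               -> 1.2500
torch.std(x, unbiased=False)  # equivalent to correction=0        -> 1.1180
```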

Fixes https://github.com/pytorch/pytorch/issues/91198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91203
Approved by: https://github.com/kit1980
2022-12-21 02:23:50 +00:00
e670c261c5 Decompose fill, zero, and zeros_like (#90968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90968
Approved by: https://github.com/ngimel
2022-12-21 00:59:50 +00:00
eeacb6ae04 Temporarily disable ROCm tests (#91217)
There is ongoing maintenance and everything fails.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91217
Approved by: https://github.com/atalman, https://github.com/clee2000, https://github.com/malfet
2022-12-21 00:38:34 +00:00
2f37804cae [generate_vmap_rule] Add generate_vmap_rule to autograd.Function (#90966)
Design document:
https://docs.google.com/document/d/1bIQkWXy3J35_20c_a5kchikabBW5M8_uRAhl0BIMwU4/edit

This PR adds a `generate_vmap_rule` option (default False) to autograd.Function.
By setting it to True, a user promises to us that their autograd.Function's
{forward, backward, jvp}, if defined, only uses PyTorch operations, in addition to the other
limitations of autograd.Function+functorch (such as the user not
capturing any Tensors being transformed over from outside of the
autograd.Function).

Concretely, the approach is:
- we update `custom_function_call` to accept an additional
`generate_vmap_rule` argument.
- The vmap rule for `custom_function_call` and `generate_vmap_rule=True`
is: we construct a vmapped version of the autograd.Function and dispatch
on it.
- The vmapped version of the autograd.Function can be thought of like
the following: if we have an autograd.Function Foo, then
VmappedFoo.apply(in_dims, ...) has the same semantics as
vmap(Foo.apply, in_dims...)
- VmappedFoo's forward, setup_context, and backward staticmethod are
vmapped versions of Foo's staticmethods.
- See the design doc for more motivation and explanation
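
A minimal sketch of the opt-in described above (a toy square function using the 2.0-style autograd.Function with a separate setup_context):

```python
import torch

class Square(torch.autograd.Function):
    generate_vmap_rule = True           # promise: only PyTorch ops are used inside

    @staticmethod
    def forward(x):
        return x * x

    @staticmethod
    def setup_context(ctx, inputs, output):
        (x,) = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2.0 * x * grad_out

batched = torch.randn(3, 4)
out = torch.vmap(Square.apply)(batched)  # dispatched through the generated vmap rule
```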

Test Plan:
- This PR introduces additional autograd.Function with the suffix "GenVmap" to
autograd_function_db.
- There are also some minor UX tests

Future:
- jvp support
- likely more testing to come, but please let me know if you have
cases that you want me to test here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90966
Approved by: https://github.com/soulitzer
2022-12-21 00:34:44 +00:00
2a55984139 [generate_vmap_rule] reductify_leaf helper function (#90965)
As seen in
https://docs.google.com/document/d/1bIQkWXy3J35_20c_a5kchikabBW5M8_uRAhl0BIMwU4/edit

`reductify_leaf(grad_input, ...)` is a helper function that processes a
single grad_input Tensor. The reason why we need it is:
- the grad_input has some optional bdim
- the input has some optional bdim
- if these are different, we need to coerce the grad_input into having
the same shape as the input, either by reducing or expanding the
grad_input.

Note that there is a special case in autograd that the user is allowed
to return a grad_input Tensor that is an expanded version of the
original input tensor. In this case, autograd automatically reduces
grad_input to the same shape as the input. Unfortunately this logic
doesn't work when bdims are involved, so we manually handle it in
`reductify_leaf`.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90965
Approved by: https://github.com/soulitzer
2022-12-21 00:34:44 +00:00
53c94ef1bb [generate_vmap_rule] Add mechanism to override ctx.saved_tensors (CtxWithSavedTensors) (#90964)
As seen in
https://docs.google.com/document/d/1bIQkWXy3J35_20c_a5kchikabBW5M8_uRAhl0BIMwU4/edit#heading=h.r3ckcnsh1cxt

This PR creates CtxWithSavedTensors. You can wrap a ctx object in the
backward pass of autograd.Function in CtxWithSavedTensors and specify
the saved_tensors attribute. CtxWithSavedTensor acts like the original
ctx object (all other attribute accesses are forwarded to the original ctx
object) but it has a custom saved_tensors field.

Test Plan:
- tests that you can use CtxWithSavedTensors to get a new object with
your own saved_tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90964
Approved by: https://github.com/samdow, https://github.com/soulitzer
2022-12-21 00:34:43 +00:00
31981d0139 [generate_vmap_rule] add restore_vmap helper function (#90963)
As seen in
https://docs.google.com/document/d/1bIQkWXy3J35_20c_a5kchikabBW5M8_uRAhl0BIMwU4/edit

`restore_vmap` is a private helper function. It is vmap but has the
following
differences:
- instead of returning outputs, it returns an (outputs, out_dims) tuple.
  out_dims is a pytree of the same shape as outputs and contains Optional[int]
  specifying where the vmapped dimension, if it exists, is in the
  corresponding output.
- does no validation on in_dims or inputs (vmap expects at least one
  Tensor to be vmapped).
  restore_vmap allows for no inputs to have the vmap dimension
- does no validation on outputs (vmap expects only Tensor outputs)
  restore_vmap allows for return of arbitrary outputs (not just
  Tensors)

Test Plan:
- added some simple test to test restore_vmap
- I am OK with restore_vmap not being a part of vmap right now -- the
implementation of vmap rarely changes and it is a bit difficult to
refactor vmap in a way that restore_vmap is a subroutine.

Other questions:
- Bikeshedding the `restore_vmap` name
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90963
Approved by: https://github.com/samdow, https://github.com/soulitzer
2022-12-21 00:34:41 +00:00
94262efc7d Revert "[inductor] Rewrite Triton templates + epilogue fusion (retry) (#91105)"
This reverts commit d6dd2e97da619319a103d1061290fe33ce33b6a4.

Reverted https://github.com/pytorch/pytorch/pull/91105 on behalf of https://github.com/atalman due to Broke internal builds
2022-12-21 00:02:38 +00:00
e932c3e547 Delete dead intermediary_symbols (#91070)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91070
Approved by: https://github.com/soumith
2022-12-20 23:51:44 +00:00
e5a748fef8 [Nested Tensor] do not use at::cuda::getDefaultCUDAStream(), again (#91180)
Otherwise, Nested Tensor kernels won't sync with the current stream, resulting in flaky unit tests in test_nestedtensor.py.

This is the second time the wrong streams have been used in NestedTensor code. See #84134 for another example.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91180
Approved by: https://github.com/mikaylagawarecki
2022-12-20 23:44:59 +00:00
1c46a32b67 Minor typing improvements (#91068)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91068
Approved by: https://github.com/Skylion007, https://github.com/soumith
2022-12-20 23:43:11 +00:00
7fecba7bdb Doc improvement in LKJCholesky distribution (#91091)
Better structure & formatting. Added more info to reference.

The change can be viewed here: https://docs-preview.pytorch.org/91091/distributions.html?highlight=lkjcholesky#torch.distributions.lkj_cholesky.LKJCholesky
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91091
Approved by: https://github.com/kit1980
2022-12-20 23:38:57 +00:00
dafd0432ee Update __init__.py (#91196)
Fixes #91080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91196
Approved by: https://github.com/janeyx99
2022-12-20 23:38:25 +00:00
712170e929 [threaded pg] adapt test_pointwise_ops.py (#90713)
Differential Revision: [D42153660](https://our.internmc.facebook.com/intern/diff/D42153660)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90713
Approved by: https://github.com/wanchaol
2022-12-20 23:37:40 +00:00
a6dcebf997 [threaded pg] make exception handling consistent with MultiProcessTestCase (#90712)
Differential Revision: [D42153661](https://our.internmc.facebook.com/intern/diff/D42153661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90712
Approved by: https://github.com/wanchaol
2022-12-20 23:37:40 +00:00
34da446072 [threaded pg] add assertion util to MultiThreadedTestCase (#90595)
Differential Revision: [D42153662](https://our.internmc.facebook.com/intern/diff/D42153662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90595
Approved by: https://github.com/wanchaol
2022-12-20 23:37:40 +00:00
c7e7ea92e2 [NamedOptimizer][2/N] Prepare the enablement of state_dict for FSDP (#91147)
1. Add param_group check logic and unit test
2. Remove unnecessary check for conditional param update
3. Return the param_group from the inner optimizer so that when param_group is None or not all params are specified, we still return the expected result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91147
Approved by: https://github.com/fegin
2022-12-20 23:23:04 +00:00
c248f2f379 [ROCm] Modify GPUs visibility code when starting docker container (#91031)
Use ROCR_VISIBLE_DEVICES to limit GPU visibility, in preparation for CI node upgrade to ROCm5.3 KFD and UB22.04.

### PROBLEM
After upgrading some of our CI nodes to UB22.04 and ROCm5.3KFD, rocminfo doesn't work inside the docker container if we use the following flags: `--device=/dev/dri/renderD128 --device=/dev/dri/renderD129`. It gives the error:

```
+ rocminfo
ROCk module is loaded
Failed to set mem policy for GPU [0x6b0d]
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1140
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
```

### WORKAROUND
Use `--device=/dev/dri` instead, and use `ROCR_VISIBLE_DEVICES` to limit GPU visibility inside container.

### BACKGROUND OF ORIGINAL CODE
We introduced these flags to prepare for 2 runners per CI node, to split up the GPU visibility among the runners: https://github.com/pytorch/pytorch/blame/master/.github/actions/setup-rocm/action.yml#L58
That effort - 2 runners per CI node - is still pending, and we might need to revisit this patch when we try to enable that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91031
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2022-12-20 23:23:00 +00:00
f460893cec Update optim.rst (#91195)
Fixes #91080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91195
Approved by: https://github.com/kit1980
2022-12-20 23:22:25 +00:00
c43209db4d use libdevice for tanh (#90889)
Per title
I see slight differences in perf with this implementation: standalone tanh is slightly slower for a tensor of 4,000,000 elements (20.4 us instead of 19.4 us); other sizes are within noise.
 @bertmaher could you check if it affects your benchmarks?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90889
Approved by: https://github.com/bertmaher, https://github.com/anijain2305
2022-12-20 23:21:37 +00:00
192a11d49c refactor the dfs cyclic search from recursion to iterative approach (#91042)
Follow up on PR #86511

Python's default recursion depth limit of 1000 makes it impractical to run the cyclic check on larger graphs. This refactor avoids that issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91042
Approved by: https://github.com/kit1980
2022-12-20 23:15:30 +00:00
e6fcf7ad9d Remove breakpoint (#91128)
This was left in https://github.com/pytorch/pytorch/pull/90026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91128
Approved by: https://github.com/kit1980
2022-12-20 22:14:35 +00:00
cdbca3563e Small operatorbench changes (#91027)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91027
Approved by: https://github.com/desertfire
2022-12-20 21:59:52 +00:00
83f4e30ea7 Use deque instead of list for BFS (#91139)
Using a list with `pop(0)` makes the search's running time quadratic instead of linear.
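
A generic sketch of the pattern (not the specific script changed here): `deque.popleft()` is O(1), whereas `list.pop(0)` shifts the whole list on every pop.

```python
from collections import deque

def bfs(graph, start):
    """Breadth-first search over an adjacency-dict graph."""
    seen = {start}
    queue = deque([start])              # previously a plain list with pop(0)
    while queue:
        node = queue.popleft()          # O(1) instead of O(n)
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```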

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91139
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/seemethere, https://github.com/malfet
2022-12-20 21:40:43 +00:00
649d0b6ae7 Add env var PYTORCH_TEST_RUN_EVERYTHING_IN_SERIAL=1 that allows running unit test suites in serial (#90981)
Running unit test suites in parallel sometimes creates unexpected errors. This PR adds an option that allows unit test suites to be executed in serial, by setting PYTORCH_TEST_RUN_EVERYTHING_IN_SERIAL=1.
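
A hedged sketch of opting into serial execution from a wrapper script (the env var name is from this PR; invoking `test/run_test.py` this way is just one possible usage):

```python
import os
import subprocess

env = dict(os.environ, PYTORCH_TEST_RUN_EVERYTHING_IN_SERIAL="1")
subprocess.run(["python", "test/run_test.py"], env=env, check=True)
```
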
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90981
Approved by: https://github.com/malfet, https://github.com/ptrblck
2022-12-20 21:20:59 +00:00
2f5759eaba Disable non-deterministic models for optimizers (#91149)
These two models are non-deterministic even with constant inputs and weights, and as a result they very occasionally fail in CI due to variations between the fp64 and fp32 models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91149
Approved by: https://github.com/desertfire
2022-12-20 20:19:54 +00:00
f8b348c1fc Update ProcessGroupRoundRobin (#91172)
Summary:
Temporary fix to unblock jobs in https://fb.workplace.com/groups/300451907202972/permalink/906337097050850/

Real fix would be to remove use of _round_robin_process_group API and update corresponding references (e.g. PyText)

Test Plan: sandcastle

Differential Revision: D42169592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91172
Approved by: https://github.com/awgu
2022-12-20 19:53:34 +00:00
5ed5dfd915 Don't run ios jobs on forks (#91112)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91112
Approved by: https://github.com/huydhn
2022-12-20 19:13:13 +00:00
34717b3ea8 nn/test_convolution to run in serial (#91113)
Unfortunately it takes 50 minutes on slow gradcheck, but that's on periodic.

It ends up taking >6000 MB of space (7440 MB available).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91113
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2022-12-20 19:12:43 +00:00
dabf515c18 [cuDNN][cuDNN V8 API] (re-re-re-open) cuDNN V8 API on by default (#91117)
Re-opening following #91025

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91117
Approved by: https://github.com/ngimel
2022-12-20 18:52:29 +00:00
28ceccec21 cleanup old python_compat code (#91162)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91162
Approved by: https://github.com/ezyang
2022-12-20 18:13:19 +00:00
346fd04076 Set cmake PATH on macos to address libzstd flakiness (#91142)
This is to address the recent flakiness issue on MacOS ARM64 https://hud.pytorch.org/failure/Library%20not%20loaded%3A%20%40rpath%2Flibzstd.1.dylib.

From what I see, the immediate cause is that `cmake` exec under `/Users/ec2-user/runner/_work/_temp/miniconda/pkgs/cmake-3.22.1-hae769c0_0/bin/` is used instead of the expected one under the temp CONDA_ENV, i.e. `/Users/ec2-user/runner/_work/_temp/conda_environment_3736476178/bin`.  I'm not quite sure what is the reason behind this flaky behavior, so I want to try a catch-all fix by setting the cmake PATH correctly

This PR also prints some debugging information w.r.t. the cmake PATH and cleans up some legacy code in the `macos-test.sh` script.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91142
Approved by: https://github.com/ZainRizvi
2022-12-20 17:35:05 +00:00
84e73e1269 [inductor] small CI improvements (#91140)
Summary: 1) Increase the timm_model download retry count; 2) Skip certain
random Triton failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91140
Approved by: https://github.com/williamwen42
2022-12-20 17:26:12 +00:00
6a757f1cbb Cleanup Windows pip dependencies (#88862)
The new Windows AMI from https://github.com/pytorch/test-infra/pull/1065 is now ready. All Windows pip dependencies are now part of the Windows AMI and can be cleaned up from the CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88862
Approved by: https://github.com/ZainRizvi
2022-12-20 17:19:24 +00:00
b63f0311a5 [MPS] Add floor_divide() op and its test case (#91126)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91126
Approved by: https://github.com/malfet
2022-12-20 17:02:29 +00:00
aec09eeb3a [FSDP][7/N] Support replicate in fully_shard (#91044)
This PR supports nesting `replicate` in `fully_shard`.
- The PR achieves this by treating `replicate`-annotated modules as ignored modules. This means that all submodules in the `replicate`-annotated module's subtree are ignored, including nested `fully_shard`-annotated modules, which is the desired behavior. See the sketch below.
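
A hedged sketch of the composition this enables (composable-API names as described; exact import paths may differ across versions, `MyModel` is a hypothetical module, and an initialized process group is assumed):

```python
import torch
from torch.distributed._composable import fully_shard, replicate

class MyModel(torch.nn.Module):          # hypothetical module with a submodule
    def __init__(self):
        super().__init__()
        self.sub = torch.nn.Linear(8, 8)
        self.head = torch.nn.Linear(8, 8)

    def forward(self, x):
        return self.head(self.sub(x))

model = MyModel()
replicate(model.sub)     # model.sub's subtree is treated as ignored by FSDP
fully_shard(model)       # the rest of the model is fully sharded
```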

---

This PR reworks some tree traversal.

One end goal is for `state._handles` to follow the same order for both the wrapper and composable paths. This implies that `_get_fsdp_handles()` returns the same value for both paths.
- The helper function `_get_fully_sharded_module_to_states()` now follows a left-to-right DFS from each fully sharded module instead of a BFS. The left-to-right DFS follows `.modules()` order.
- The composable auto "wrap" initialization function `_init_param_handles_from_module()` follows the reverse left-to-right DFS order. As noted in the code comments, this initialization order is a valid reverse topological sort, but it differs from the wrapper path. This is the _only_ difference with respect to initialization order through the entire process.
```
mod: Module(
    submod1: Submodule()
    submod2: Submodule(
        subsubmod: Subsubmodule(),
    ),
)
```
For left-to-right DFS, the order is `mod`, `submod1`, `submod2`, `subsubmod`. (For context, right-to-left DFS would be `mod`, `submod2`, `subsubmod`, `submod1`. In other words, the left-to-right vs. right-to-left corresponds to `.children()` vs. `reversed(.children())` respectively.) Then, reverse left-to-right DFS is `subsubmod`, `submod2`, `submod1`, `mod`, which is a valid initialization order. However, the wrapper auto wrap initialization order would be `submod1`, `subsubmod`, `submod2`, `mod` since it directly follows a left-to-right DFS and initializes as a part of the recursive DFS logic.
- At the end of `_init_param_handles_from_module()`, we reverse the newly populated `state._handles`, so this is the reverse reverse left-to-right DFS order, which is equivalent to the left-to-right DFS order. Thus, `state._handles` has the same order for both paths.

Another goal is for `_get_fsdp_states()` to not traverse into any submodule that is annotated with an API that is not compatible with `fully_shard` (e.g. `replicate`). To achieve this while preserving that `_get_fsdp_states()` follows `.modules()` order, we again use a left-to-right DFS.

The reason the DFSs may look strange is because I implemented them non-recursively, which requires a stack.

- `test_get_fully_sharded_module_to_states()` in `test_utils.py` checks the traversal order of `_get_fully_sharded_module_to_states()`.
- `test_policy()` in `test_fully_shard.py` checks the traversal order returned by `_get_fsdp_handles()`.

---

Due to a circular dependency issue, we must move the graph/tree traversal helpers to their own file `_traversal_utils.py`, and any usages must import the entire file like `import torch.distributed.fsdp._traversal_utils as traversal_utils` instead of `from torch.distributed.fsdp._traversal_utils import ...`.

The cycle comes from the fact that the traversals require `_composable()`, which requires `_get_registry()` from `composable/contract.py`, which when imported, imports `composable/fully_shard.py`, which requires the traversals.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91044
Approved by: https://github.com/mrshenli
2022-12-20 16:49:18 +00:00
e81ccfd1ed [FSDP][6/N] Add note explaining idioms for _FSDPState traversal (#90959)
This adds a note to explain how to do traversal in the new code base. These traversal helper methods were introduced in [1/N], [3/N], and [5/N].

I am working on updating the traversal helpers to account for other composable APIs (e.g. `replicate`). The rule is that the traversal should not proceed into an incompatible API's tree. This will be needed for `fully_shard` to be above `replicate`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90959
Approved by: https://github.com/mrshenli
2022-12-20 16:49:18 +00:00
32fde53713 [FSDP][5/N] Add manual "wrapping" support for fully_shard (#90874)
This PR adds manual "wrapping" support for `fully_shard`. For example, for
```
fully_shard(mod.sub)
fully_shard(mod)
```
`mod.sub` and `mod` will share the same FSDP data structures.

To have parity with wrapper FSDP, this PR only checks support for when each manual application of `fully_shard` passes `policy=None`. Hybrid auto / manual wrapping is not in scope for this PR since it is not supported for wrapper FSDP either. I can follow up to either add support properly or raise and error early.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90874
Approved by: https://github.com/mrshenli
2022-12-20 16:49:15 +00:00
da9af9868e [FSDP][4/N] Refactor func to share state/init handle attrs (#90871)
For `limit_all_gathers`, if we do not enforce that they all have the same value, then the entire semantics guaranteed by the `bool` can be violated. It could be as if none of them set that value to be `True`.

For `use_orig_params`, optimizer state dict assumes that the value is the same for all FSDP instances.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90871
Approved by: https://github.com/mrshenli
2022-12-20 16:49:13 +00:00
3194281ca7 Revert "use scatter_add for index_add when dim is the most inner dim (#88729)"
This reverts commit 13dbad63696f0ad39d63e4457eeebf800fb80dff.

Reverted https://github.com/pytorch/pytorch/pull/88729 on behalf of https://github.com/desertfire due to causing inductor test failure
2022-12-20 15:19:54 +00:00
07c340bb2a Remove debug code (#91148)
Removes some debug code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91148
Approved by: https://github.com/desertfire, https://github.com/williamwen42
2022-12-20 15:00:55 +00:00
2d68cc4bc2 Add cu118 workflows (#90826)
CC @atalman @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90826
Approved by: https://github.com/atalman
2022-12-20 14:34:18 +00:00
289f06434c [dynamo] check buffers when checking accuracy (#91037)
Tested by running `python benchmarks/dynamo/torchbench.py --accuracy --float32 -dcuda --output=inductor_torchbench_float32_training_cuda_performance.csv --training --inductor --no-skip --dashboard --only mobilenet_v2 --cold_start_latency` and breakpointing after the changes to inspect buffers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91037
Approved by: https://github.com/anijain2305
2022-12-20 13:57:25 +00:00
17b80bfaf3 Update patch release cherry pick condition (#90220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90220
Approved by: https://github.com/ezyang, https://github.com/seemethere
2022-12-20 13:56:43 +00:00
0eb45d546c Bind autograd current Node for debugging purposes (#90867)
This makes it possible to know, at any point during the backward pass, what is running and where the currently running Node was created:
```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode
from torch.autograd import detect_anomaly

class MyMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args, kwargs=None):
        node = torch._C._current_autograd_node()
        print(f"Running {func} from within {node}")
        if node is not None:
            print("The Node was created at:")
            print("\n  ".join(node.metadata["traceback_"]))
        return func(*args, **kwargs or {})

with MyMode(), detect_anomaly():
    print("FW")
    a = torch.rand(10, requires_grad=True)
    b = a.mul(2)
    b = b.div(3)
    b = b.sum()
    print("BW")
    b.backward()
```

Gives
```
$ python foo.py
foo.py:15: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with MyMode(), detect_anomaly():
FW
Running aten.rand.default from within None
Running aten.mul.Tensor from within None
Running aten.div.Tensor from within None
Running aten.sum.default from within None
BW
Running aten.ones_like.default from within None
Running aten.expand.default from within <SumBackward0 object at 0x7fa40c0c6dc0>
The Node was created at:
  File "foo.py", line 20, in <module>
    b = b.sum()

Running aten.isnan.default from within <SumBackward0 object at 0x7fa40c0c6500>
The Node was created at:
  File "foo.py", line 20, in <module>
    b = b.sum()

Running aten.any.default from within <SumBackward0 object at 0x7fa32b23a780>
The Node was created at:
  File "foo.py", line 20, in <module>
    b = b.sum()

Running aten._local_scalar_dense.default from within <SumBackward0 object at 0x7fa40c0c9190>
The Node was created at:
  File "foo.py", line 20, in <module>
    b = b.sum()

Running aten.div.Tensor from within <DivBackward0 object at 0x7fa40c0c9190>
The Node was created at:
  File "foo.py", line 19, in <module>
    b = b.div(3)

Running aten.isnan.default from within <DivBackward0 object at 0x7fa40c0c9190>
The Node was created at:
  File "foo.py", line 19, in <module>
    b = b.div(3)

Running aten.any.default from within <DivBackward0 object at 0x7fa40c0c9190>
The Node was created at:
  File "foo.py", line 19, in <module>
    b = b.div(3)

Running aten._local_scalar_dense.default from within <DivBackward0 object at 0x7fa40c0c9190>
The Node was created at:
  File "foo.py", line 19, in <module>
    b = b.div(3)

Running aten.mul.Tensor from within <MulBackward0 object at 0x7fa40c0c9190>
The Node was created at:
  File "foo.py", line 18, in <module>
    b = a.mul(2)

Running aten.isnan.default from within <MulBackward0 object at 0x7fa40c0c9190>
The Node was created at:
  File "foo.py", line 18, in <module>
    b = a.mul(2)

Running aten.any.default from within <MulBackward0 object at 0x7fa40c0c9190>
The Node was created at:
  File "foo.py", line 18, in <module>
    b = a.mul(2)

Running aten._local_scalar_dense.default from within <MulBackward0 object at 0x7fa40c0c9190>
The Node was created at:
  File "foo.py", line 18, in <module>
    b = a.mul(2)

Running aten.detach.default from within <AccumulateGrad object at 0x7fa40c0c9730>
The Node was created at:
  File "foo.py", line 18, in <module>
    b = a.mul(2)

Running aten.detach.default from within <AccumulateGrad object at 0x7fa40c0c94b0>
The Node was created at:
  File "foo.py", line 18, in <module>
    b = a.mul(2)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90867
Approved by: https://github.com/soulitzer
2022-12-20 13:41:43 +00:00
13dbad6369 use scatter_add for index_add when dim is the most inner dim (#88729)
### Motivation
When dim is -1 and the slice of source or result is noncontiguous, the original `index_add` is slow because it uses `add` for the sliced tensor, which is serial over the index and parallel over the sliced tensor to avoid write conflicts. Parallelizing over the sliced tensor is not optimal, since the sliced tensor may not be big enough to parallelize well, and it also incurs multiple parallel regions.

`scatter_add` is used to speed up this case: `scatter_add` parallelizes over the outer dimensions of the input and is serial over the inner dimension to avoid write conflicts. `scatter_add` needs only one parallel region, and the outer dimensions are larger, so parallelization is more effective.
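
A small sketch of the equivalence being exploited (illustrative shapes, not the kernel code):

```python
import torch

x = torch.zeros(4, 6)
src = torch.randn(4, 3)
index = torch.tensor([0, 2, 5])

out_index_add = x.clone().index_add_(1, index, src)
# scatter_add over the inner dim with the index broadcast to src's shape
out_scatter = x.clone().scatter_add_(1, index.repeat(4, 1), src)

assert torch.allclose(out_index_add, out_scatter)
```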

### Testing

- Single core:

Before:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 2.82E-03 | 2.11E-03
[10, 128, 50, 50] | 0.023604 | 0.023794

After:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 9.30E-04 | 1.66E-03
[10, 128, 50, 50] | 0.005995 | 0.010003

- Single socket (28 cores):

Before:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 2.96E-03 | 2.52E-03
[10, 128, 50, 50] | 0.012208 | 0.012568

After:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 7.44E-05 | 1.33E-04
[10, 128, 50, 50] | 0.000333 | 0.000469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88729
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2022-12-20 13:12:36 +00:00
63b8ecc415 [CUDA12] Make PyTorch compatible with CUDA 12 (#91118)
Fix the failure when building PyTorch from source code using CUDA 12

```
In file included from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAFunctions.h:12,
                 from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAStream.h:10,
                 from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAGraphsC10Utils.h:3,
                 from /home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.h:5,
                 from /home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:2:
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp: In member function ‘void at::cuda::CUDAGraph::capture_end()’:
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:168:75: warning: converting to non-pointer type ‘long long unsigned int’ from NULL [-Wconversion-null]
     AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0));
                                                                           ^
/home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAException.h:31:42: note: in definition of macro ‘C10_CUDA_CHECK’
     C10_UNUSED const cudaError_t __err = EXPR;                           \
                                          ^~~~
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:168:5: note: in expansion of macro ‘AT_CUDA_CHECK’
     AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0));
     ^~~~~~~~~~~~~
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:168:75: error: too many arguments to function ‘cudaError_t cudaGraphInstantiate(CUgraphExec_st**, cudaGraph_t, long long unsigned int)’
     AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0));
                                                                           ^
/home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAException.h:31:42: note: in definition of macro ‘C10_CUDA_CHECK’
     C10_UNUSED const cudaError_t __err = EXPR;                           \
                                          ^~~~
/home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:168:5: note: in expansion of macro ‘AT_CUDA_CHECK’
     AT_CUDA_CHECK(cudaGraphInstantiate(&graph_exec_, graph_, NULL, NULL, 0));
     ^~~~~~~~~~~~~
In file included from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAStream.h:6,
                 from /home/jianyuhuang/Work/Github/pytorch/c10/cuda/CUDAGraphsC10Utils.h:3,
                 from /home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.h:5,
                 from /home/jianyuhuang/Work/Github/pytorch/aten/src/ATen/cuda/CUDAGraph.cpp:2:
/usr/local/cuda/include/cuda_runtime_api.h:11439:39: note: declared here
 extern __host__ cudaError_t CUDARTAPI cudaGraphInstantiate(cudaGraphExec_t *pGraphExec, cudaGraph_t graph, unsigned long long flags __dv(0));
                                       ^~~~~~~~~~~~~~~~~~~~
ninja: build stopped: subcommand failed.
```

```
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp: In function ‘void torch::cuda::shared::initCudartBindings(PyObject*)’:
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:34:13: error: ‘cudaOutputMode_t’ was not declared in this scope
   py::enum_<cudaOutputMode_t>(
             ^~~~~~~~~~~~~~~~
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:34:13: note: suggested alternative: ‘cudaGraphNode_t’
   py::enum_<cudaOutputMode_t>(
             ^~~~~~~~~~~~~~~~
             cudaGraphNode_t
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:34:29: error: template argument 1 is invalid
   py::enum_<cudaOutputMode_t>(
                             ^
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:38:30: error: ‘cudaKeyValuePair’ was not declared in this scope
       .value("KeyValuePair", cudaKeyValuePair)
                              ^~~~~~~~~~~~~~~~
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:39:21: error: ‘cudaCSV’ was not declared in this scope
       .value("CSV", cudaCSV);
                     ^~~~~~~
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:39:21: note: suggested alternative: ‘cudart’
       .value("CSV", cudaCSV);
                     ^~~~~~~
                     cudart
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:99:7: error: ‘cudaProfilerInitialize’ was not declared in this scope
       cudaProfilerInitialize);
       ^~~~~~~~~~~~~~~~~~~~~~
/home/jianyuhuang/Work/Github/pytorch/torch/csrc/cuda/shared/cudart.cpp:99:7: note: suggested alternative: ‘cudaProfilerStart’
       cudaProfilerInitialize);
       ^~~~~~~~~~~~~~~~~~~~~~
       cudaProfilerStart
ninja: build stopped: subcommand failed.
```

After these fixes, PyTorch builds successfully with CUDA 12 following the OSS build instructions.

USE_CUDA=1 python setup.py develop  2>&1 | tee compile.log
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91118
Approved by: https://github.com/ngimel, https://github.com/brad-mengchi
2022-12-20 10:58:53 +00:00
7c58f1d4e8 Update dynamo xla test to make it part of the xla CI (#91130)
XLA side pr to enable the test https://github.com/pytorch/xla/pull/4370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91130
Approved by: https://github.com/shunting314
2022-12-20 09:29:44 +00:00
29b119d04d [Checkpoint] Add test for fsdp model state saving and loading with/without resharding (#90950)
As title.

https://github.com/pytorch/pytorch/issues/90960
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90950
Approved by: https://github.com/fduwjj
2022-12-20 08:13:53 +00:00
7330eabe36 fully_shard load state_dict (#90945)
Ensures that load_state_dict for fully_shard works:
- Don't add back FSDP prefix
- Small fix to ensure the mixed precision check for buffers works

Follow ups:
- state_dict_type does not work, blocking rank0_only and CPU offload as well as other state dict implementations
- No testing when wrapped with AC, using mixed precision, integration with distributed checkpoint, etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90945
Approved by: https://github.com/awgu
2022-12-20 07:26:43 +00:00
95a115dd07 Revert "use libdevice for tanh (#90889)"
This reverts commit 0148809131f494b842baf50d1f392f7404b87b44.

Reverted https://github.com/pytorch/pytorch/pull/90889 on behalf of https://github.com/ngimel due to breaking test
2022-12-20 06:29:45 +00:00
511fbad830 [Dynamo] Fix builder for class with metaclass (#90807)
Fixes a Meta-internal use case: a class with a metaclass can't be identified as `UserDefinedClassVariable`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90807
Approved by: https://github.com/jansel
2022-12-20 05:02:28 +00:00
ef2bb9ca04 Revert "When nopython=True, Dynamo can't allow graph breaks. (#90970)"
This reverts commit 7e9bf2ed860b8b60d252eead4cc457c3fe5f1667.

Reverted https://github.com/pytorch/pytorch/pull/90970 on behalf of https://github.com/kit1980 due to The inductor test fails on master every time after this PR
2022-12-20 04:43:26 +00:00
0f57e7f2d9 Do not run inductor perf test with postnightly branch (#91133)
The inductor performance test job is triggered every night by the pull request push event from https://github.com/pytorch/pytorch/pull/27167.
Since we are already running the job three times a day, there is no need to run this test on the postnightly branch. Plus, the postnightly branch currently fails dozens of tests due to a "docker argument too long" error.

Example workflow: https://github.com/pytorch/pytorch/actions/runs/3731250111
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91133
Approved by: https://github.com/clee2000, https://github.com/malfet, https://github.com/seemethere, https://github.com/desertfire
2022-12-20 03:40:54 +00:00
857ed2d7dd [Inductor] Replace graph.eliminate_dead_code() with graph.erase_node() in Permute Fusion (#91014)
Summary: As the FX passes for permute fusion run before functionalization, it is safer to replace `graph.eliminate_dead_code()` with `graph.erase_node()` to avoid cases where `graph.eliminate_dead_code()` might remove mutation nodes

Test Plan: Unit Tests & CI

Reviewed By: jansel

Differential Revision: D41904755

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91014
Approved by: https://github.com/jansel
2022-12-20 03:26:25 +00:00
d6dd2e97da [inductor] Rewrite Triton templates + epilogue fusion (retry) (#91105)
https://github.com/pytorch/pytorch/pull/90738 seems a bit borked. ghimport fails on it, and I unlinked it from the Phabricator diff, but it still won't land. This is an exact copy of that PR without using ghstack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91105
Approved by: https://github.com/ngimel
2022-12-20 02:38:23 +00:00
3bd37ff2d5 Removing invalid git option when updating submodules (#91132)
Same as this: https://github.com/pytorch/builder/pull/1246
Related to the following git commit: 51243f9f0f, which makes jobs = 0 invalid.

Nightlies for MacOS are failing because of this issue: https://github.com/pytorch/pytorch/actions/runs/3729522653/jobs/6325523414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91132
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere
2022-12-20 02:17:02 +00:00
0148809131 use libdevice for tanh (#90889)
Per title
I see slight differences in perf with this implementation: standalone tanh is slightly slower for a tensor of 4,000,000 elements (20.4 us instead of 19.4 us); other sizes are within noise.
 @bertmaher could you check if it affects your benchmarks?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90889
Approved by: https://github.com/bertmaher, https://github.com/anijain2305
2022-12-20 02:11:53 +00:00
30edd39bdc Fix non-existing parameters in docstrings in benchmarks (#91115)
This is a continuation of https://github.com/pytorch/pytorch/pull/90505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91115
Approved by: https://github.com/clee2000
2022-12-20 02:07:32 +00:00
99bd8d12e1 Fix non-existing parameters in docstrings in misc places (#91121)
This should be the last continuation of https://github.com/pytorch/pytorch/pull/90505

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91121
Approved by: https://github.com/clee2000
2022-12-20 02:01:37 +00:00
0210d508cc Fix terminology within linalg.slogdet docs (#91129)
This issue was raised in https://github.com/data-apis/array-api/pull/567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91129
Approved by: https://github.com/kit1980
2022-12-20 01:55:27 +00:00
a5eb564ba4 [Quant] lower fused LinearTanh for onednn backend (#89188)
**Summary**
Add fuser method and quantization mappings for `QLinearLeakyReLU` for int8 inference for onednn backend. The fusion and lowering are supported only in FX mode.

**Test plan**
python test_quantization.py TestFuseFx TestQuantizeFx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89188
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2022-12-20 01:30:21 +00:00
666d218055 TorchDynamo: set output stride using eager output for cat (#89477)
For the squeezenet1_1 and densenet121 models, cat's post op is always a convolution. On the channels-last path, the current cat lowering always sets the output to contiguous format, but the convolution's input requires channels last, so there is always a memory copy before the convolution. This PR uses the eager model's output format to set cat's output format and reduce the memory copies (a minimal eager repro of this pattern is sketched after the two listings below).

Before:
```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/ik/cikrybpw4xhois4wll6h5afsswjrhpsb6gslcxrntzqtlyw2btey.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       const float* __restrict__ in_ptr2,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1,
                       float* __restrict__ out_ptr2)
{
    #pragma GCC ivdep
    for(long i0=0; i0<3; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<256; i1+=1)
        {
            {
                {
                    auto tmp0 = in_ptr0[i0 + (3*i1)];
                    out_ptr0[i1 + (256*i0)] = tmp0;
                }
            }
        }
    }
    #pragma GCC ivdep
    for(long i0=0; i0<3; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<256; i1+=1)
        {
            {
                {
                    auto tmp0 = in_ptr1[i0 + (3*i1)];
                    out_ptr1[i1 + (256*i0)] = tmp0;
                }
            }
        }
    }
    #pragma GCC ivdep
    for(long i0=0; i0<6; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<256; i1+=1)
        {
            {
                {
                    auto tmp0 = in_ptr2[i1 + (256*i0)];
                    out_ptr2[i0 + (6*i1)] = tmp0;
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1 = args
    args.clear()
    buf2 = empty_strided((1, 6, 16, 16), (1536, 256, 16, 1), device='cpu', dtype=torch.float32)
    buf0 = as_strided(buf2, (1, 3, 16, 16), (1536, 256, 16, 1))  # alias
    buf1 = as_strided(buf2, (1, 3, 16, 16), (1536, 256, 16, 1), 768)  # alias
    buf3 = empty_strided((1, 6, 16, 16), (1536, 1, 96, 6), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg2_1.data_ptr()), c_void_p(arg3_1.data_ptr()), c_void_p(buf2.data_ptr()), c_void_p(buf0.data_ptr()), c_void_p(buf1.data_ptr()), c_void_p(buf3.data_ptr()))
    del arg2_1
    del arg3_1
    del buf0
    del buf1
    del buf2
    buf4 = aten.convolution(buf3, arg0_1, arg1_1, (1, 1), (0, 0), (1, 1), False, (0, 0), 1)
    assert_size_stride(buf4, (1, 3, 16, 16), (768, 1, 48, 3))
    del arg0_1
    del arg1_1
    return (buf4, )

```

after:
```
from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/ik/cikrybpw4xhois4wll6h5afsswjrhpsb6gslcxrntzqtlyw2btey.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1)
{
    #pragma GCC ivdep
    for(long i0=0; i0<256; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            {
                {
                    auto tmp0 = in_ptr0[i1 + (3*i0)];
                    out_ptr0[i1 + (6*i0)] = tmp0;
                }
            }
        }
    }
    #pragma GCC ivdep
    for(long i0=0; i0<256; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            {
                {
                    auto tmp0 = in_ptr1[i1 + (3*i0)];
                    out_ptr1[i1 + (6*i0)] = tmp0;
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1 = args
    args.clear()
    buf2 = empty_strided((1, 6, 16, 16), (1536, 1, 96, 6), device='cpu', dtype=torch.float32)
    buf0 = as_strided(buf2, (1, 3, 16, 16), (1536, 1, 96, 6))  # alias
    buf1 = as_strided(buf2, (1, 3, 16, 16), (1536, 1, 96, 6), 3)  # alias
    kernel_cpp_0(c_void_p(arg2_1.data_ptr()), c_void_p(arg3_1.data_ptr()), c_void_p(buf0.data_ptr()), c_void_p(buf1.data_ptr()))
    del arg2_1
    del arg3_1
    del buf0
    del buf1
    buf3 = aten.convolution(buf2, arg0_1, arg1_1, (1, 1), (0, 0), (1, 1), False, (0, 0), 1)
    assert_size_stride(buf3, (1, 3, 16, 16), (768, 1, 48, 3))
    del arg0_1
    del arg1_1
    return (buf3, )
```
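For reference, a minimal eager repro of the pattern the listings above illustrate (shapes are made up for illustration):
```python
import torch

# Two channels-last tensors are concatenated and fed to a convolution; the
# cat output should stay channels_last to avoid an extra copy before the conv.
x = torch.randn(1, 3, 16, 16).to(memory_format=torch.channels_last)
y = torch.randn(1, 3, 16, 16).to(memory_format=torch.channels_last)
conv = torch.nn.Conv2d(6, 3, kernel_size=1).to(memory_format=torch.channels_last)
out = conv(torch.cat([x, y], dim=1))
print(out.is_contiguous(memory_format=torch.channels_last))
```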

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89477
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2022-12-20 01:09:17 +00:00
b309599d1b Add catch socket.gaierror for _matches_machine_hostname (#91119)
Summary: Catch `socket.gaierror` in `_matches_machine_hostname`.

Test Plan: Unit tests again

Reviewed By: kurman

Differential Revision: D42152245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91119
Approved by: https://github.com/kurman
2022-12-20 00:57:53 +00:00
ebea45fe41 [MPS] Fix the assert in Garbage Collector (#91106)
- Enable high watermark ratio to limit the memory allocations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91106
Approved by: https://github.com/kulinseth
2022-12-20 00:53:24 +00:00
1d3e7fcc3b [pytorch profiler] Add step tracker logic to handle multiple sources of step increments (#90880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90880

# Summary
Enables multiple step trackers. Previously there was only one place to mark that a step() had occurred in the program: the PyTorch profiler's step().
We are now working on adding an Optimizer step hook - https://github.com/pytorch/pytorch/issues/88446
- This could mean that programs which already call profiler.step() every iteration end up double-incrementing steps.
- If a model uses multiple optimizers, we can also double- (or more) count the step.

## Solution
We fix this by adding a layer of abstraction before calling step() on the kineto library. The idea is to maintain per-requester step counts in a dictionary:
```
{
   "ProfilerStep": 100,  # triggered by profiler step() call
   "Optimizer1Step": 100,   # Optimizer 1 or 2 are just examples, could be SGD, Adam etc
   "Optimizer2Step": 100,
}
```
To figure out the global step count just take max on the dict values (100).
```
{
   "ProfilerStep": 100,
   "Optimizer1Step": 101,   # Optimizer1 got incremented first say
   "Optimizer2Step": 100,
}
```
Then global step count is 101

## Calling kineto
We only call the kineto step() function when global count increments.
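A minimal sketch of this bookkeeping (names are illustrative, not the actual profiler internals):
```python
from typing import Dict

_step_counts: Dict[str, int] = {}
_global_step = 0

def mark_step(requester: str) -> None:
    global _global_step
    _step_counts[requester] = _step_counts.get(requester, 0) + 1
    new_global = max(_step_counts.values())
    if new_global > _global_step:
        # Only forward the step to kineto when the global count really advances.
        _global_step = new_global
        # ... the kineto step() call would go here ...
```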

# Test Plan:
Added a unit test
   buck2 run mode/dev-nosan caffe2/test:profiler

Differential Revision: D41751157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90880
Approved by: https://github.com/chaekit
2022-12-20 00:48:01 +00:00
41846e205e [torch.func] Setup torch.func, populate it with all transforms (#91016)
This PR sets up torch.func and populates it with the following APIs:
- grad
- grad_and_value
- vjp
- jvp
- jacrev
- jacfwd
- hessian
- functionalize
- vmap

It also renames all instances of `functorch` in the APIs for those docs
to `torch.func`.

We rewrite the `__module__` fields on some of the above APIs so that the
APIs fit PyTorch's public api definition.
- For an API to be public, it must have a `__module__` that points to a
  public PyTorch submodule. However, `torch._functorch.eager_transforms`
  is not public due to the leading underscore.
- The solution is to rewrite `__module__` to point to where the API is
  exposed (torch.func). This is what both Numpy and JAX do for their
  APIs.
- h/t pmeier in
  https://github.com/pytorch/pytorch/issues/90284#issuecomment-1348595246
  for idea and code
- The helper function, `exposed_in`, is confined to
  torch._functorch/utils for now because we're not completely sure if
  this should be the long-term solution.

Implication for functorch.* APIs:
- functorch.grad is the same object as torch.func.grad
- this means that the functorch.grad docstring is actually the
  torch.func.grad docstring and will refer to torch.func instead of
  functorch.
- This isn't really a problem since the plan on record is to deprecate
  functorch in favor of torch.func. We can fix these if we really want,
  but I'm not sure if a solution is worth maintaining.

Test Plan:
- view docs preview

Future:
- vmap should actually just be torch.vmap. This requires an extra step
  where I need to test internal callsites, so, I'm separating it into a
  different PR.
- make_fx should be in torch.func to be consistent with `import
  functorch`. This one is a bit more of a headache to deal with w.r.t.
  public api, so going to deal with it separately.
- beef up func.rst with everything else currently on the functorch
  documentation website. func.rst is currently just an empty shell.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91016
Approved by: https://github.com/samdow
2022-12-20 00:00:52 +00:00
cad1ce6158 Stop using :attr: in functorch docs (#91015)
We're using :attr: wrong. :attr: refers to an attribute of a Python
object, not the parameter to a function:
- https://www.sphinx-doc.org/en/master/usage/restructuredtext/domains.html#role-py-attr

This leads to some weird things when moving to torch.func: sphinx
decides to link torch.func for :attr:`func`

Test Plan:
- docs preview.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91015
Approved by: https://github.com/samdow
2022-12-20 00:00:52 +00:00
7e9bf2ed86 When nopython=True, Dynamo can't allow graph breaks. (#90970)
I count the number of sub-graphs (for tiny-GPT2 in huggingface) by
```
    class GraphCaptureCompiler:
        def __init__(self):
            self.captured_graphs = []
        def compile(self, gm, example_inputs):
            self.captured_graphs.append(gm)
            return gm
    compiler = GraphCaptureCompiler()
    torch._dynamo.optimize(compiler, nopython=True)(Wrapper(fn))(*args)
```

Although `len(compiler.captured_graphs)` is 2, no error was thrown during the compilation. This observation conflicts with `nopython=True`. After some digging, I found that a check was missing before making a graph break. This PR adds it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90970
Approved by: https://github.com/ezyang, https://github.com/jansel
2022-12-19 23:43:28 +00:00
d1772aff60 Autocast support for scaled_dot_product_attention (#91066)
Summary: Autocast support for scaled_dot_product_attention

Test Plan: sandcastle and guthub cicd

Differential Revision: D42085525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91066
Approved by: https://github.com/ngimel, https://github.com/drisspg
2022-12-19 23:42:26 +00:00
fadf222661 Propagate guard failures to userland (#91053)
Previously we would abort() but this is annoying when you're running
pytest or something.  Don't hard crash.

It would be nice to apply this treatment to the other uses of CHECK
macro in this file, but it was just guards that was bothering me.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91053
Approved by: https://github.com/jansel
2022-12-19 23:39:48 +00:00
7bc3467fff Delete dynamic_propagation config (#91040)
Per https://github.com/pytorch/torchdynamo/issues/1949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91040
Approved by: https://github.com/jansel
2022-12-19 22:42:11 +00:00
7ebc45eadd [dynamo] Better error message for bad timm model name (#91049)
Fixes https://github.com/pytorch/torchdynamo/issues/1995

Running `python benchmarks/dynamo/timm_models.py --performance --float32 -dcuda --output=out.csv --training --inductor --only bad_model_name` gives
```
Traceback (most recent call last):
  File "benchmarks/dynamo/timm_models.py", line 338, in <module>
    main(TimmRunnner())
  File "/scratch/williamwen/work/pytorch/benchmarks/dynamo/common.py", line 1660, in main
    return maybe_fresh_cache(run, args.cold_start_latency and args.only)(
  File "/scratch/williamwen/work/pytorch/benchmarks/dynamo/common.py", line 833, in inner
    return fn(*args, **kwargs)
  File "/scratch/williamwen/work/pytorch/benchmarks/dynamo/common.py", line 2000, in run
    ) = runner.load_model(device, model_name, batch_size=batch_size)
  File "benchmarks/dynamo/timm_models.py", line 215, in load_model
    raise RuntimeError(f"Failed to load model '{model_name}'")
RuntimeError: Failed to load model 'bad_model_name'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91049
Approved by: https://github.com/ezyang
2022-12-19 22:37:34 +00:00
322e4b4c8a set -Wsuggest-override for builds (#89852)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/89852).
* __->__ #89852
* #89851

set -Wsuggest-override for builds

Summary: This was flagged by a Meta internal build.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89852
Approved by: https://github.com/malfet
2022-12-19 22:08:47 +00:00
8ecb49b8fb [MPS] Add Inverse op. (#90428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90428
Approved by: https://github.com/DenisVieriu97, https://github.com/malfet
2022-12-19 22:00:12 +00:00
58b5a9df00 Update to sdp benchmark to take into account pt2.0 stack (#90096)
Updates the SDP benchmark to fix failures due to SDP being included in nn.f.mha, and also compares against the compiled version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90096
Approved by: https://github.com/cpuhrsch
2022-12-19 21:59:21 +00:00
909b7ca92a [torchgen] Move Executorch codegen logic into torchgen (#90806)
## Codegen entry point

Main logic and Executorch codegen entry: `gen_executorch.py`.

`RegisterCodegenUnboxedKernels.cpp`:
```cpp
register_operators({
	Operator(
		"aten::add.out",
		[](EValue** stack) {
			EValue& self = *stack[0];
			EValue& other = *stack[1];
			EValue& alpha = *stack[2];
			EValue& out = *stack[3];

			const at::Tensor & self_base = self.to<at::Tensor>();
			const at::Tensor & other_base = other.to<at::Tensor>();
			const at::Scalar & alpha_base = alpha.to<at::Scalar>();
			at::Tensor & out_base = out.to<at::Tensor>();

			EXECUTORCH_SCOPE_PROF("native_call_add.out");
			torch::executor::aten::add_outf(self_base, other_base, alpha_base, out_base);
	})
);
```

`Functions.h`:
```cpp

namespace torch {
namespace executor {

namespace aten {

// aten::add_outf(Tensor self, Tensor other, Scalar alpha, *, Tensor(a!) out) -> Tensor(a!)
TORCH_API inline at::Tensor & add_outf(const at::Tensor & self, const at::Tensor & other, at::Scalar alpha, at::Tensor & out) {
    return at::add_outf(self, other, alpha, out);
}

} // namespace aten

} // namespace executor
} // namespace torch
```

* Unit tests: `test_executorch_gen.py`

CI job in next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90806
Approved by: https://github.com/ezyang
2022-12-19 21:58:43 +00:00
679da8bd89 [torchgen] Move Executorch custom ops logic into torchgen (#90099)
## Logic to handle custom ops
We generate files for custom ops, so that they can be registered into PyTorch.

Generated files:
* `Register{dispatch_key}CustomOps.cpp` (dispatch_key = CPU), it's basically the same as vanilla PyTorch `RegisterCPU.cpp`. The only difference is that we bind to native functions directly.
* `Register{dispatch_key}Stub.cpp` (dispatch_key = CPU), register placeholder kernels for custom ops. Only used when there's no custom op kernel available.

As an example:
```cpp
namespace {

at::Tensor & wrapper_out_unsqueeze_out(const at::Tensor & self, int64_t dim, at::Tensor & out) {
    // No device check

  // DeviceGuard omitted
  return torch::executor::native::unsqueeze_out(self, dim, out);
}
} // anonymous namespace

TORCH_LIBRARY_IMPL(aten, CPU, m) {

m.impl("unsqueeze.out",
TORCH_FN(wrapper_out_unsqueeze_out));
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90099
Approved by: https://github.com/ezyang
2022-12-19 21:58:43 +00:00
ca52f63fc0 [torchgen] Move Executorch unboxing logic into torchgen (#90098)
This PR adds `unboxing.py`, which converts an `EValue` (similar to `IValue`) to its corresponding C++ type, based on the `ExecutorchCppSignature`.

Added unit tests to it in `test_executorch_unboxing.py`. Notice that this unboxing logic should work for both ATen types and Executorch types, hence the unit tests are parametrized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90098
Approved by: https://github.com/ezyang
2022-12-19 21:58:43 +00:00
f02e93b584 jacrev : Support chunked computation (#89376)
Ref: https://github.com/pytorch/functorch/issues/680

We introduce a kwarg `chunk_size` in `jacrev` to control whether the Jacobian computation should be chunked and if so then `chunk_size` will dictate the maximum size of the chunks used.
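A minimal usage sketch of the new kwarg (the function and sizes are illustrative):
```python
import torch
import functorch

def f(x):
    return x.sin().sum(0)

x = torch.randn(128, 64)
jac_full = functorch.jacrev(f)(x)                     # one-shot Jacobian
jac_chunked = functorch.jacrev(f, chunk_size=16)(x)   # computed in chunks
print(torch.allclose(jac_full, jac_chunked))
```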

We try two approaches,
* Stacked Approach: Append the intermediate computation to a list and then stack those results.
* Pre-allocation Approach: Pre-allocate a zeros tensor and copy chunked computation into it.

For Memory Benchmark, see https://github.com/pytorch/pytorch/pull/89376#issuecomment-1348479098

Benchmark CPU : Performs better with more chunks/ smaller chunk_size.

NOTE: There seems to be a lot of noise for shape `(64, 64)`.

<details>

```
[----------------------------------------------- jacrev : device cpu : chunks 2 -----------------------------------------------]
                                     |  with chunk_size and stacked  |  without chunk_size  |  with chunk_size and pre-allocated
1 threads: ---------------------------------------------------------------------------------------------------------------------
      (64, 64) : chunk_size 2080     |               76.2            |          50.9        |                  80.1
      (128, 128) : chunk_size 8256   |             1172.8            |         783.3        |                1225.5
      (128, 144) : chunk_size 9288   |             1475.1            |         990.4        |                1548.3
      (144, 144) : chunk_size 10440  |             1871.3            |        1254.4        |                1971.2

Times are in milliseconds (ms).

[----------------------------------------------- jacrev : device cpu : chunks 3 ----------------------------------------------]
                                    |  with chunk_size and stacked  |  without chunk_size  |  with chunk_size and pre-allocated
1 threads: --------------------------------------------------------------------------------------------------------------------
      (64, 64) : chunk_size 1386    |               39.9            |          25.8        |                  58.8
      (128, 128) : chunk_size 5504  |             1182.6            |         782.2        |                1229.7
      (128, 144) : chunk_size 6192  |             1483.6            |         995.4        |                1550.6
      (144, 144) : chunk_size 6960  |             1879.1            |        1257.7        |                1960.5

Times are in milliseconds (ms).

[----------------------------------------------- jacrev : device cpu : chunks 4 ----------------------------------------------]
                                    |  with chunk_size and stacked  |  without chunk_size  |  with chunk_size and pre-allocated
1 threads: --------------------------------------------------------------------------------------------------------------------
      (64, 64) : chunk_size 1040    |               41.7            |          50.6        |                  29.1
      (128, 128) : chunk_size 4128  |             1171.6            |         782.3        |                1226.7
      (128, 144) : chunk_size 4644  |             1482.2            |         994.6        |                1550.9
      (144, 144) : chunk_size 5220  |             1870.2            |        1254.5        |                1961.4

Times are in milliseconds (ms).

[--------------------------------------------- jacrev : device cpu : chunks 100 ---------------------------------------------]
                                   |  with chunk_size and stacked  |  without chunk_size  |  with chunk_size and pre-allocated
1 threads: -------------------------------------------------------------------------------------------------------------------
      (64, 64) : chunk_size 41     |               46.8            |          50.5        |                  46.4
      (128, 128) : chunk_size 165  |              622.2            |         775.2        |                 656.0
      (128, 144) : chunk_size 185  |              803.9            |         987.3        |                 866.9
      (144, 144) : chunk_size 208  |             1021.1            |        1251.2        |                1088.2

Times are in milliseconds (ms).

[--------------------------------------------- jacrev : device cpu : chunks 200 ---------------------------------------------]
                                   |  with chunk_size and stacked  |  without chunk_size  |  with chunk_size and pre-allocated
1 threads: -------------------------------------------------------------------------------------------------------------------
      (64, 64) : chunk_size 20     |               60.9            |          50.2        |                  62.3
      (128, 128) : chunk_size 82   |              583.1            |         779.4        |                 634.3
      (128, 144) : chunk_size 92   |              834.1            |        1005.8        |                 472.3
      (144, 144) : chunk_size 104  |             1053.6            |        1277.0        |                1033.9

Times are in milliseconds (ms).

[--------------------------------------------- jacrev : device cpu : chunks 300 --------------------------------------------]
                                  |  with chunk_size and stacked  |  without chunk_size  |  with chunk_size and pre-allocated
1 threads: ------------------------------------------------------------------------------------------------------------------
      (64, 64) : chunk_size 13    |              77.7             |          50.4        |                  79.6
      (128, 128) : chunk_size 55  |             578.9             |         782.3        |                 626.9
      (128, 144) : chunk_size 61  |             718.2             |        1024.9        |                 800.4
      (144, 144) : chunk_size 69  |             919.7             |        1313.7        |                1023.0

Times are in milliseconds (ms).
```

</details>

Benchmark CUDA: Performs better with less chunks/bigger chunk_size.

<details>

```
[--------------------------------------------- jacrev : device cuda:1 : chunks 2 ----------------------------------------------]
                                     |  with chunk_size and stacked  |  without chunk_size  |  with chunk_size and pre-allocated
1 threads: ---------------------------------------------------------------------------------------------------------------------
      (64, 64) : chunk_size 2080     |             1485.7            |         923.8        |                1632.3
      (128, 128) : chunk_size 8256   |            25390.2            |       14103.2        |               33557.4
      (128, 144) : chunk_size 9288   |              801.7            |       16854.1        |               42894.6
      (144, 144) : chunk_size 10440  |             1003.5            |       21386.5        |               59648.5

Times are in microseconds (us).

3 / 3 : Shape (144, 144) : Device cuda:1 : chunks: 3
[--------------------------------------------- jacrev : device cuda:1 : chunks 3 ---------------------------------------------]
                                    |  with chunk_size and stacked  |  without chunk_size  |  with chunk_size and pre-allocated
1 threads: --------------------------------------------------------------------------------------------------------------------
      (64, 64) : chunk_size 1386    |             1474.5            |         924.5        |                1655.5
      (128, 128) : chunk_size 5504  |            25368.9            |       10156.0        |               34022.1
      (128, 144) : chunk_size 6192  |            25223.0            |       12933.7        |               56418.5
      (144, 144) : chunk_size 6960  |            24729.3            |       16367.4        |               68744.7

Times are in microseconds (us).

3 / 3 : Shape (144, 144) : Device cuda:1 : chunks: 4
[--------------------------------------------- jacrev : device cuda:1 : chunks 4 ---------------------------------------------]
                                    |  with chunk_size and stacked  |  without chunk_size  |  with chunk_size and pre-allocated
1 threads: --------------------------------------------------------------------------------------------------------------------
      (64, 64) : chunk_size 1040    |             1489.2            |         924.4        |                 1679.6
      (128, 128) : chunk_size 4128  |            25370.4            |        8987.4        |                57201.3
      (128, 144) : chunk_size 4644  |            32239.1            |       10136.2        |                72406.5
      (144, 144) : chunk_size 5220  |            40994.3            |       12867.8        |               108653.4

Times are in microseconds (us).

3 / 3 : Shape (144, 144) : Device cuda:1 : chunks: 100
[------------------------------------------- jacrev : device cuda:1 : chunks 100 --------------------------------------------]
                                   |  with chunk_size and stacked  |  without chunk_size  |  with chunk_size and pre-allocated
1 threads: -------------------------------------------------------------------------------------------------------------------
      (64, 64) : chunk_size 41     |            21121.8            |         924.2        |               22753.5
      (128, 128) : chunk_size 165  |            23679.7            |       14284.4        |               26758.2
      (128, 144) : chunk_size 185  |            30082.3            |       18063.3        |               33553.5
      (144, 144) : chunk_size 208  |            38175.6            |       22839.5        |               42030.0

Times are in microseconds (us).
```

</details>

Benchmark Script

<details>

```python
import functorch
import torch
import itertools
import time
from torch.utils.benchmark import Timer
from torch.utils.benchmark import Compare
import sys
import pickle
from torch import profiler

import math

def prod(l):
    prod = 1
    for el in l:
        prod *= el

    return prod

def fn(x, y):
    return x + y, x.sum(0)

shapes = ((64, 64), (128, 128), (128, 144), (144, 144))

for device in ('cpu', 'cuda:1'):
    if device == 'cuda:1':
        chunks = (2, 3, 4, 100,)
    else:
        chunks = (2, 3, 4, 100, 200, 300)
    for chunk in chunks:
        results = []
        for shape in shapes:
            x = torch.zeros(*shape, dtype=torch.float, device=device)
            y = x.sum()
            chunk_size = (prod(shape) + prod(shape[1:])) // chunk
            jacrev_fn_chunked = functorch.jacrev(fn, (0, 1), chunk_size=chunk_size)
            jacrev_fn_chunked_pre = functorch.jacrev(fn, (0, 1), chunk_size=chunk_size, _preallocate_and_copy=True)
            jacrev_fn = functorch.jacrev(fn, (0, 1), chunk_size=None)

            tasks = [("jacrev_fn_chunked(x, y)", "with chunk_size and stacked"),
                     ("jacrev_fn(x, y)", "without chunk_size"),
                     ("jacrev_fn_chunked_pre(x, y)", "with chunk_size and pre-allocated"),]
            timers = [Timer(stmt=stmt, label=f"jacrev : device {device} : chunks {chunk}", sub_label=f"{(shape)} : chunk_size {chunk_size}", description=desc, globals=globals()) for stmt, desc in tasks]

            for i, timer in enumerate(timers):
                results.append(
                    timer.blocked_autorange(min_run_time=2.)
                )
                print(f"\r{i + 1} / {len(timers)} : Shape {shape} : Device {device} : chunks: {chunk}", end="")
                sys.stdout.flush()

        print()
        comparison = Compare(results)
        comparison.print()
```

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89376
Approved by: https://github.com/zou3519
2022-12-19 20:04:21 +00:00
e2dc60c6cb [Vulkan + Profiler] Add Timestamp Adjustment Algorithm (#90672)
@bypass-github-export-checks

This change ensures that vulkan event start/end times are correctly synced with their parent CPU times.

This sometimes requires increasing CPU event durations (to fully contain their child events) and delaying CPU event start times (to prevent overlaps), so this should not be used unless Vulkan events are being profiled and it is ok to use this modified timestamp/duration information instead of the original information.

Differential Revision: [D39893109](https://our.internmc.facebook.com/intern/diff/D39893109/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39893109/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90672
Approved by: https://github.com/kimishpatel
2022-12-19 20:01:07 +00:00
0428de06ee [Vulkan + Profiler] Use 0 as Vulkan Event Durations During Tree Building (#90671)
@bypass-github-export-checks

This change ensures that parent/child relationships between vulkan events and their corresponding CPU events are established correctly. (Previously, if a vulkan event's duration was too long, it would not be made a child correctly).

This could be merged in with the preceding diff, but I wanted to separate it for now because I'm not sure what the most appropriate way is to pass through the events and adjust the in_tree_building_ flag (the way I have it now seems a bit awkward); keeping it separate for now makes it easier to understand/fix. Taylor, if you have feedback on this, let me know.

Differential Revision: [D40084788](https://our.internmc.facebook.com/intern/diff/D40084788/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90671
Approved by: https://github.com/kimishpatel
2022-12-19 19:58:53 +00:00
8c80a4684b [Vulkan + Profiler] Report Vulkan Events to Profiler in QueryPool (#90670)
@bypass-github-export-checks

With this change, we see Vulkan events reported on the generated chrometrace with proper names and durations.

However, their start/end times are not yet synced with the cpu event timeline, and their parent/child relationships are not established properly. These concerns will be addressed in future diffs

Differential Revision: [D39834807](https://our.internmc.facebook.com/intern/diff/D39834807/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39834807/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90670
Approved by: https://github.com/kimishpatel
2022-12-19 19:56:28 +00:00
193068cbcf [Vulkan + Profiler] Enable Processing Vulkan Events in Profiler (#90852)
@bypass-github-export-checks

This diff enables processing Vulkan events in the profiler. Passing the events from QueryPool, and making sure Vulkan events align correctly with their parent CPU events, will be handled later in this diff stack.

This diff was made by forking Taylor's scaffolding diff, D39779878, with a few changes:
- Rebasing + resolving merge conflicts
- Fixing (i.e. removing) auto import of profiler/containers.h
- Changing the activity type to CPU_OP which makes the vulkan events appear on chrometrace
- Moving timestamp adjustment scaffolding to D39893109

Differential Revision: [D39834805](https://our.internmc.facebook.com/intern/diff/D39834805/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90852
Approved by: https://github.com/mcr229
2022-12-19 19:54:32 +00:00
7badd0b9e6 [Vulkan] Store entries in a separate queue after resetting query pool (#90668)
@bypass-github-export-checks

We want to avoid tossing shader log entries when we reset the query pool, so that the old entries can be used by the profiler after all profiling data has been gathered.

```get_shader_name_and_execution_duration_ns``` is used for accessing shader names/durations after they are flushed. It will be used with the torch profiler.

Differential Revision: [D40119621](https://our.internmc.facebook.com/intern/diff/D40119621/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90668
Approved by: https://github.com/kimishpatel
2022-12-19 19:52:21 +00:00
c345755013 [FSDP] Fix _mp_shard record_stream() (#91096)
IIUC, I dropped a needed `record_stream` call in https://github.com/pytorch/pytorch/pull/83665. I think this was because my original version of the PR retired the pre-unshard stream, but after some quantitative investigation, I brought it back.

- We allocate the `_mp_shard` in the pre-unshard stream.
731f417f60/torch/distributed/fsdp/_runtime_utils.py (L260-L263)
- For sharded strategies, we consume the `_mp_shard` only in the unshard stream (for all-gather).
731f417f60/torch/distributed/fsdp/_runtime_utils.py (L270-L273)
731f417f60/torch/distributed/fsdp/flat_param.py (L1005-L1006)
- For `NO_SHARD`, we consume the `_mp_shard` in the unshard stream (for views) and in the default stream (for computation).
731f417f60/torch/distributed/fsdp/_runtime_utils.py (L304)
731f417f60/torch/distributed/fsdp/flat_param.py (L1256-L1261)
- We must call `record_stream(_mp_shard, current_stream)` when freeing so that the allocator knows about the usage in the current stream.
    - For sharded strategies, the free happens in `post_unshard()`, which runs in the unshard stream.
    - For `NO_SHARD`, the free happens in `post_reshard()`, which runs in the default stream.
    - Conveniently, for both, the current stream is the correct stream to synchronize. For `NO_SHARD`, the default stream waits for the unshard stream, so only recording in the default stream should suffice. (A minimal `record_stream` sketch follows this list.)
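A minimal sketch of the `record_stream` pattern described above (the streams and sizes are illustrative; requires a CUDA device):
```python
import torch

side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    # Allocated in one stream (analogous to the pre-unshard stream)...
    mp_shard = torch.empty(1024, device="cuda", dtype=torch.float16)

# ... consumed by kernels launched on the current stream ...
# Tell the caching allocator about that consumer before freeing, so the
# block is not reused too early.
mp_shard.record_stream(torch.cuda.current_stream())
del mp_shard
```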
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91096
Approved by: https://github.com/rohan-varma
2022-12-19 19:45:34 +00:00
2a37ba8e81 [inductor] Add retry after benchmark test fails on CI (#90808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90808
Approved by: https://github.com/malfet
2022-12-19 18:10:55 +00:00
50ab2b702f move inputs to device on root module only (#91078)
1. There is no need to move inputs/activations to devices for every nested FSDP instance.
2. It also breaks the case where some nested FSDP instances have newly added inputs/activations in the signatures of the submodules they wrap; args_tuple[0] and kargs_tuple[0] are then not the correct way to get the inputs/activations for these nested instances.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91078
Approved by: https://github.com/mrshenli, https://github.com/rohan-varma
2022-12-19 17:49:05 +00:00
d6efd25d1e functionalization: check for undefined tensors in advanced indexing (#90791)
It looks like running code like `a[:, tensor_idx] = b` can result in:

(1) calling `index_put_()`
(2) passing (potentially undefined) tensors as the indices to index_put_() (a small example follows).
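Here is the small example referenced above, showing the indexing pattern that desugars into `index_put_`:
```python
import torch

a = torch.zeros(3, 4)
idx = torch.tensor([0, 2])
b = torch.ones(3, 2)
# Internally this becomes an index_put_ where the full-slice dimension is
# represented by an undefined (null) index tensor.
a[:, idx] = b
print(a)
```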

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90791
Approved by: https://github.com/ezyang
2022-12-19 16:11:06 +00:00
440a3f2398 fix set_() with functionalization (#90722)
This should fix https://github.com/pytorch/pytorch/issues/90573

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90722
Approved by: https://github.com/ezyang
2022-12-19 16:11:06 +00:00
548960f68e Replace TORCHINDUCTOR_TRACE with TORCH_COMPILE_DEBUG in documentation (#91011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91011
Approved by: https://github.com/mlazos, https://github.com/jansel, https://github.com/msaroufim
2022-12-19 14:45:27 +00:00
e5a48da664 Allow FSDP to have ignored modules out of wrapped root (#91079)
Motivations for this change:

1. TorchRec returns inconsistent results on `m.named_parameters()`
   and `m.m1.named_parameters()` if m1 is a `ShardedModule`. Basically,
   `ShardedModule` appears in `m.named_modules()`, but its parameters
   are not in `m.named_parameters()`. As a result, when we identify
   `ShardedModule` and pass them as `ignored_modules` to FSDP, FSDP
   complains about key error in `_get_ignored_params`.
2. If users are manually wrapping submodules with FSDP, it could be
   easier for them to keep a global set of ignored parameters, instead
   of creating a new collection for every FSDP invocation.

Given the above two reasons, we allow FSDP to have ignored modules
out of the wrapped root module.

Differential Revision: [D42132394](https://our.internmc.facebook.com/intern/diff/D42132394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91079
Approved by: https://github.com/awgu
2022-12-19 14:28:25 +00:00
6686e9bc07 [Quant] Add fused LinearTanh module for onednn backend (#88923)
**Summary**
This PR adds a fused `QLinearTanh` module for the onednn backend, which will be used for int8 inference with that backend. Calling this module with other quantization backends throws an error.

**Test plan**
python test_quantization.py TestStaticQuantizedModule

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88923
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2022-12-19 13:42:25 +00:00
731f417f60 Use scalar implementation to keep the precision in linspace of integral types (#89048)
Fixes #88652

In the CPU implementation of linspace for integral types, the `base` type in the vectorized implementation is `int64_t`, which drops precision when `base` comes from a floating-point number. Meanwhile, the vectorized implementation tends to suffer from catastrophic cancellation of floating-point arithmetic since both the `base (start + step * idx)` and the `step` are not exact. The scalar implementation is fine since start is always an integer and the result is truncated to an integer as well.

Therefore, in this PR we skip the vectorized implementation, since the vectorization does not contribute to performance here anyway. Now the behaviors of CPU and GPU are the same. In some cases the results match numpy's; in other cases they differ from numpy's, but the difference is not related to the device (CPU and GPU agree). https://github.com/pytorch/pytorch/issues/81996#issuecomment-1192980485
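For illustration, a call that exercises the integral-dtype path (the concrete values are made up; per the description, each element is computed with the scalar path and truncated to an integer, matching CUDA):
```python
import torch

# step = 10 / 6 is fractional, so the precision of the per-element
# computation matters before truncation to int64.
print(torch.linspace(0, 10, steps=7, dtype=torch.int64))
```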

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89048
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/albanD
2022-12-19 13:05:56 +00:00
f833880b2e Fix torch.distributed.run init connect timeout by comparing host with the current IP list (#90221)
Summary:
Pull Request: https://github.com/pytorch/pytorch/issues/79388

Fix torch.distributed.run init connect timeout by comparing `host` with the current IP list.

Test Plan: unit tests

Differential Revision: D41373962

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90221
Approved by: https://github.com/d4l3k
2022-12-19 12:58:23 +00:00
dfe916ca88 Dynamo comptime, with public ComptimeContext API (#90983)
This PR adds `@comptime`, a decorator that causes a given function to be executed at compile time while Dynamo is symbolically evaluating the program. To query the Dynamo state, we offer a public ComptimeContext API, which provides a limited set of methods for querying Dynamo's internal state. We intend for users to use this API and plan to keep it stable. Here are some things you can do with it:

* You want to breakpoint Dynamo compilation when it starts processing a particular line of user code: give comptime a function that calls breakpoint
* You want to manually induce a graph break for testing purposes; give comptime a function that calls unimplemented
* You want to perform a debug print, but you don't want to induce a graph break; give comptime a function that prints.
* You can print what the symbolic locals at a given point in time are.
* You can print out the partial graph the Dynamo had traced at this point.
* (My original motivating use case.) You want to add some facts to the shape env, so that a guard evaluation on an unbacked SymInt doesn't error with data-dependent. Even if you don't know what the final user API for this should be, with comptime you can hack out something quick and dirty. (This is not in this PR, as it depends on some other in flight PRs.)

Check out the tests to see examples of comptime in action.

In short, comptime is a very powerful debugging tool that lets you drop into Dynamo from user code, without having to manually jerry-rig pdb inside Dynamo to trigger after N calls.
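A hedged usage sketch (assuming the ComptimeContext methods described above; exact method names may differ):
```python
import torch
import torch._dynamo as dynamo
from torch._dynamo.comptime import comptime

def fn(x):
    y = x.sin()
    # The lambda runs at compile time while Dynamo traces fn, printing the
    # partial graph captured so far; it does not appear in the compiled code.
    comptime(lambda ctx: ctx.print_graph())
    return y.cos()

opt_fn = dynamo.optimize("eager")(fn)
opt_fn(torch.randn(3))
```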

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90983
Approved by: https://github.com/jansel
2022-12-19 11:06:01 +00:00
ec748cbecd inductor: separate onednn fx fusion from overriders.py (#90890)
fix https://github.com/pytorch/pytorch/issues/90851.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90890
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-19 09:32:37 +00:00
4bf22fcfe2 add mixed data type support for GroupNorm (#81852)
1. If the user runs bfloat16 models with amp, `torch.autocast` will
keep module parameters in the accumulation dtype, which leaves `gamma` and `beta`
in float while the input/output will be in bfloat16.

2. If the user explicitly casts the model to bfloat16,
the input/output and gamma/beta will all be in bfloat16 (a short illustration follows).
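The sketch below illustrates the two combinations (CPU bfloat16 support for the mixed case is what this PR adds; shapes are illustrative):
```python
import torch

gn = torch.nn.GroupNorm(4, 32)              # weight/bias (gamma/beta) stay float32
x_bf16 = torch.randn(8, 32, 16).to(torch.bfloat16)

# 1) mixed dtypes: bfloat16 input/output with float32 gamma/beta (the amp case)
y_mixed = gn(x_bf16)

# 2) fully bfloat16: parameters and activations all cast
y_bf16 = gn.to(torch.bfloat16)(x_bf16)
print(y_mixed.dtype, y_bf16.dtype)
```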

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81852
Approved by: https://github.com/jgong5, https://github.com/malfet
2022-12-19 07:59:40 +00:00
ea49e769f6 [Quant] Add fused linear-tanh op for onednn backend (#88879)
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused `linear-tanh` op for the `onednn` backend, which will be used for int8 inference with that backend. Linear-tanh is found in models like CGAN.
Calling this op with other quantization backends throws an error.

**Test Plan**
python test_quantization.py TestQuantizedLinear

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88879
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2022-12-19 07:55:30 +00:00
17d860d03e Type torch._inductor.graph (#90987)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90987
Approved by: https://github.com/albanD, https://github.com/jansel
2022-12-19 07:50:46 +00:00
3916d7a575 Apply modernize-use-emplace to aten, c10, torch (#91077)
Apply the clang-tidy check modernize-use-emplace. This is slightly more efficient by using an in-place constructor and is the recommended style in parts of the codebase covered by clang-tidy. This just manually applies the check to the rest of the codebase. Pinging @ezyang as this is related to my other PRs he reviewed like #89000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91077
Approved by: https://github.com/ezyang
2022-12-19 07:49:56 +00:00
944519a468 Switch use_fake_tensor to True by default (#89663)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89663
Approved by: https://github.com/anjali411, https://github.com/Morgan77523
2022-12-19 07:24:06 +00:00
ce4900f3bb [cuDNN][cuDNN V8 API] Fix benchmark_limit ignoring failed kernels in FIND (#91032)
Currently the `torch.backends.cudnn.benchmark_limit` setting ignores the validity/status of proposed cuDNN frontend execution plans because we do not know if they will complete successfully until execution is attempted. However, there are rare cases where the majority of execution plans fail and a fallback plan is needed (e.g., in the case of extremely small pointer alignment on the input tensors). If the limit is too small to include a working fallback plan, we currently bail out prematurely without checking the plans exhaustively.

The fix is to defer applying the `benchmark_limit` setting until we are sure that plans will execute successfully, but this requires changes to the cuDNN frontend timing function. This PR adds a hacked version of the cuDNN frontend timing function for now, with the intent that we can switch to the upstream cuDNN frontend implementation once this functionality is added.
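For context, the setting in question is a one-liner on the user side (the value shown is illustrative):
```python
import torch

torch.backends.cudnn.benchmark = True
# With the cuDNN v8 API, cap how many candidate execution plans are
# benchmarked during FIND; 0 means try every plan.
torch.backends.cudnn.benchmark_limit = 10
```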

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91032
Approved by: https://github.com/ngimel
2022-12-19 06:04:44 +00:00
856651dd55 Vectorize expm1 and log1p (#91074)
- Fix the UT to capture the operators that have been defined in `CppOverrides` but not in `CppVecOverrides`
- Vectorize `log1p` and `expm1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91074
Approved by: https://github.com/jansel
2022-12-19 05:07:39 +00:00
490c1cf650 [Dynamo] Support torch.get_default_dtype (#89790)
Fixes https://github.com/pytorch/torchdynamo/issues/1930
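A minimal example of the now-supported call inside a compiled function (the function itself is made up):
```python
import torch
import torch._dynamo as dynamo

@dynamo.optimize("eager")
def fn(x):
    # torch.get_default_dtype() can now be traced by Dynamo.
    if torch.get_default_dtype() == torch.float32:
        return x + 1
    return x - 1

print(fn(torch.zeros(2)))
```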

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89790
Approved by: https://github.com/soumith
2022-12-19 04:14:11 +00:00
1accd915a4 Re-enable optimizers (#90709)
Fixes
https://github.com/pytorch/pytorch/issues/90165
https://github.com/pytorch/torchdynamo/issues/328

Re-enables optimizer capture + compilation now that the dynamo slowdowns have been fixed

and it has speedups, numbers to come soon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90709
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/yanboliang
2022-12-19 04:07:41 +00:00
9ca41a986c [Quant][FX] Lower QLinearLeakyReLU for onednn backend (#88668)
**Summary**
Add quantization mappings for `QLinearLeakyReLU` for int8 inference with the onednn backend. The fusion and lowering are supported only in FX mode.

**Test plan**
python test_quantization.py TestQuantizeFx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88668
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2022-12-19 00:44:24 +00:00
8004f934cd Fix CSR with int32 indices to CSC conversion (#91061)
Fixes https://github.com/pytorch/pytorch/issues/91007
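For reference, the conversion in question looks like this (index values are illustrative):
```python
import torch

crow = torch.tensor([0, 2, 3], dtype=torch.int32)
col = torch.tensor([0, 1, 1], dtype=torch.int32)
vals = torch.tensor([1.0, 2.0, 3.0])
csr = torch.sparse_csr_tensor(crow, col, vals, size=(2, 2))
# CSR with int32 indices converted to CSC; this conversion was previously broken.
csc = csr.to_sparse_csc()
print(csc.ccol_indices(), csc.row_indices())
```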

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91061
Approved by: https://github.com/nikitaved
2022-12-18 13:53:25 +00:00
6be1e43367 [Checkpoint][Test] Add 2d DCP model state checkpoint test (save/load) (#91046)
Add test to test 2D checkpoint save/load functionality for model state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91046
Approved by: https://github.com/fduwjj
2022-12-18 08:20:33 +00:00
b72caf311d Introduce guardexpr, aot autograd guarding of duplicates into torch._guards (#90955)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90955
Approved by: https://github.com/ezyang
2022-12-18 03:05:47 +00:00
212873c615 Add dynamic shapes benchmark accuracy to CI (#90444)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90444
Approved by: https://github.com/voznesenskym
2022-12-17 11:17:20 +00:00
a1a2f548a9 [Composable API] Enable composable fully_shard submodules in replicate parent module (#90711)
To make sure `fully_shard` and `replicate` can work together, each implementation needs to check for the other. This change adds the check to `replicate()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90711
Approved by: https://github.com/mrshenli
2022-12-17 09:28:38 +00:00
3229713cf2 [Checkpoint][nit] Fix test_fsdp_optim_state.py test name (#90943)
Fix a test name that did not represent the actual test.

https://github.com/pytorch/pytorch/issues/90960
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90943
Approved by: https://github.com/fduwjj
2022-12-17 08:28:13 +00:00
e2377c8300 Revert "Add dynamic shapes benchmark accuracy to CI (#90444)"
This reverts commit 85db031e60d63cfdf5aaf8b30f54e01d56161a78.

Reverted https://github.com/pytorch/pytorch/pull/90444 on behalf of https://github.com/ezyang due to lint failing
2022-12-17 07:18:07 +00:00
85db031e60 Add dynamic shapes benchmark accuracy to CI (#90444)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90444
Approved by: https://github.com/voznesenskym
2022-12-17 06:39:45 +00:00
7c524221ba [reland3][dynamo] Revert "Revert "[reland][dynamo] use optimizers correctly in benchmar… (#90956)
…king (#87492)" (#90746)"

This reverts commit ff1bbc2773a31ab839438966266ed8ee206cb8c5.

This should be okay to merge now. The flakiness of HF models will be fixed by seeding the rng (https://github.com/pytorch/pytorch/pull/90936), and the numeric mismatch was root-caused to three decomps (still investigating why those decomps cause this); see https://github.com/pytorch/torchdynamo/issues/1985 for more detail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90956
Approved by: https://github.com/desertfire
2022-12-17 06:27:15 +00:00
78efde920e Revert "[inductor] add conv_transpose2d unary fusion for cpu in inference mode (#90265)"
This reverts commit d6fe9838d19a5dee60410b3b9212bf10a43105a4.

Reverted https://github.com/pytorch/pytorch/pull/90265 on behalf of https://github.com/ezyang due to earlier pr on stack got yanked, this one needs to go too
2022-12-17 05:07:59 +00:00
7b0ec67e34 [Quant][FX] Add backend config for onednn backend and fuse Linear-LeakyReLU (#88665)
**Summary**
Add a backend config for the onednn backend so that it can support more post-op fusions for int8 inference. The first fusion, `Linear - LeakyReLU`, is implemented based on previous PRs.

**Test plan**
python test_quantization.py TestFuseFx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88665
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2022-12-17 03:33:08 +00:00
bfa223aaa6 [Checkpoint] Fix checkpoint test test_fsdp_optim_state.py (#91036)
This PR:
1. Fix test/distributed/fsdp/test_fsdp_optim_state.py according to the change in the FSDP.flatten_sharded_optim_state_dict() API.
2. Update docstring accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91036
Approved by: https://github.com/fegin
2022-12-17 03:02:31 +00:00
1d948787b7 Remove duplicate line (#91006)
Two [nearly](https://github.com/pytorch/pytorch/pull/90927) [identical](https://github.com/pytorch/pytorch/pull/90948) PRs both got merged without any reported merge conflict? First time for everything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91006
Approved by: https://github.com/kit1980
2022-12-17 02:20:36 +00:00
f7b384cc46 [reland][quant][pt2e] Add early prototype top level quantize_pt2e APIs (#91035)
Summary:
This PR introduces the top level APIs for quantization support in PyTorch 2.0 Export stack
* torch.ao.quantization.quantize_pt2e.prepare_pt2e
Takes a model that is captured by the PyTorch 2.0 export (torchdynamo full graph mode) and prepares the model for calibration
for post training quantization

* torch.ao.quantization.quantize_pt2e.convert_pt2e
Takes a calibrated model and converts that to a reference quantized model that can be lowered later to quantized operator libraries or delegation modules

Also added a backend config for the qnnpack_pt2e backend:
* torch.ao.quantization.backend_config.get_qnnpack_pt2e_backend_config

Note: everything related to quantize_pt2e are experimental (prototype), and we don't have any bc guarantees

Test Plan:
python test/test_quantization.py TestQuantizePT2EModels

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91035
Approved by: https://github.com/HDCharles
2022-12-17 02:15:53 +00:00
4ab81ae80d fix default partitioner: save sizes instead of tensor for backward when possible (#91012)
This should fix hf_Longformer, AllenaiLongformerBase, and tacotron2 with dynamic shapes. Example repro:
```
TORCHDYNAMO_DYNAMIC_SHAPES=1 AOT_DYNAMIC_SHAPES=1 python benchmarks/dynamo/torchbench.py --accuracy --backend aot_eager --training --only hf_Longformer
```

used to fail with:
```
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 1024, 12, 513]], which is output 0
 of AsStridedBackward0, is at version 6; expected version 4 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient,
with torch.autograd.set_detect_anomaly(True).
```

The problem is that:

(1) when we have a tensor from the forward whose sizes are needed in the backward, we were saving the actual tensor for backward and grabbing the sizes off of it directly inside the backward graph (bad for perf)

(2) If that tensor happens to be a graph input that gets mutated, we end up with the above error. Autograd yells at you if you try to save a tensor for backward, and later mutate it.

I confirmed that this problem doesn't happen for the min cut partitioner.
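As a stand-alone illustration (not the partitioner itself) of the autograd rule behind the quoted error, mutating a tensor that was saved for backward bumps its version counter and backward() then fails:
```python
import torch

x = torch.randn(4, requires_grad=True)
y = x.clone()
z = y * y        # y is saved for backward by the multiplication
y.add_(1)        # in-place mutation of the saved tensor
try:
    z.sum().backward()
except RuntimeError as e:
    print(e)     # "... modified by an inplace operation ..."
```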

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91012
Approved by: https://github.com/ezyang
2022-12-17 02:06:10 +00:00
1609b954f8 Save and restore tracked_fakes (#90995)
This fixes BERT_pytorch and some other models.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90995
Approved by: https://github.com/voznesenskym
2022-12-17 01:36:35 +00:00
ed589dd8e4 [functorch] add composition-of-3-transform tests for autograd_function (#90962)
This PR adds the following OpInfo tests:
- vmap x vjp x vmap
- vjp x vmap x vmap
- vjp x vjp x vmap

These OpInfo tests only run for the autograd_function_db. In general,
testing composition of two transforms is sufficient to convince
ourselves that functorch works on a given operator.

The autograd.Function testing (especially the upcoming
generate_vmap_rule) didn't feel rigorous enough to me, so I added these
additional tests to convince myself.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90962
Approved by: https://github.com/samdow, https://github.com/soulitzer
2022-12-17 00:43:44 +00:00
e1c799ff82 Fix comment about get_fw_grad_mode() only being used in custom Function (#90790)
Addresses
https://github.com/pytorch/pytorch/pull/90240#issuecomment-1349596445

This was the only comment I found after grepping the codebase, but
please let me know if I missed others.

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90790
Approved by: https://github.com/soulitzer
2022-12-17 00:43:44 +00:00
ffa37c9fca Add VmapInterpreter.randomness (in pyfunctorch) provide it in info object (#90789)
This PR:
- adds VmapInterpreter.randomness. This returns the randomness option
the user provided in vmap(..., randomness=...)
- adds randomness in the info object passed to the vmap staticmethod of
autograd.Function. This is so that the user can handle random operations
on their own terms (if randomness="error", and if the autograd.Function
has random operations, then it is the user's responsibility to raise an
error).

Test Plan:
- updated unittest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90789
Approved by: https://github.com/samdow, https://github.com/soulitzer
2022-12-17 00:43:43 +00:00
8bd959e462 set -Winconsistent-missing-override for builds (#89851)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/89851).
* #89852
* __->__ #89851

set -Winconsistent-missing-override for builds

Summary: This has triggered internally on some PyTorch code.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89851
Approved by: https://github.com/malfet
2022-12-17 00:30:06 +00:00
93cb580677 lint transformer.py (#91048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91048
Approved by: https://github.com/ZainRizvi, https://github.com/kit1980, https://github.com/ezyang
2022-12-16 23:51:42 +00:00
5d70d12812 [dynamo] turn torch.backends.cudnn.is_acceptable into a constant (#90323)
Tracing `torch.backends.cudnn.is_acceptable(Tensor) -> bool:` fails with:

```
...
  File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/variables/functions.py", line 196, in call_function
    return super(UserFunctionVariable, self).call_function(tx, args, kwargs)
  File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/variables/functions.py", line 67, in call_function
    return tx.inline_user_function_return(
  File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 426, in inline_user_function_return
    result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 1698, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 1752, in inline_call_
    tracer.run()
  File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 485, in run
    and self.step()
  File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 455, in step
    getattr(self, inst.opname)(inst)
  File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 281, in wrapper
    return inner_fn(self, inst)
  File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 912, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/symbolic_convert.py", line 389, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/variables/torch.py", line 431, in call_function
    tensor_variable = wrap_fx_proxy(
  File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/variables/builder.py", line 662, in wrap_fx_proxy
    return wrap_fx_proxy_cls(
  File "/scratch/dberard/dynamo38/pytorch/torch/_dynamo/variables/builder.py", line 820, in wrap_fx_proxy_cls
    raise AssertionError(
AssertionError: torch.* op returned non-Tensor bool call_function <function is_acceptable at 0x7f00deefb790>
```

So instead, evaluate `is_acceptable()` and convert the result to a constant. The result of `is_acceptable(tensor) -> bool` depends on:
* dtype/device of the input tensor (this should already be guarded)
* properties of the build & whether cudnn is available
* some global state that gets initialized during the first call to `torch.backends.cudnn._init()` (this is NOT guarded in this PR)

Note: this fixes tts_angular with FSDP. This was an issue with FSDP because FSDP modules are interpreted as UnspecializedNNModules, and UnspecializedNNModules try to inline calls. In comparison, NNModules (e.g. when the tts_angular model is not wrapped in FSDP) do not inline calls and instead evaluate subsequent calls. In subsequent calls, cudnn.is_acceptable would be skipped by eval_frame.py:catch_errors because it is not in an allowlist.
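For illustration, a minimal sketch of the property the constant-folding relies on: calling `torch.backends.cudnn.is_acceptable` eagerly yields a plain Python bool, which can then be treated as a constant (this is not the actual dynamo internals):
```python
import torch

x = torch.randn(8, 8, device="cuda" if torch.cuda.is_available() else "cpu")
ok = torch.backends.cudnn.is_acceptable(x)   # returns a plain Python bool
print(type(ok), ok)
```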

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90323
Approved by: https://github.com/jansel
2022-12-16 23:26:54 +00:00
7d3f2b7902 Revert "add conv_transpose2d pointwise(unary) fusion kernel (#90264)"
This reverts commit 85698d0ac4686c10ba527f94724de61b4a856027.

Reverted https://github.com/pytorch/pytorch/pull/90264 on behalf of https://github.com/osalpekar due to build breakage on feed pytorch build package internally
2022-12-16 23:16:59 +00:00
7a0f29b776 Allow Process Group to support multiple backends (#88330) (#90997)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88330

### Implementation
Move backend-specific (NCCL, Gloo, etc) collective implementations to corresponding `Backend` class. Update ProcessGroup to support multiple backends and use dispatcher to calls backends based on tensor device type.

### Changes

#### c++ changes (ProcessGroup files, `Ops.cpp`, `init.cpp`)
- Update pybind definitions for new process group base class and new backend class
- Update pybinded backend class with collective definitions to keep BC with Python PG instances (e.g. `dist.ProcessGroupGloo`, `dist.ProcessGroupNCCL`) which are used in tests
- Switch `ProcessGroupGloo`, `ProcessGroupNCCL`, `ProcessGroupMPI`, `ProcessGroupUCC` to derive from the `Backend` class.
- Update CPU/CUDA `Ops.cpp` and `OpsImpl.cpp` to perform this dispatching by querying the backend using the device type
- Update internal dispatched implementation of `barrier` to use a tensor which allows operation to be dispatched.
- Update `allgather` collective to use `TensorList`. For some reason it was using the default implementation of `allgather` rather than dispatching it correctly. I still don't understand why and had originally filed an issue in 85122.

#### python changes (`distributed_c10d.py`, test files)
- Add BackendConfig class to specify the default configurations of backends and `get_backend_config()` API
- `get_backend()` deprecation warning
- `init_process_group` now returns a generic `ProcessGroup` object; it contains a list of backends (the ones stated above) to which it will dispatch operations.
- `new_group` updated to return the same as above
- Update `test_c10d_gloo.py`, Update `DistributedDataParallelTest` to use `init_process_group`, Update `ReducerTest`, update `test_broadcast_coalesced_gloo` to move from PG instance and gloo options
- Update `test_c10d_nccl.py`, Update `DistributedDataParallelTest` to use `init_process_group`
- Specific tests updated: `test_Backend_enum_class`

### Changes missing
- lazy initialization of backends
- support parsing of BackendConfig

### open questions
- Pure Python PG extensions (https://github.com/pytorch/pytorch/pull/66338)

# Example

This is a basic script (using 2 backends within a process group)

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 basic_scenario.py
import torch.distributed as dist
import torch
import os

if __name__ == "__main__":
    rank = os.environ.get("RANK")
    # initialize with both gloo and nccl
    dist.init_process_group()
    # with gloo
    dist.all_reduce(torch.tensor([1.0]))
    print(f"Rank {rank} finished")
    # with nccl
    dist.all_reduce(torch.tensor([1.0], device=f"cuda:{rank}"))
```

Test Plan: Imported from OSS

Differential Revision: D42069829

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90997
Approved by: https://github.com/awgu, https://github.com/fduwjj
2022-12-16 23:15:00 +00:00
93ac8c4aeb [dynamo] Refactor how autocast parameters are binded (#90953)
Summary: Use `inspect.signature` for unified args handling
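As a hedged illustration of the `inspect.signature`-based binding idea (this binds arguments against `torch.autocast.__init__` directly and is not the actual dynamo code; the parameter names are whatever that constructor exposes):
```python
import inspect
import torch

sig = inspect.signature(torch.autocast.__init__)
# Bind positional and keyword arguments uniformly; None stands in for `self`.
bound = sig.bind(None, "cuda", dtype=torch.float16)
bound.apply_defaults()
print(dict(bound.arguments))
```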

Test Plan: `test_dynamo`

Differential Revision: D42078621

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90953
Approved by: https://github.com/brad-mengchi
2022-12-16 23:12:49 +00:00
4fa8d774b8 Add macro C10_AS_INTARRAYREF_SLOW (#90675)
This makes it easier to narrow down who is throwing the error,
instead of having to use gdb.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: [D42088781](https://our.internmc.facebook.com/intern/diff/D42088781)
2022-12-16 15:10:35 -08:00
ba7aeac37b Revert "[cuDNN][cuDNN V8 API] (re-re-open) cuDNN V8 API on by default (#89022)"
This reverts commit eecd621f06d97d51072d924749a5d54b081295a0.

Reverted https://github.com/pytorch/pytorch/pull/89022 on behalf of https://github.com/ngimel due to breaks some convolution configurations #91025
2022-12-16 23:06:35 +00:00
4438b019a8 Fix non-existing parameters in docstrings in torch/ao (#90875)
This is a continuation of https://github.com/pytorch/pytorch/pull/90505

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90875
Approved by: https://github.com/clee2000
2022-12-16 22:34:33 +00:00
ee2475869c ModuleInfo-based tests for AOTAutograd (#90980)
Adds a set of generated tests for `AOTAutograd` using the `ModuleInfo` db, analogous to the `OpInfo`-based tests. Includes the following changes:

* Adds a `TestEagerFusionModuleInfo` test class, with both symbolic and non-symbolic tests, just like the OpInfo tests.
    * Test logic "functionalizes" the module under test and calls into the now-factored-out verification logic the OpInfo tests use to compare compiled vs. non-compiled function outputs / grads.
* Adds a `decorateForModules(decorator, module_set)` utility to `test/functorch/common_utils.py` to handle xfails, skips, etc. The pre-existing logic is specific to ops, and I didn't want to duplicate all that, so I kept additions minimal with this function.
    * Bunch of xfails to get everything passing; haven't looked deeply into all these yet. #90500 is relevant for the RNN failures.
* Fixes a bug in the `ModuleInfo` entry for `NLLLoss` to ensure sample input has the requested `requires_grad` setting (was causing spurious test failures).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90980
Approved by: https://github.com/ezyang
2022-12-16 21:43:34 +00:00
3226209636 LSTM SymInt-aware changes & meta registration (cuDNN) (#90944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90944
Approved by: https://github.com/ezyang
2022-12-16 21:42:32 +00:00
512ec181ec Introduce causal mask (#90508)
Summary: Introduce causal mask

This PR introduces a causal mask option _causal_mask (as well as causal mask detection if attn_mask is provided), since current custom kernels do not support arbitrary masks.
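For reference, the mask structure involved here is a lower-triangular (causal) pattern; the snippet below is only an illustration of that shape, not the kernel or the detection code:
```python
import torch

seq_len = 5
# Lower-triangular boolean pattern: position i may attend to positions <= i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal)
```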

Test Plan: sandcastle & github ci/cd

Differential Revision: D41723137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90508
Approved by: https://github.com/albanD
2022-12-16 21:39:42 +00:00
e689c50922 Don't recompute var in bn decomp (#90984)
Fixes https://github.com/pytorch/torchdynamo/issues/1988
Repeated `var` computation is not CSE'd for some reason.
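A minimal sketch of the idea (not the actual decomposition code): compute `var` once and reuse it, rather than relying on CSE to deduplicate repeated computations:
```python
import torch

def batch_norm_sketch(x, eps=1e-5):
    mean = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)  # computed a single time, reused below
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return x_hat, mean, var

out, m, v = batch_norm_sketch(torch.randn(8, 4))
```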

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90984
Approved by: https://github.com/Chillee
2022-12-16 21:38:49 +00:00
e4de6ed6bb functorch: non-contig samples for test_grad (#90990)
Ref: https://github.com/pytorch/functorch/issues/1029

Before PR: (Time: ~30s)
```
================================================= 1052 passed, 264 skipped, 17373 deselected, 9 xfailed in 29.09s =================================================
```

After PR: (Time: ~43s)
```
================================================ 1042 passed, 264 skipped, 17373 deselected, 19 xfailed in 43.13s =================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90990
Approved by: https://github.com/zou3519
2022-12-16 21:27:44 +00:00
5ea418bf63 [FSDP][3/N] Move fsdp_modules(root_only=True) -> _get_fsdp_root_states() (#90862)
- This PR introduces `_get_fsdp_root_states(state: _FSDPState, module: nn.Module)` to return all states that are FSDP root in the module tree rooted at `module`.
   - This requires passing in both `state` and `module` because it must call `_lazy_init()` to check for root-ness, which requires that signature.
- This PR moves the one internal usage of `FullyShardedDataParallel.fsdp_modules(root_only=True)` to use `_get_fsdp_root_states()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90862
Approved by: https://github.com/rohan-varma
2022-12-16 21:27:27 +00:00
67ef88af37 Revert "[Quant] onednn backend switch to ideep new api without affacting performance (#90354)"
This reverts commit 9b89ff0923251d2a30ceccf61120d051a687557c.

Reverted https://github.com/pytorch/pytorch/pull/90354 on behalf of https://github.com/osalpekar due to Breaking core pytorch contbuilds internally with function not found errors- more details in D42081737
2022-12-16 21:15:22 +00:00
7a683eaeb8 aot_autograd: add assert for functional-only graph (#88816)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88816
Approved by: https://github.com/ezyang, https://github.com/ngimel
2022-12-16 21:04:36 +00:00
c83ff1ea08 [GHA][BE] Update to newer checkout action (#90969)
This one uses node-16 so it would not spew that many warnings

Also, change `build` to `test` in `_binary_test_linux` to fix https://github.com/pytorch/pytorch/issues/83044

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90969
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi
2022-12-16 20:56:29 +00:00
bd94ee66ea [quantized] [executorch] typo (#89960)
Summary: Inefficient impl in python

Test Plan: buck2 test mode/dev //caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_quantized_embedding_byte (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'

Differential Revision: D41627744

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89960
Approved by: https://github.com/jerryzh168
2022-12-16 19:49:09 +00:00
68805b565a Include dispatch key in wrapper symbol name (#90674)
When looking at gdb traces, this makes it easier to tell that
you're looking at the CPU wrapper vs CUDA wrapper, etc.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D42088744](https://our.internmc.facebook.com/intern/diff/D42088744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90674
Approved by: https://github.com/ngimel, https://github.com/malfet
2022-12-16 19:36:32 +00:00
6bc6fb21db Revert "[reland2][dynamo] Revert "Revert "[reland][dynamo] use optimizers correctly in benchmar… (#90956)"
This reverts commit 8bc38ae4e2037ae42813d552e5d412db77167bc0.

Reverted https://github.com/pytorch/pytorch/pull/90956 on behalf of https://github.com/desertfire due to Causing TIMM model failures
2022-12-16 19:28:05 +00:00
8cd1808dbf [FSDP] Introduce "fully sharded module"; remove comm. module (#90933)
This PR removes the "communication module" (comm. module / `comm_module`) concept from the FSDP code base since it causes disproportionate confusion compared to its benefit for now.

Instead, we introduce the term "fully sharded module" as the single concept to unify the wrapper and non-wrapper code paths. The definition is presented in a note at the top of `flat_param.py`. I reproduce it here:

---
We define the **"fully sharded module"** to be the original `nn.Module` that owns a `FlatParamHandle`. It is the *single* module logically responsible for the *single* unshard/reshard pair for the handle's `FlatParameter` for a given forward or backward pass. The fully sharded module should be passed to the `FlatParamHandle` constructor.

For the wrapper code path:
- The `FullyShardedDataParallel` module wrapping the fully sharded module runs the unshard/reshard on behalf of the fully sharded module by overriding `nn.Module.forward`.
- The fully sharded module is exactly the module passed to the `FullyShardedDataParallel` constructor's `module` argument and is saved in `_fsdp_wrapped_module`.

For the non-wrapper code path:
- Hooks registered on the fully sharded module run the unshard/reshard.
- The fully sharded module may either be the direct argument to `fully_shard` or a submodule chosen by the provided wrapping policy.
---

After this PR, `handle.flat_param._fqns`, `_param_infos`, and `_shared_param_infos` all prefix names from the same module, namely the fully sharded module. This should make state dict less confusing.

---
As an example, consider:
```
mod: Module(
  sub1: Submodule(
    subsub1: Subsubmodule(),
    subsub2: Subsubmodule(),
  ),
  sub2: Submodule(
    subsub1: Subsubmodule(),
    subsub2: Subsubmodule(),
  ),
)
```
For wrapper FSDP manual wrap:
```
mod.sub1 = FSDP(mod.sub1)
mod.sub2 = FSDP(mod.sub2)
mod = FSDP(mod)
```
For wrapper FSDP auto wrap:
```
mod = FSDP(mod, auto_wrap_policy=ModuleWrapPolicy({Submodule}))
```
(WIP) For non-wrapper FSDP manual wrap:
```
fully_shard(mod.sub1)
fully_shard(mod.sub2)
fully_shard(mod)
```
For non-wrapper FSDP auto wrap:
```
fully_shard(mod, policy=ModuleWrapPolicy({Submodule}))
```
The fully sharded modules **in all cases** are `mod`, `mod.sub1`, and `mod.sub2`; notably, the `subsub1` and `subsub2` modules are not fully sharded modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90933
Approved by: https://github.com/rohan-varma
2022-12-16 18:45:52 +00:00
b0cda0b38c LSTM SymInt-aware changes & meta registration (non-cuDNN CUDA) (#90701)
Adds meta registrations for cuDNN and vanilla CUDA ops underneath `lstm()` and makes the logic SymInt-aware.
TODO:
* cuDNN side does some [nasty stuff](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cudnn/RNN.cpp#L1567) with buffers; this needs larger redesign to figure out
* Indicate that AOT Autograd can be used when an LSTM is present (remove the check for this once it's fully supported)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90701
Approved by: https://github.com/ezyang
2022-12-16 18:08:45 +00:00
a10b3ce876 generate device context managers in inductor code (#90934)
Fixes https://github.com/pytorch/torchdynamo/issues/1717, https://github.com/pytorch/torchdynamo/issues/1990

<s>TODO: add test with multiple devices, figure out extra context initialization</s>

Problems:
<s>It still initializes context on 0-th device that it shouldn't, I'll take a look where that happens and fix before landing</s>
It adds a Python device context manager that is absurdly slow and takes ~2.5 us (should be nanoseconds). That's not a problem for real models, because it'll be called just once, but it is a bit of an inconvenience for microbenchmarking; we should make that context manager more performant (won't fix in this PR).
It can still have bugs for graphs that run on multiple devices and can have buffers incorrectly shared between multiple devices by memory reuse; if that happens, it'll need to be solved separately.

Generated code:
```
def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    with torch.cuda.device(1):
        buf0 = empty_strided((4, ), (1, ), device='cuda', dtype=torch.float32)
        stream1 = get_cuda_stream(1)
        triton_fused_div_0.run(arg0_1, arg1_1, buf0, 4, grid=grid(4), stream=stream1)
        del arg0_1
        del arg1_1
        return (buf0, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90934
Approved by: https://github.com/wconstab
2022-12-16 18:03:39 +00:00
9d8fa78d2c tools: Update clang-tidy hash (#91008)
clang-tidy was updated due to a BE change (https://github.com/pytorch/test-infra/pull/1309) through an [automated github action](https://github.com/pytorch/test-infra/actions/runs/3713717677), causing failures since the s3 hash is hardcoded here; this PR updates the hash to the latest version.

To resolve failures like: ([logs](https://github.com/pytorch/pytorch/actions/runs/3714626185/jobs/6298779282#step:5:81))

```
INFO: Downloaded clang-tidy successfully.
WARNING: Found binary hash does not match reference!

Found hash: e4a1537ee997aa486a67bcc06d050b1aa6cfb14aa3073c08f19123ac990ab2f7
Reference hash: 49343a448fcb75cd1e0fb9d6b1f6c2ef4b008b6f91d6ff899d4ac6060f5e52a5

Deleting .lintbin/clang-tidy just to be safe.

CRITICAL: Downloaded binary clang-tidy failed its hash check
CRITICAL: Unable to initialize clang-tidy
error:        lint initializer for 'CLANGTIDY' failed with non-zero exit code
```

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91008
Approved by: https://github.com/malfet, https://github.com/huydhn
2022-12-16 17:28:02 +00:00
01e7f46215 Ensure sorted indices from the CSR->BSR conversion (#90918)
Fixes https://github.com/pytorch/pytorch/issues/90910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90918
Approved by: https://github.com/cpuhrsch
2022-12-16 15:49:48 +00:00
634555d981 [ONNX] Auto test based on OpInfo (#86182)
This change introduces a mechanism to test onnx export based on sample inputs registered in OpInfo, similar to how MPS and other components of pytorch are tested. It provides test coverage on ops and dtypes previously unattainable with manually created test models. This is the best way for us to discover gaps in the exporter support, especially for ops with partial existing support.

This test is adapted from https://github.com/pytorch/pytorch/blob/master/test/test_mps.py

This PR also

- Update sqrt to support integer inputs to match pytorch behavior
- Add pytest-subtests for unittest subtests support in the new test file

I only enabled very few ops: `t`, `ceil` and `sqrt` because otherwise too many things will fail due to (1) unsupported dtypes in the exporter (2) unimplemented dtype support in onnxruntime (3) unexpected input to verification.verify.

Subsequent PRs should improve `verification.verify` first for it to accept any legal input to a pytorch model, then incrementally fix the symbolic functions to enable more test cases.
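For a flavor of the per-op coverage this enables, here is a minimal export of one of the enabled ops (`sqrt`); the real test draws sample inputs from OpInfo and checks the result against onnxruntime via `verification.verify`:
```python
import io
import torch

class SqrtModule(torch.nn.Module):
    def forward(self, x):
        return torch.sqrt(x)

buf = io.BytesIO()
torch.onnx.export(SqrtModule(), (torch.rand(3, 4),), buf)
print(len(buf.getvalue()) > 0)  # a serialized ONNX graph was produced
```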

Fixes #85363
Design #88118
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86182
Approved by: https://github.com/BowenBao
2022-12-16 14:43:41 +00:00
8bc38ae4e2 [reland2][dynamo] Revert "Revert "[reland][dynamo] use optimizers correctly in benchmar… (#90956)
…king (#87492)" (#90746)"

This reverts commit ff1bbc2773a31ab839438966266ed8ee206cb8c5.

This should be okay to merge now. The flakiness of HF models will be fixed by seeding the rng (https://github.com/pytorch/pytorch/pull/90936), and the numeric mismatch was root-caused to three decomps (still investigating why those decomps cause this) see https://github.com/pytorch/torchdynamo/issues/1985 for more detail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90956
Approved by: https://github.com/desertfire
2022-12-16 13:33:38 +00:00
c2c14f9597 Sparse compressed mm: fix for orthogonal inputs (#90917)
Fixes https://github.com/pytorch/pytorch/issues/90836
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90917
Approved by: https://github.com/cpuhrsch
2022-12-16 13:08:00 +00:00
4dd3de23dd Sparse compressed mm: fix for empty inputs (#90763)
Fixes [#90693](https://github.com/pytorch/pytorch/issues/90693)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90763
Approved by: https://github.com/cpuhrsch
2022-12-16 12:33:57 +00:00
3e44fcee2f [FSDP][2/N] Move fsdp_modules(root_only=False) -> _get_fsdp_states() (#90861)
This PR migrates all internal usages of `FullyShardedDataParallel.fsdp_modules(root_only=False)` to `_get_fsdp_states()`. This is to unify the code paths for composable and wrapper FSDP.

This PR _does not_ change the usages in test files. This is because we should revisit those usages separately as a way to track which functionality for which we have not tested composable FSDP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90861
Approved by: https://github.com/rohan-varma
2022-12-16 12:21:47 +00:00
673c25d45a [FSDP][Easy] Rename entry -> fsdp_module to be more descriptive (#90864)
I started refactoring unit tests to use `_get_fsdp_states()` instead of `FullyShardedDataParallel.fsdp_modules()` but realized we should not do that for now. This is just a change I made while doing that. `entry` is not descriptive. Let us explicitly say `fsdp_module`. `for fsdp_module in FSDP.fsdp_modules(module)` is a proper idiom.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90864
Approved by: https://github.com/rohan-varma
2022-12-16 12:16:08 +00:00
95ee5fecb1 [FSDP][1/N] Add _get_fsdp_states() (#90860)
- This PR introduces `_get_fsdp_states(module: nn.Module) -> List[_FSDPState]` to prepare for `fully_shard` manual "wrapping".
    - ~~I place it in `_runtime_utils.py`, not `_common_utils.py`, because in a follow-up PR, I will add `_get_root_fsdp_states()`, which requires `_lazy_init()`. I concluded that it would be preferred to have both of these getters be in the same place than to have them split, even if that means that `_get_fsdp_states()` is in `_runtime_utils.py`.~~ Due to circular import issues, I think I should still put it in `_common_utils.py`.
- This PR changes `FullyShardedDataParallel.fsdp_modules()` to be backed by `_get_fsdp_states()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90860
Approved by: https://github.com/rohan-varma
2022-12-16 12:15:42 +00:00
06533a2eb7 [Inductor] actually check replacements in AutogradMonkeypatch (#90901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90901
Approved by: https://github.com/ezyang
2022-12-16 11:54:02 +00:00
9d79d09b6e Make it easier to find troubleshooting steps (#90927)
People's general tendency is to read from top to bottom. Leverage that at the right moment to help them realize that there's a troubleshooting section they can use if they get stuck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90927
Approved by: https://github.com/ZainRizvi
2022-12-16 11:04:46 +00:00
ad1b04c4a9 Revert "[reland][quant][pt2e] Add early prototype top level quantize_pt2e APIs (#90971)"
This reverts commit 7dd5e554971411cbb50fc2eb157057c1e8a0de63.

Reverted https://github.com/pytorch/pytorch/pull/90971 on behalf of https://github.com/ezyang due to still broke tons of master jobs sorry
2022-12-16 09:29:39 +00:00
ddf5b68dcb Nuttall window (#90103)
Relates #85366
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90103
Approved by: https://github.com/lezcano
2022-12-16 09:05:53 +00:00
53e71fad8f Add shape_env guards to tracing context (#90876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90876
Approved by: https://github.com/Chillee, https://github.com/ezyang
2022-12-16 09:05:05 +00:00
a01c1ee594 [ao] making _is_activation_post_process private with BC (#90554)
The same function existed in observer and quantize; it is consolidated into a single function.

note: this is a recreation of D40709276, which caused several breakages due to not maintaining BC for models with cached code that calls the old function name

Differential Revision: [D41793604](https://our.internmc.facebook.com/intern/diff/D41793604/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41793604/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90554
Approved by: https://github.com/jcaip
2022-12-16 08:09:33 +00:00
6ea93b2295 [Quant] Add fused LinearLeakyReLU module for onednn backend (#88661)
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused `QLinearLeakyReLU` module for the onednn backend, which will be used for int8 inference with the onednn backend. This module cannot be called with other quantization backends; otherwise an error is thrown.

**Test plan**
python test_quantization.py TestStaticQuantizedModule

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88661
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2022-12-16 07:28:13 +00:00
ffd0b15a49 Add support for keep-going label (#90902)
This makes run_test.py keep going even on failure.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90902
Approved by: https://github.com/malfet, https://github.com/huydhn
2022-12-16 06:47:06 +00:00
c6cba1865f [Docker] Install Triton deps (#90841)
Triton needs a working gcc, so install one from apt
Also, copy `ptxas` and `cuda.h` from conda to `/usr/local/cuda`
Add `torchaudio` to the matrix
Fix typo in workflow file

Fixes https://github.com/pytorch/pytorch/issues/90377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90841
Approved by: https://github.com/ngimel
2022-12-16 06:35:43 +00:00
7dd5e55497 [reland][quant][pt2e] Add early prototype top level quantize_pt2e APIs (#90971)
Summary:
This PR introduces the top level APIs for quantization support in PyTorch 2.0 Export stack
* torch.ao.quantization.quantize_pt2e.prepare_pt2e
Takes a model that is captured by the PyTorch 2.0 export (torchdynamo full graph mode) and prepares the model for calibration
for post training quantization

* torch.ao.quantization.quantize_pt2e.convert_pt2e
Takes a calibrated model and converts that to a reference quantized model that can be lowered later to quantized operator libraries or delegation modules

Also added a backend config for the qnnpack_pt2e backend:
* torch.ao.quantization.backend_config.get_qnnpack_pt2e_backend_config

Note: everything related to quantize_pt2e is experimental (prototype), and we don't have any BC guarantees

Test Plan:
python test/test_quantization.py TestQuantizePT2EModels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90971
Approved by: https://github.com/HDCharles
2022-12-16 06:24:28 +00:00
e48c91688b DebugInterpreter works with symbolic shapes now, plus test (#90913)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90913
Approved by: https://github.com/voznesenskym
2022-12-16 05:22:56 +00:00
67436f621a Add utility for binding symbols based on arguments passed to placeholders (#90912)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90912
Approved by: https://github.com/voznesenskym
2022-12-16 05:22:56 +00:00
bbea58d500 Stop using GraphArgs for shape env guard source tracking (#90911)
GraphArgs worked fairly well, but it was still missing sources
sometimes.  Now, we maintain an auxiliary data structure which we
MUST populate whenever we fakeify a tensor / allocate a bare SymInt.
This should guarantee once and for all that every symbol is available.
Should fix swin_base_patch4_window7_224.

While I was at it, I moved fakeification utility back to builder
as it was only used at once call site.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90911
Approved by: https://github.com/voznesenskym
2022-12-16 05:22:56 +00:00
eef019c14a Lint rule to forbid direct use of logging.info/etc APIs (#90907)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90907
Approved by: https://github.com/jansel
2022-12-16 05:13:51 +00:00
82a191313e Revert "Add support for keep-going label (#90902)"
This reverts commit 855f4b7d2470a349a0b61c5d20e3eb21414a5fb5.

Reverted https://github.com/pytorch/pytorch/pull/90902 on behalf of https://github.com/huydhn due to This change breaks trunk where, unlike PR, there is no label
2022-12-16 05:07:49 +00:00
2f6ada84b4 [inductor] Remove flag of bmm's dim m and n in shape padding (#90937)
Summary: There was an OOM issue in two internal models when turning on padding of bmm's dims m and n with the shape padding optimization, so a flag was added to turn it on/off for the internal models. The issue is gone now, so the flag is being removed.

Differential Revision: D42074557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90937
Approved by: https://github.com/ngimel
2022-12-16 04:29:12 +00:00
5e3bc1975b Add any_chain() in upstream (#90949)
Summary: I need an "any" chain; the current chain is a logical AND.

Test Plan: arc lint, follow-up diffs use it.

Differential Revision: D42078837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90949
Approved by: https://github.com/angelayi
2022-12-16 04:09:10 +00:00
855f4b7d24 Add support for keep-going label (#90902)
This makes run_test.py keep going even on failure.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90902
Approved by: https://github.com/malfet, https://github.com/huydhn
2022-12-16 04:03:52 +00:00
4372dbb89f use pytree to allow any input format for cuda graph (#90941)
Summary:
1. use pytree to allow any input format for make_graphed_callables
2. add allow_unused_input argument for make_graphed_callables

Test Plan: buck2 test mode/dev-nosan  //caffe2/test:cuda --  --print-passing-details

Differential Revision: D42077976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90941
Approved by: https://github.com/ngimel
2022-12-16 03:01:47 +00:00
d9d263efb9 Revert "[Quant] Add fused LinearLeakyReLU module for onednn backend (#88661)"
This reverts commit 353c2e7d39c2c4d0c3e1b8c4d7338e19c7b02f57.

Reverted https://github.com/pytorch/pytorch/pull/88661 on behalf of https://github.com/Xia-Weiwen due to This is breaking tests. Need to rebase.
2022-12-16 02:58:26 +00:00
d3e0bcc796 pin multipy (#90942)
Pins multipy to prevent breakages in the torch CI due to multipy changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90942
Approved by: https://github.com/huydhn
2022-12-16 02:49:39 +00:00
d8c1872cc3 Make it easier to find troubleshooting steps (#90948)
People's general tendency is to read from top to bottom. Leverage that at the right moment to help them realize that there's a troubleshooting section they can use if they get stuck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90948
Approved by: https://github.com/soumith, https://github.com/ZainRizvi
2022-12-16 02:13:28 +00:00
9d523616b3 fix segfault for EmbeddingBag on CPU slow path when include_last_offset is true (#90358)
This PR fixes the segfault reported at https://github.com/pytorch/pytorch/issues/89677; this is a `double free` issue caused by an `invalid read`.

The reported issue broke in the slow path for `EmbeddingBag` on float32, at [EmbeddingBag.cpp#L451](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/EmbeddingBag.cpp#L451)

The root cause is that `add_indices` has an index which exceeds the range of `output_data` for the reported case.

The offsets are given as
```
{0,  6, 12, 15, 25, 32, 40, 42, 46, 53, 53}
```

The `indices` has 55 elements and `offsets[-1] != indices.size(0)`.

When `include_last_offset` is true, the `output` will be in the shape of {offsets.size(0) - 1, weight.sizes()[1]}, which will be {10, 5}.
Originally, `add_indices` will be (I re-arrange the 1D tensor by rows, so here 10 rows in total)
```
### this is 55 elements
  0 0 0 0 0 0
  1 1 1 1 1 1
  2 2 2
  3 3 3 3 3 3 3 3 3 3
  4 4 4 4 4 4 4
  5 5 5 5 5 5 5 5
  6 6
  7 7 7 7
  8 8 8 8 8 8 8
  10 10
```
The last row has index of 10 which is out of range of output tensor whose size is [10, 5].

The reason is that `make_offset2bag` at [EmbeddingBag.cpp#L66](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/EmbeddingBag.cpp#L66) would give the following `offset2bag`:
```
### this is 55 + 1 elements:
0 0 0 0 0 0 1
0 0 0 0 0 1
0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 1
0 0 0 0 0 0 0 1
0 1
0 0 0 1
0 0 0 0 0 0 2
0 0
```

Notice that for index 53, it is added twice.

The fix is to ignore the last index from `offsets` when `include_last_offset` is true; this behavior also aligns with CUDA (see https://github.com/pytorch/pytorch/pull/57208#issuecomment-1021727378).
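A minimal sketch of the failing input pattern described above (offsets taken from the report, `offsets[-1] != indices.size(0)`, with `include_last_offset=True`); the weight values are arbitrary:
```python
import torch
import torch.nn.functional as F

weight = torch.randn(20, 5)                  # float32, CPU slow path
indices = torch.randint(0, 20, (55,))        # 55 indices, as in the report
offsets = torch.tensor([0, 6, 12, 15, 25, 32, 40, 42, 46, 53, 53])
out = F.embedding_bag(indices, weight, offsets, mode="sum", include_last_offset=True)
print(out.shape)  # torch.Size([10, 5]); before the fix this pattern could corrupt memory
```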

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90358
Approved by: https://github.com/ezyang
2022-12-16 02:08:14 +00:00
353c2e7d39 [Quant] Add fused LinearLeakyReLU module for onednn backend (#88661)
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused `QLinearLeakyReLU` module for the onednn backend, which will be used for int8 inference with the onednn backend. This module cannot be called with other quantization backends; otherwise an error is thrown.

**Test plan**
python test_quantization.py TestStaticQuantizedModule

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88661
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2022-12-16 01:54:10 +00:00
750576a50a Revert "Include dispatch key in wrapper symbol name (#90674)"
This reverts commit e87370133cb839a2c934eeafb002dbe8c1190f1a.

Reverted https://github.com/pytorch/pytorch/pull/90674 on behalf of https://github.com/osalpekar due to executorch breakage internally, more details in [D42051698](https://www.internalfb.com/diff/D42051698)
2022-12-16 01:05:57 +00:00
f660d62ddc Make dynamo.export preserve user input/output format (#90884)
Currently, dynamo flattens the user input, so when the user reuses the input they used for tracing, the exported graph wouldn't work, as it would expect flat args. This PR changes this behaviour by explicitly wrapping the dynamo-produced graph with the correct user input/output format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90884
Approved by: https://github.com/zhxchen17, https://github.com/voznesenskym
2022-12-16 00:57:09 +00:00
31b8dc7542 Revert "[JIT] Frozen Graph Linear-BatchNormNd Folding (#86706)"
This reverts commit e585156c59767ff13306a31d8c31ffe7a33439dc.

Reverted https://github.com/pytorch/pytorch/pull/86706 on behalf of https://github.com/davidberard98 due to possibly causing internal build failures, will revert and investigate later
2022-12-16 00:49:54 +00:00
535b0e37dd Suppress RecursionError in sympy; fix logging (#90904)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90904
Approved by: https://github.com/Chillee
2022-12-16 00:34:24 +00:00
140a3139d6 Revert "Add macro C10_AS_INTARRAYREF_SLOW (#90675)"
This reverts commit 8090cb5386dccf4cf341aea585c793dfbb6c6002.

Reverted https://github.com/pytorch/pytorch/pull/90675 on behalf of https://github.com/osalpekar due to broke internal acc_tensor implementation in training_platform contbuild. See [D42052101](https://www.internalfb.com/diff/D42052101) for details.
2022-12-16 00:30:50 +00:00
9259933edd [ao][fx] fixing public v private prepare.py (#88398)
Summary: made _DO_NOT_OBS_DTYPE_LIST, _add_matched_node_name_to_set,
_get_arg_target_is_dynamic_as_input_to_node,
_get_arg_target_dtype_as_input_to_node,
_get_arg_target_dtype_as_output,
_get_target_activation_dtype_for_node,
_get_standalone_module_configs,
_insert_observer,
_is_activation_post_process_node,
_is_input_arg_dtype_supported_by_backend,
_is_observer_in_same_graph,
_is_output_dtype_supported_by_backend,
_maybe_insert_input_equalization_observers_for_node,
_maybe_insert_input_observer_for_arg_or_kwarg,
_maybe_insert_input_observers_for_node,
_maybe_insert_observers_before_graph_output,
_maybe_insert_output_observer_for_node,
_maybe_make_input_output_share_observers,
_maybe_propagate_dtype_for_node,
_qat_swap_modules,
_remove_output_observer,
_run_prepare_fx_on_standalone_modules,
_save_state,
_swap_custom_module_to_observed private

Test Plan: python test/test_public_bindings.py

Differential Revision: [D41015542](https://our.internmc.facebook.com/intern/diff/D41015542)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88398
Approved by: https://github.com/jcaip
2022-12-16 00:30:41 +00:00
f3da157ce3 Reset rng in hf before loading a model (#90936)
Reset the rng in hf before generating inputs and loading the model; this makes the huggingface inputs+weights deterministic depending on the seed of the rng. This matches the behavior of the other test suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90936
Approved by: https://github.com/desertfire
2022-12-16 00:15:27 +00:00
d04e3c994f [FSDP] Fix input grad propagation when using param mixed precision (#90921)
For parameter mixed precision, we cast the inputs to the low precision parameter dtype. If the input has tensors that require gradient, then we must cast them in place in order for them to receive a gradient. The cast should be tracked by autograd (e.g. with `grad_fn` equal to `ToCopyBackward0`). This removes the `torch.no_grad` context when calling `_apply_to_tensors`.
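A standalone illustration of the autograd behavior this relies on (not FSDP's `_apply_to_tensors` path itself): casting a tensor that requires grad produces a `ToCopyBackward0` node, so the original fp32 input still receives a gradient:
```python
import torch

x = torch.randn(4, requires_grad=True)
x_lp = x.to(torch.float16)   # the cast is tracked by autograd
print(x_lp.grad_fn)          # <ToCopyBackward0 ...>
x_lp.sum().backward()
print(x.grad.dtype)          # torch.float32
```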
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90921
Approved by: https://github.com/mrshenli, https://github.com/rohan-varma
2022-12-15 23:55:19 +00:00
9c912c7dd0 Revert "[quant][pt2e] Add early prototype top level quantize_pt2e APIs (#90802)"
This reverts commit a66af1feba90cc64381bec45b0aa20ec778c92c5.

Reverted https://github.com/pytorch/pytorch/pull/90802 on behalf of https://github.com/malfet due to somehow broke test_resnet18 (quantization.fx.test_quantize_pt2e.TestQuantizePT2EModels), see a66af1feba
2022-12-15 23:28:21 +00:00
fdc973308b [inductor] Use --continue_on_fail when installing torchbench (#90922)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90922
Approved by: https://github.com/xuzhao9
2022-12-15 22:52:40 +00:00
eqy
57e2090e21 [Dynamo][TIMM][Benchmarks] Fix TIMM 0.8.0dev breaking the timm_models.py script's data config (#90404)
It seems `0.8.0dev` breaks the current argument passing by expecting a dictionary instead of a namespace after 0dadb4a6e9

CC @desertfire @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90404
Approved by: https://github.com/ngimel
2022-12-15 22:21:19 +00:00
e686a442b4 If a torch.* returns non-Tensor, make this unimplemented rather than assert. (#89918)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89918
Approved by: https://github.com/albanD
2022-12-15 21:53:54 +00:00
a66af1feba [quant][pt2e] Add early prototype top level quantize_pt2e APIs (#90802)
Summary:
This PR introduces the top level APIs for quantization support in PyTorch 2.0 Export stack
* torch.ao.quantization.quantize_pt2e.prepare_pt2e
Takes a model that is captured by the PyTorch 2.0 export (torchdynamo full graph mode) and prepares the model for calibration
for post training quantization

* torch.ao.quantization.quantize_pt2e.convert_pt2e
Takes a calibrated model and converts that to a reference quantized model that can be lowered later to quantized operator libraries or delegation modules

Also added a backend config for the qnnpack_pt2e backend:
* torch.ao.quantization.backend_config.get_qnnpack_pt2e_backend_config

Note: everything related to quantize_pt2e is experimental (prototype), and we don't have any BC guarantees

Test Plan:
python test/test_quantization.py TestQuantizePT2EModels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90802
Approved by: https://github.com/qihqi
2022-12-15 21:50:29 +00:00
201c36d81a Hack get_nbytes() to return 0 for sparse tensors as workaround for functionalization (#90702)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90702
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2022-12-15 19:59:30 +00:00
15c9df7756 Error messages for kernel selection (#90783)
Summary: Error messages for kernel selection

Test Plan: sandcastle & github

Differential Revision: D42008661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90783
Approved by: https://github.com/cpuhrsch
2022-12-15 18:41:12 +00:00
173accd1c1 [ao][fx] fixing public v private qconfig_mapping_utils.py (#88399)
Summary: made _check_is_valid_config_dict,
_compare_prepare_convert_qconfig_mappings,
_generate_node_name_to_qconfig,
_is_qconfig_supported_by_dtype_configs,
_maybe_adjust_qconfig_for_module_name_object_type_order,
_update_qconfig_for_fusion private

Test Plan: python test/test_public_bindings.py

Differential Revision: [D41015544](https://our.internmc.facebook.com/intern/diff/D41015544)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88399
Approved by: https://github.com/jcaip
2022-12-15 17:48:34 +00:00
abc54f9314 Revert "Revert "[functorch] Refactor life handle storage (#90317)"" (#90856)
Adds the fix for -Wsign-compare.

See original PR (https://github.com/pytorch/pytorch/pull/90317) for
commit message
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90856
Approved by: https://github.com/samdow
2022-12-15 16:03:16 +00:00
81f351acd7 [inductor] Prevent blowup in inner_fn_str and extract_read_writes (#88933)
Currently the default `ops` handler expects strings as arguments and
just formats them into a function call template string. For complex
expressions, this can lead to exponential growth in terms. Say for
example you have:

```python
def fn(a):
   for _ in range(3)
       a = ops.mul(a, a)
   return a
```

You might expect `inner_fn_str` to contain 1 load and 3 multiplies,
but instead you find 8 loads and 7 multiplies:
```python
load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0) * load(arg_0, i0)
```

This type of blowup is present in the lowering for
`max_pool2d_with_indices_backward`, which in pytorch/torchdynamo#1352
was reported to have caused the entire compilation to hang.

This PR fixes the issue by formatting the string as a series of assignments to
variables, so for the example above, we now get:
```
tmp0 = load(arg_0, i0)
tmp1 = tmp0 * tmp0
tmp2 = tmp1 * tmp1
tmp3 = tmp2 * tmp2
return tmp3
```

Which corresponds to sequence of `ops` calls made.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88933
Approved by: https://github.com/jansel
2022-12-15 15:36:52 +00:00
c4718e9b09 [FSDP] Enable mixed hybrid/non-hybrid sharding strategies (#90846)
In the context of hybrid sharding strategies, we only need to enforce the same process groups among the instances using a hybrid sharding strategy, not all instances. We can even mix and match the two different hybrid sharding strategies. This PR relaxes the validation to support this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90846
Approved by: https://github.com/rohan-varma
2022-12-15 15:36:23 +00:00
2f8c0cb2a4 [FSDP][Easy] Use run_subtests for hybrid shard test (#90859)
This PR uses `self.run_subtests` which exactly contains the `self.subTest` and `dist.barrier()` boilerplate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90859
Approved by: https://github.com/rohan-varma
2022-12-15 15:32:00 +00:00
b92975a6f3 replicate state_dict tests (#90868)
Simple tests for replicate() state_dict. Ensuring composition with
FSDP works will come as a follow up.

Differential Revision: [D42048131](https://our.internmc.facebook.com/intern/diff/D42048131/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90868
Approved by: https://github.com/awgu
2022-12-15 14:53:24 +00:00
d6fe9838d1 [inductor] add conv_transpose2d unary fusion for cpu in inference mode (#90265)
An FX transformation is added to fuse ConvTranspose2d with eltwise OPs in torchinductor for CPU in inference mode, following the implementation in https://github.com/pytorch/pytorch/pull/87063.

The fusion OP is implemented in https://github.com/pytorch/pytorch/pull/90264 and will be treated as an extern kernel call in torchinductor.

The fusion of ConvTranspose2d with the below OPs is supported:

- relu
- sigmoid
- tanh
- hardswish
- leaky_relu
- hardtanh
- gelu
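For reference, one example of the ConvTranspose2d + eltwise pattern this FX pass targets (the fusion itself happens inside inductor when compiling for CPU inference; the snippet only shows the eager pattern):
```python
import torch

class DeconvReLU(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.deconv = torch.nn.ConvTranspose2d(3, 8, kernel_size=3)

    def forward(self, x):
        return torch.relu(self.deconv(x))

m = DeconvReLU().eval()
with torch.no_grad():
    y = m(torch.randn(1, 3, 16, 16))
print(y.shape)  # torch.Size([1, 8, 18, 18])
```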

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90265
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-15 14:22:04 +00:00
85698d0ac4 add conv_transpose2d pointwise(unary) fusion kernel (#90264)
This PR adds `torch.ops.mkldnn._convolution_transpose_pointwise` that supports ConvTranspose fusion with the below unary pointwise OPs:

- relu
- sigmoid
- tanh
- hardswish
- leaky_relu
- hardtanh
- gelu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90264
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-15 14:16:58 +00:00
9b89ff0923 [Quant] onednn backend switch to ideep new api without affacting performance (#90354)
**Summary**
The onednn quantization backend switches to the new API in `third_party/ideep`.
- `struct forward_params` for conv/deconv are changed. Modify primitive cache accordingly.
- Use new versions of `prepare` and `compute` API. Fp32 and int8 paths separated. The old ones will be deprecated.
- Now `ideep::tensor::reorder_if_differ_in` supports block-to-block reorder. Use it instead of defining a util function `onednn_utils::try_reorder`.
- For the new API of transposed convolution, we can use a flag to keep the weight desc aligned with oneDNN, so there is no need to transpose it explicitly in PyTorch.
- Use `is_channels_last` flag to specify layout of src/dst when querying expected weight desc.

It won't impact correctness. Performance should be unaffected or slightly better.
FBGEMM and QNNPACK backends are not affected.

Performance results are given below.
1. End-to-end performance of static quantized models (from torchvision)
(throughput: fps, higher is better)
![image](https://user-images.githubusercontent.com/12522207/206105879-45c59996-9804-4531-aa1f-dc962e6db5ab.png)

2. Op benchmark of dynamic quantized linear
(Latency: ms, lower is better)
![image](https://user-images.githubusercontent.com/12522207/206124949-77352991-0fda-4285-a484-e20a5797262b.png)

Test method & env:
- Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
- Run multi-instances on a single node. Use one core for each instance.
- Use Jemalloc and Intel OpenMP

**Test plan**
python test/test_quantization.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90354
Approved by: https://github.com/jgong5
2022-12-15 12:48:45 +00:00
79009cbc53 [CUDA 12] Fix the endif guard position for cusparse const descriptors (#90897)
[CUDA 12] Fix the endif guard position for cusparse const descriptors

Related https://github.com/pytorch/pytorch/pull/90765
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90897
Approved by: https://github.com/IvanYashchuk
2022-12-15 11:28:54 +00:00
98799ca0f4 [Composable API] replicate: cleanup _ddp.py (#90257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90257
Approved by: https://github.com/mrshenli
2022-12-15 08:48:57 +00:00
0b22f5ae9f Deeply rework WeakIdKeyDictionary (#90825)
In the prior patch, I just YOLOed a mutable mapping implementation.
Many edge cases were not handled correctly.  In this PR, I just
copy-paste the WeakKeyDictionary from CPython and then hack it up
to use WeakIdRef instead of weakref.ref.  You can see each line
I changed with the comment CHANGED; there aren't many.

Being exactly API compatible with WeakKeyDictionary means I can also
rob all of the tests from CPython, which I also did for
test/test_weak.py

How to review?  You could either try taking the delta from CPython
(recommended), or review everything from scratch (not recommended).
Can post diff representing delta on request.
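Illustrative usage of the resulting mapping (the import path `torch.utils.weak` is assumed here, matching where `test/test_weak.py` exercises it): keys are compared by identity, so tensors can be used as keys without being kept alive:
```python
import torch
from torch.utils.weak import WeakIdKeyDictionary  # assumed import path

d = WeakIdKeyDictionary()
t = torch.randn(3)
d[t] = "some metadata"
print(len(d))  # 1
del t
print(len(d))  # 0 once the tensor has been collected
```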

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90825
Approved by: https://github.com/albanD
2022-12-15 08:43:08 +00:00
54563e6288 Don't put tracing state on Tensor (#90628)
Fixes https://github.com/pytorch/pytorch/issues/89626

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90628
Approved by: https://github.com/voznesenskym
2022-12-15 08:43:08 +00:00
103029e035 inductor: sort the reads buf by name (#89744)
Sort `read_writes.reads` by name to make sure the same graph is generated for a fixed model. Otherwise, the buffer reuse may be different since the order of `read_writes.reads` is random.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89744
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-15 08:42:49 +00:00
cyy
9fe050f39c fix cudnn RNN reproducibility problem (#90522)
Fixes #74177

Since the RNN code uses static variables to cache state, we store an atomic_flag in the RNG generator to notify of new seed changes and generate a new random state for the RNN. The additional cost is that it must check the atomic_flag each time to ensure reproducibility. This may be ugly, but it is currently the best way without a large code refactoring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90522
Approved by: https://github.com/ngimel
2022-12-15 08:21:37 +00:00
cyy
dcfe7ff7e2 fix a memory leak on return without free (#90372)
This issue was found by static analysis. The object allocated by `new` may be leaked on an early return.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90372
Approved by: https://github.com/swolchok
2022-12-15 07:07:48 +00:00
0ac0af02d5 Reland Fix issue 38095 TODO in test_multiprocessing.py (#90741)
Fix TODO related to https://github.com/pytorch/pytorch/issues/38095
Reland of https://github.com/pytorch/pytorch/pull/90335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90741
Approved by: https://github.com/clee2000
2022-12-15 05:32:27 +00:00
4e6455163f Fix unittest rerun logic when checking for skipped tests (#90888)
I made an important mistake here when thinking `not result.skipped` meant that the current test wasn't skipped.

Similar to `result.failures` or `result.errors`, `result.skipped` is a list including all the skipped messages so far in the test suite (https://docs.python.org/3/library/unittest.html#unittest.TestResult).  As such, the correct way to check if the current test was skipped is to compare `skipped_before` with `len(result.skipped)` after running the test, in the same way as failures and errors are handled.  If they are the same, the test wasn't skipped.
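A self-contained illustration of why `not result.skipped` is the wrong check (test names here are illustrative): `result.skipped` accumulates across the whole run, so a later, non-skipped test still sees a non-empty list:
```python
import unittest

class Example(unittest.TestCase):
    @unittest.skip("always skipped")
    def test_a(self):
        pass

    def test_b(self):
        pass

suite = unittest.TestLoader().loadTestsFromTestCase(Example)
result = unittest.TestResult()
suite.run(result)
print(result.skipped)      # one entry, from test_a
print(not result.skipped)  # False, even though test_b itself was not skipped
```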

### Testing

`python test/run_test.py -i test_autograd --verbose` to confirm that the disabled test `test_profiler_seq_nr` is run 50 times always in rerun mode
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90888
Approved by: https://github.com/clee2000
2022-12-15 05:13:59 +00:00
2ba5c1d7c4 Inductor cpp wrapper: change inputs args from tuple to vector (#90754)
## Pitch
Change input args type from `std::tuple` to `std::vector` to reduce the compilation time.

## Description
`std::tie()` takes quite a long time during compilation when the number of input args grows.

For example, for a graph from the `PegasusForConditionalGeneration` model with 318 input args, the compilation of `std::tie` for the args is about 10s. By changing to std::vector, the compilation time of arg assignment is reduced to less than 1s.

### Code before:
```cpp
at::Tensor call_0(std::tuple<at::Tensor&, at::Tensor&> args) {
    at::Tensor arg0_1, arg1_1;
    std::tie(arg0_1, arg1_1) = args;
    ...
    return buf0;
}
```

### Code after:
```cpp
at::Tensor call_0(std::vector<at::Tensor> args) {
    at::Tensor arg0_1, arg1_1;
    arg0_1 = args[0];
    arg1_1 = args[1];
    ...
    return buf0;
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90754
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-15 05:07:16 +00:00
39d9dd135a [FSDP][Easy] ufmt files (#90858)
```
ufmt format torch/distributed/fsdp
ufmt format test/distributed/fsdp
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90858
Approved by: https://github.com/rohan-varma
2022-12-15 04:15:26 +00:00
670efb974a [CUDA] Use accumulate type to improve accuracy of grid_sample on half precision inputs (#90427)
Fixes https://github.com/pytorch/pytorch/issues/89836

This PR changes the forward CUDA kernels of grid_sample 2d and 3d to use the accumulate type to improve accuracy on half-precision inputs.

Also, the backward error on the grad with half input is on the order of 1e-4, unlike 1e2 in the forward process. The backward kernels are thus unchanged.
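A small, illustrative probe of the forward accuracy gap being addressed (needs a CUDA device since the change is to the CUDA kernels; shapes are arbitrary):
```python
import torch
import torch.nn.functional as F

if torch.cuda.is_available():
    x = torch.randn(1, 1, 32, 32, device="cuda")
    grid = torch.rand(1, 16, 16, 2, device="cuda") * 2 - 1
    out_fp32 = F.grid_sample(x, grid, align_corners=False)
    out_fp16 = F.grid_sample(x.half(), grid.half(), align_corners=False)
    print((out_fp32 - out_fp16.float()).abs().max())
```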
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90427
Approved by: https://github.com/ngimel
2022-12-15 03:41:35 +00:00
eecd621f06 [cuDNN][cuDNN V8 API] (re-re-open) cuDNN V8 API on by default (#89022)
Testing V8 on by default again after fixes have been merged for e.g., https://github.com/pytorch/torchdynamo/issues/1833

One new failure that seems to surface with V8 on appears in halonext + amp:
```
RuntimeError: Internal Triton PTX codegen error:
Segmentation fault (core dumped)
```
But I'm not sure whether this points to a V8 issue or a Triton issue. CC @ngimel @ptrblck

Current dynamo benchmarks on A100:
v7 vs. v8
|dev |name                           |batch_size|abs_latency_v7|abs_latency_v8|
|----|-------------------------------|----------|--------------|--------------|
|cuda|adv_inception_v3               |128       |166.0240      |165.5798      |
|cuda|beit_base_patch16_224          |64        |123.5912      |123.0797      |
|cuda|botnet26t_256                  |128       |107.7343      |107.5948      |
|cuda|cait_m36_384                   |4         |184.5038      |184.0271      |
|cuda|coat_lite_mini                 |128       |142.3061      |140.5814      |
|cuda|convit_base                    |64        |165.2499      |161.0743      |
|cuda|convmixer_768_32               |32        |325.6984      |325.7094      |
|cuda|convnext_base                  |64        |237.4632      |238.0142      |
|cuda|crossvit_9_240                 |128       |72.2980       |72.4367       |
|cuda|cspdarknet53                   |64        |96.6862       |96.8308       |
|cuda|deit_base_distilled_patch16_224|64        |117.6045      |117.9616      |
|cuda|dla102                         |128       |182.3073      |182.2304      |
|cuda|dm_nfnet_f0                    |128       |133.6011      |133.6298      |
|cuda|dpn107                         |32        |148.5080      |148.5885      |
|cuda|eca_botnext26ts_256            |128       |113.8676      |113.1514      |
|cuda|eca_halonext26ts               |128       |119.2242      |119.1845      |
|cuda|ese_vovnet19b_dw               |128       |80.0217       |79.9438       |
|cuda|fbnetc_100                     |128       |91.4548       |91.4009       |
|cuda|fbnetv3_b                      |128       |115.4496      |115.5058      |
|cuda|gernet_l                       |128       |114.8365      |114.7870      |
|cuda|ghostnet_100                   |128       |58.5766       |58.5766       |
|cuda|gluon_inception_v3             |128       |165.5222      |165.7167      |
|cuda|gluon_xception65               |32        |165.8779      |165.7818      |
|cuda|gmixer_24_224                  |128       |116.3611      |113.4925      |
|cuda|gmlp_s16_224                   |128       |121.2607      |121.2534      |
|cuda|hrnet_w18                      |128       |246.5706      |246.7599      |
|cuda|inception_v3                   |128       |166.1096      |166.2034      |
|cuda|jx_nest_base                   |32        |93.6064       |93.4088       |
|cuda|lcnet_050                      |128       |21.4156       |21.4207       |
|cuda|levit_128                      |128       |27.2901       |27.2543       |
|cuda|mixer_b16_224                  |128       |157.8992      |158.2878      |
|cuda|mixnet_l                       |128       |197.3443      |197.2125      |
|cuda|mnasnet_100                    |128       |71.4604       |71.2997       |
|cuda|mobilenetv2_100                |128       |67.6080       |67.7515       |
|cuda|mobilenetv3_large_100          |128       |57.7224       |57.6591       |
|cuda|mobilevit_s                    |64        |93.0372       |93.0530       |
|cuda|nfnet_l0                       |128       |113.1664      |113.2853      |
|cuda|pit_b_224                      |64        |133.3333      |133.4153      |
|cuda|pnasnet5large                  |16        |238.9545      |238.8122      |
|cuda|poolformer_m36                 |64        |144.2353      |144.2375      |
|cuda|regnety_002                    |128       |32.8534       |32.9069       |
|cuda|repvgg_a2                      |128       |102.4150      |102.3827      |
|cuda|res2net101_26w_4s              |64        |120.8127      |120.8322      |
|cuda|res2net50_14w_8s               |128       |149.7052      |149.8969      |
|cuda|res2next50                     |128       |153.7439      |153.8215      |
|cuda|resmlp_12_224                  |128       |89.1918       |86.9226       |
|cuda|resnest101e                    |64        |159.4706      |159.3133      |
|cuda|rexnet_100                     |128       |88.0032       |88.0397       |
|cuda|sebotnet33ts_256               |64        |80.4635       |80.0120       |
|cuda|selecsls42b                    |128       |70.4430       |70.3663       |
|cuda|spnasnet_100                   |128       |78.0537       |78.1991       |
|cuda|swin_base_patch4_window7_224   |64        |212.9073      |213.0824      |
|cuda|swsl_resnext101_32x16d         |32        |193.0229      |193.0404      |
|cuda|tf_efficientnet_b0             |128       |97.1316       |97.0410       |
|cuda|tf_mixnet_l                    |128       |203.4956      |203.5340      |
|cuda|tinynet_a                      |128       |82.4038       |82.8733       |
|cuda|tnt_s_patch16_224              |128       |284.8576      |284.8867      |
|cuda|twins_pcpvt_base               |64        |118.3893      |119.2329      |
|cuda|visformer_small                |128       |126.0533      |126.0390      |
|cuda|vit_base_patch16_224           |64        |118.2873      |118.0573      |
|cuda|volo_d1_224                    |64        |108.7764      |108.2063      |
|cuda|xcit_large_24_p8_224           |5         |100.4656      |100.5209      |

v7 vs. v8 amp

|dev |name                           |batch_size|abs_latency_v7|abs_latency_v8|
|----|-------------------------------|----------|--------------|--------------|
|cuda|adv_inception_v3               |128       |104.9729      |105.1237      |
|cuda|beit_base_patch16_224          |64        |75.4330       |75.2039       |
|cuda|botnet26t_256                  |128       |74.5149       |74.8071       |
|cuda|cait_m36_384                   |4         |110.9788      |111.5170      |
|cuda|coat_lite_mini                 |128       |62.3618       |64.4965       |
|cuda|convit_base                    |64        |116.4054      |117.9129      |
|cuda|convmixer_768_32               |32        |264.4401      |264.4491      |
|cuda|convnext_base                  |64        |182.9009      |179.2136      |
|cuda|crossvit_9_240                 |128       |48.8586       |48.8359       |
|cuda|cspdarknet53                   |64        |80.0245       |80.0160       |
|cuda|deit_base_distilled_patch16_224|64        |66.5921       |66.7448       |
|cuda|dla102                         |128       |116.7780      |117.1683      |
|cuda|dm_nfnet_f0                    |128       |78.9322       |79.1135       |
|cuda|dpn107                         |32        |85.5206       |85.7514       |
|cuda|eca_botnext26ts_256            |128       |76.3672       |77.0050       |
|cuda|eca_halonext26ts               |128       |86.2458       |              |
|cuda|ese_vovnet19b_dw               |128       |43.2943       |43.3379       |
|cuda|fbnetc_100                     |128       |54.8479       |54.9251       |
|cuda|fbnetv3_b                      |128       |70.7504       |71.0188       |
|cuda|gernet_l                       |128       |66.1607       |66.0379       |
|cuda|ghostnet_100                   |128       |43.8882       |43.9336       |
|cuda|gluon_inception_v3             |128       |104.9297      |105.0204      |
|cuda|gluon_xception65               |32        |85.7118       |85.8370       |
|cuda|gmixer_24_224                  |128       |75.1214       |76.1170       |
|cuda|gmlp_s16_224                   |128       |76.4207       |76.6641       |
|cuda|hrnet_w18                      |128       |186.1326      |186.2435      |
|cuda|inception_v3                   |128       |105.0561      |105.0783      |
|cuda|jx_nest_base                   |32        |65.3066       |65.3245       |
|cuda|lcnet_050                      |128       |14.7991       |14.8687       |
|cuda|levit_128                      |128       |19.2893       |19.4772       |
|cuda|mixer_b16_224                  |128       |93.9826       |94.2056       |
|cuda|mixnet_l                       |128       |147.1245      |147.0435      |
|cuda|mnasnet_100                    |128       |39.1781       |39.2565       |
|cuda|mobilenetv2_100                |128       |42.3704       |42.3114       |
|cuda|mobilenetv3_large_100          |128       |37.2946       |37.2816       |
|cuda|mobilevit_s                    |64        |55.8930       |55.8934       |
|cuda|nfnet_l0                       |128       |64.0448       |64.4438       |
|cuda|pit_b_224                      |64        |80.6342       |80.2933       |
|cuda|pnasnet5large                  |16        |154.9611      |154.8654      |
|cuda|poolformer_m36                 |64        |101.7489      |101.8138      |
|cuda|regnety_002                    |128       |27.0939       |27.0309       |
|cuda|repvgg_a2                      |128       |60.9651       |61.2533       |
|cuda|res2net101_26w_4s              |64        |77.3291       |77.4739       |
|cuda|res2net50_14w_8s               |128       |93.6572       |93.7221       |
|cuda|res2next50                     |128       |112.4975      |112.3248      |
|cuda|resmlp_12_224                  |128       |59.5422       |60.7644       |
|cuda|resnest101e                    |64        |97.9894       |98.3358       |
|cuda|rexnet_100                     |128       |55.2218       |55.0718       |
|cuda|sebotnet33ts_256               |64        |60.4880       |60.8113       |
|cuda|selecsls42b                    |128       |41.4294       |41.5341       |
|cuda|spnasnet_100                   |128       |45.0037       |45.0304       |
|cuda|swin_base_patch4_window7_224   |64        |98.2561       |98.6925       |
|cuda|swsl_resnext101_32x16d         |32        |100.6179      |100.9195      |
|cuda|tf_efficientnet_b0             |128       |56.5344       |56.4591       |
|cuda|tf_mixnet_l                    |128       |153.0318      |152.9367      |
|cuda|tinynet_a                      |128       |54.1307       |53.9298       |
|cuda|tnt_s_patch16_224              |128       |142.4801      |142.6589      |
|cuda|twins_pcpvt_base               |64        |67.9027       |67.8325       |
|cuda|visformer_small                |128       |72.5589       |72.9427       |
|cuda|vit_base_patch16_224           |64        |71.4885       |71.7342       |
|cuda|volo_d1_224                    |64        |69.3539       |69.5910       |
|cuda|xcit_large_24_p8_224           |5         |59.9000       |59.9699       |

v7 vs. v8 float16
|dev |name                           |batch_size|abs_latency_v7|abs_latency_v8|
|----|-------------------------------|----------|-----------|-----------|
|cuda|adv_inception_v3               |128       |104.2544   |104.2677   |
|cuda|beit_base_patch16_224          |64        |85.3601    |85.3786    |
|cuda|botnet26t_256                  |128       |72.1476    |71.8277    |
|cuda|cait_m36_384                   |4         |108.3075   |108.5941   |
|cuda|coat_lite_mini                 |128       |61.2382    |61.6049    |
|cuda|convmixer_768_32               |32        |263.3818   |263.3598   |
|cuda|convnext_base                  |64        |172.6821   |173.8520   |
|cuda|crossvit_9_240                 |128       |44.6321    |44.6340    |
|cuda|cspdarknet53                   |64        |79.3165    |79.2964    |
|cuda|deit_base_distilled_patch16_224|64        |61.9816    |62.2109    |
|cuda|dla102                         |128       |115.7403   |115.9928   |
|cuda|dm_nfnet_f0                    |128       |77.5434    |77.7440    |
|cuda|dpn107                         |32        |83.6489    |83.5605    |
|cuda|eca_botnext26ts_256            |128       |73.9953    |74.1031    |
|cuda|eca_halonext26ts               |128       |81.7951    |81.7103    |
|cuda|ese_vovnet19b_dw               |128       |42.9618    |42.8853    |
|cuda|fbnetc_100                     |128       |54.3590    |54.3575    |
|cuda|fbnetv3_b                      |128       |69.7977    |70.1696    |
|cuda|gernet_l                       |128       |64.8684    |65.1726    |
|cuda|ghostnet_100                   |128       |43.2054    |43.1319    |
|cuda|gluon_inception_v3             |128       |104.1988   |104.3030   |
|cuda|gluon_xception65               |32        |84.2245    |84.5085    |
|cuda|gmixer_24_224                  |128       |82.0418    |82.7252    |
|cuda|gmlp_s16_224                   |128       |75.4792    |75.8374    |
|cuda|hrnet_w18                      |128       |184.1450   |184.1848   |
|cuda|inception_v3                   |128       |104.1203   |104.2536   |
|cuda|jx_nest_base                   |32        |58.2386    |58.4901    |
|cuda|lcnet_050                      |128       |14.6409    |14.5616    |
|cuda|levit_128                      |128       |22.3875    |22.4680    |
|cuda|mixer_b16_224                  |128       |98.9534    |98.4730    |
|cuda|mixnet_l                       |128       |146.1623   |146.1947   |
|cuda|mnasnet_100                    |128       |38.9208    |39.3463    |
|cuda|mobilenetv2_100                |128       |41.8946    |41.9847    |
|cuda|mobilenetv3_large_100          |128       |36.7810    |36.8264    |
|cuda|mobilevit_s                    |64        |55.3211    |55.3186    |
|cuda|nfnet_l0                       |128       |63.1302    |63.5544    |
|cuda|pit_b_224                      |64        |73.8752    |73.4602    |
|cuda|pnasnet5large                  |16        |151.6806   |151.6111   |
|cuda|poolformer_m36                 |64        |86.8341    |86.8021    |
|cuda|regnety_002                    |128       |26.6798    |26.5295    |
|cuda|repvgg_a2                      |128       |61.6652    |62.1482    |
|cuda|res2net101_26w_4s              |64        |75.8037    |75.7739    |
|cuda|res2net50_14w_8s               |128       |92.6362    |92.4338    |
|cuda|res2next50                     |128       |111.5371   |111.5832   |
|cuda|resmlp_12_224                  |128       |58.2349    |57.9807    |
|cuda|resnest101e                    |64        |96.1114    |96.2742    |
|cuda|rexnet_100                     |128       |54.8138    |54.7643    |
|cuda|sebotnet33ts_256               |64        |53.1524    |53.3823    |
|cuda|selecsls42b                    |128       |40.6070    |40.7104    |
|cuda|spnasnet_100                   |128       |44.5732    |44.4318    |
|cuda|swin_base_patch4_window7_224   |64        |98.6447    |98.8445    |
|cuda|swsl_resnext101_32x16d         |32        |97.0195    |97.2968    |
|cuda|tf_efficientnet_b0             |128       |56.0640    |56.0278    |
|cuda|tf_mixnet_l                    |128       |152.0958   |152.0874   |
|cuda|tinynet_a                      |128       |53.3694    |53.3762    |
|cuda|tnt_s_patch16_224              |128       |130.2981   |130.3726   |
|cuda|twins_pcpvt_base               |64        |62.5459    |62.6416    |
|cuda|visformer_small                |128       |68.8502    |69.1756    |
|cuda|vit_base_patch16_224           |64        |65.8587    |66.0285    |
|cuda|volo_d1_224                    |64        |64.5348    |64.6057    |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89022
Approved by: https://github.com/ngimel
2022-12-15 03:24:44 +00:00
6a866c3ed1 [ao] fixing public v private for torch.ao.nn.X (#87883)
Summary: This mostly consisted of adding `__all__` to files that were missing it. A few functions in X.utils were made private as well.
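
For context, a minimal sketch of the kind of change involved (the file and symbol names here are hypothetical, not the actual modules touched):
```py
# torch/ao/nn/example.py (hypothetical module)
__all__ = ["PublicThing"]  # only names listed here are part of the public API

class PublicThing:
    pass

def _private_helper():  # underscore prefix + omission from __all__ marks it private
    pass
```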

Test Plan: python test/test_public_bindings.py

Differential Revision: [D40814548](https://our.internmc.facebook.com/intern/diff/D40814548)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87883
Approved by: https://github.com/jcaip, https://github.com/anjali411
2022-12-15 03:03:07 +00:00
edc5bb5fbe Only populate real_value_cache during export (#90468)
Fixes https://github.com/pytorch/torchdynamo/issues/1950

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90468
Approved by: https://github.com/voznesenskym
2022-12-15 02:28:21 +00:00
f286cbebce [ao][fx] fixing public v private graph_module.py (#88395)
Summary: Made `_is_observed_module` and `_is_observed_standalone_module` private.

Test Plan: python test/test_public_bindings.py

Differential Revision: [D41015545](https://our.internmc.facebook.com/intern/diff/D41015545)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88395
Approved by: https://github.com/jcaip
2022-12-15 02:15:04 +00:00
283cf718ed Fix _fix_weakref memory leak (#90823)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90823
Approved by: https://github.com/eellison, https://github.com/albanD
2022-12-15 01:07:29 +00:00
d19791e4cd add autocast keys to pybind11 DispatchKey object (#90821)
Summary:

This is useful for debugging what autocast is doing when it's running on top of torchdynamo; without this, the Python dispatch key for autocast prints as `???`.

Test Plan:

```
import torch
dir(torch._C.DispatchKey)
# the autocast keys show up now
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90821
Approved by: https://github.com/ezyang
2022-12-15 00:15:07 +00:00
86269852de Serialize dynamo/inductor config for minifier (#90501)
Fixes https://github.com/pytorch/torchdynamo/issues/1965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90501
Approved by: https://github.com/mlazos
2022-12-14 23:44:06 +00:00
e585156c59 [JIT] Frozen Graph Linear-BatchNormNd Folding (#86706)
This PR adds linear-batchnormNd folding for JIT frozen graphs.
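
For reference, folding BN into the preceding linear layer just rescales the linear weight and bias using the frozen BN statistics; a minimal sketch of the algebra (independent of the JIT pass itself):
```py
import torch

def fold_linear_bn1d(linear: torch.nn.Linear, bn: torch.nn.BatchNorm1d) -> torch.nn.Linear:
    # y = gamma * (W x + b - mean) / sqrt(var + eps) + beta  ==  W' x + b'
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused = torch.nn.Linear(linear.in_features, linear.out_features)
    with torch.no_grad():
        fused.weight.copy_(linear.weight * scale[:, None])
        fused.bias.copy_((linear.bias - bn.running_mean) * scale + bn.bias)
    return fused

linear, bn = torch.nn.Linear(8, 16), torch.nn.BatchNorm1d(16).eval()
bn.running_mean.uniform_(-1, 1)   # give BN non-trivial frozen statistics
bn.running_var.uniform_(0.5, 2.0)
x = torch.randn(4, 8)
print(torch.allclose(fold_linear_bn1d(linear, bn)(x), bn(linear(x)), atol=1e-5))  # True
```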

**Performance benchmark**
A preliminary benchmark with a simple linear+bn1d model, tested on the first socket (physical cores) of a Skylake machine.

**FP32, JIT**
without linear-bn folding
![Screenshot (1368)](https://user-images.githubusercontent.com/93151422/195168944-cfc5b920-bc82-4be1-a221-d194c8fa6c18.png)

with linear-bn folding
![Screenshot (1367)](https://user-images.githubusercontent.com/93151422/195168926-267b0515-45a1-4f08-922d-c150845199ae.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86706
Approved by: https://github.com/davidberard98
2022-12-14 23:24:50 +00:00
1ca9d43d4e [ao] quantize.py fixing public v private (#87521)
Summary: Made `_register_activation_post_process_hook`, `_add_observer`, `_get_unique_devices_`, and `_get_observer_dict` private.

Test Plan: python test/test_public_bindings.py

Differential Revision: [D40709277](https://our.internmc.facebook.com/intern/diff/D40709277)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87521
Approved by: https://github.com/jerryzh168
2022-12-14 22:50:39 +00:00
691a44f403 [Quant][fx][bc-breaking] Add simpler BackendConfig pattern format (#90698)
Summary: The existing BackendConfig fusion pattern
uses a "reversed nested tuple" format that is highly
unintuitive. For example,
```
linear-relu -> (nn.ReLU, nn.Linear)
conv-bn-relu -> (nn.ReLU, (nn.BatchNorm2d, nn.Conv2d))
```
This pattern format also complicates the signatures
of the user-specified "fuser methods", which needed
to accept arguments in reverse nested order to match
the patterns:
```
def fuse_linear_relu(is_qat, relu, linear):
    ...

def fuse_conv_bn_relu(is_qat, relu, bn_conv):
    (bn, conv) = bn_conv
    ...
```
Instead, this commit introduces a new pattern format that
simply specifies the ops in forward order with no nesting:
```
linear-relu -> (nn.Linear, nn.ReLU)
conv-bn-relu -> (nn.Conv2d, nn.BatchNorm2d, nn.ReLU)

def fuse_linear_relu(is_qat, linear, relu):
    ...

def fuse_conv_bn_relu(is_qat, conv, bn, relu):
    ...
```
Note that the legacy "reversed nested tuple" is still
used internally since it is more general. In the
future, we should replace it with the format used in
the subgraph rewriter in `torch.fx`, and simplify the
existing pattern matching code to handle the new
format added in this commit.

BC-breaking Notes:

Before:
```
import torch.nn as nn
import torch.ao.nn.intrinsic as nni
from torch.ao.quantization.backend_config import BackendPatternConfig

def fuse_conv_bn_relu(is_qat, relu, bn_conv):
    (bn, conv) = bn_conv
    return nni.ConvBnReLU2d(conv, bn, relu)

config = BackendPatternConfig((nn.ReLU, (nn.BatchNorm2d, nn.Conv2d))) \
    .set_dtype_configs(...) \
    .set_fuser_method(fuse_conv_bn_relu) \
    .set_fused_module(nni.ConvBnReLU2d)
```

After:
```
def fuse_conv_bn_relu(is_qat, conv, bn, relu):
    return nni.ConvBnReLU2d(conv, bn, relu)

config = BackendPatternConfig((nn.Conv2d, nn.BatchNorm2d, nn.ReLU)) \
    .set_dtype_configs(...) \
    .set_fuser_method(fuse_conv_bn_relu) \
    .set_fused_module(nni.ConvBnReLU2d)
```

OR (for backward-compatibility)

```
def fuse_conv_bn_relu(is_qat, relu, bn_conv):
    (bn, conv) = bn_conv
    return nni.ConvBnReLU2d(conv, bn, relu)

config = BackendPatternConfig() \
    ._set_pattern_complex_format((nn.ReLU, (nn.BatchNorm2d, nn.Conv2d))) \
    .set_dtype_configs(...) \
    .set_fuser_method(fuse_conv_bn_relu) \
    .set_fused_module(nni.ConvBnReLU2d) \
    ._set_use_legacy_pattern_format(True)
```

Before:
```
backend_config.configs  # returns Dict[Pattern, BackendPatternConfig]
```

After:
```
backend_config.configs  # returns List[BackendPatternConfig]
```

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestBackendConfig

Reviewers: jerryzh168, vkuzo

Subscribers: jerryzh168, vkuzo

Differential Revision: [D41954553](https://our.internmc.facebook.com/intern/diff/D41954553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90698
Approved by: https://github.com/vkuzo, https://github.com/jerryzh168
2022-12-14 22:44:29 +00:00
1e347b737b Run MPS PR tests on both Ventura and Monterey (#89312)
Add `runs-on` input parameter to _mac-test-mps.yml and run `ciflow/mps` on both Monterey and Ventura machines
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89312
Approved by: https://github.com/huydhn
2022-12-14 22:05:33 +00:00
7a112c43c1 [DataLoader2] Fix apply_sharding to accept one sharding_filter per branch (#90769)
Changes:
- Allow multiple `sharding_filter`s in the pipeline as long as they are not on the same branch
- [x] Add test

Example:
```mermaid
graph TD;
DP1-->sharding_filter_1;
sharding_filter_1-->DP3;
DP2-->sharding_filter_2;
sharding_filter_2-->DP4;
DP3-->DP4;
DP4-->output;
```
In order to properly shard `DP1` and `DP2`, we should allow multiple `sharding_filter`s.
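
A hedged sketch of that pipeline shape, assuming the torchdata `IterableWrapper` / `sharding_filter` / `zip` datapipe APIs (the PR's actual tests may differ):
```py
from torchdata.datapipes.iter import IterableWrapper
from torch.utils.data.graph_settings import apply_sharding

# Two branches, each with its own sharding_filter, joined afterwards
# (mirrors the mermaid graph above).
dp1 = IterableWrapper(range(10)).sharding_filter()
dp2 = IterableWrapper(range(10, 20)).sharding_filter()
pipe = dp1.zip(dp2)

# With this change, apply_sharding accepts one sharding_filter per branch.
apply_sharding(pipe, 2, 0)  # num_of_instances=2, instance_id=0
print(list(pipe))           # this shard sees every other pair of elements
```
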
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90769
Approved by: https://github.com/NivekT
2022-12-14 22:03:41 +00:00
1ba4e3c711 [FSDP][BE] Remove _module_to_handles, HandleConfig; use term "fqn"; clarify docs (#90840)
This PR
- Removes `_module_to_handles` since it is no longer used. We instead use `_comm_module_to_handles`.
- Removes `HandleConfig` and stores its fields directly as attributes on `FlatParamHandle`.
- Uses the term `fqn`/`fqns` uniformly in `flat_param.py` instead of `prefixed_param_name` / `prefixed_param_names`.
- Clarifies some documentation.

I am including all of these BE items in the same PR to save CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90840
Approved by: https://github.com/rohan-varma
2022-12-14 21:37:37 +00:00
8090cb5386 Add macro C10_AS_INTARRAYREF_SLOW (#90675)
This makes it easier to narrow down who is throwing the error,
instead of having to use gdb.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90675
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/JackCaoG
2022-12-14 21:29:23 +00:00
cdf4a80cc1 replace skipIf with xfailif (#90368)
Replace skips with xfails.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90368
Approved by: https://github.com/zou3519
2022-12-14 20:35:58 +00:00
fb18c29486 [BE] Tweak Meta copyright headers (#90805)
s/Facebook, Inc./Meta Platforms, Inc/
s/Confidential and proprietary./This source code is licensed under the BSD-style license/

Per https://www.internalfb.com/intern/wiki/Open_Source/Licenses/Straight_BSD/

Also, add linter that prevents adding those in the future

Fixes https://github.com/pytorch/pytorch/issues/90187
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90805
Approved by: https://github.com/zpao
2022-12-14 20:30:31 +00:00
f3393b7ea7 [torchgen] Introduce Executorch types and signatures (#90781)
Retry of #90591, which is a retry of #89595. Reverted due to dependency PR breaking internal fbcode.

## Forked BaseCppType
Created a module for Executorch: `torchgen.executorch`.

## In `torchgen.executorch.api.types.types`:

* Define `BaseCppType` with `torch::executor` namespace.
## In `torchgen.executorch.api.et_cpp`:

* Help generate `NamedCType` for `ExecutorchCppSignature` arguments.
## In `torchgen.executorch.api.types.signatures`:

* Define the signature using these types. (`ExecutorchCppSignature`)
## In `torchgen.executorch.api.types.__init__`:

* Suppress flake8 error for `import *`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90781
Approved by: https://github.com/ezyang
2022-12-14 20:13:04 +00:00
4adffe6d51 [torchgen] Let native function declaration generation logic take a callable (#90780)
Retry of #90590, which is a retry of #89594. Original PR reverted due to internal breakage.
This PR fixes the breakage by adding a default value to the new argument.

This PR allows the `get_native_function_declarations` API to take a function as an argument. This function should take a `NativeFunction` as input and emit code for the native function declaration. By default it is `dest.compute_native_function_declaration`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90780
Approved by: https://github.com/ezyang
2022-12-14 20:13:04 +00:00
df58020bb6 Align max_pool1d Error Checking between CPU and CUDA/CPU requires_grad (#90211)
Fixes https://github.com/pytorch/pytorch/issues/85712

Standardizes error checking for max_pool1d between CPU and CPU requires_grad/CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90211
Approved by: https://github.com/mruberry
2022-12-14 20:12:09 +00:00
3859aace20 [MPS] Skip tests broken on Ventura (#90843)
Also add `torch.backends.mps.is_macos13_or_newer`
See https://github.com/pytorch/pytorch/issues/85758
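
A hedged usage sketch of the new helper (guarded so the call is only reached on builds where the MPS backend exists):
```py
import torch

# is_available() short-circuits on non-macOS builds, so the new helper is
# only consulted when MPS is actually present.
if torch.backends.mps.is_available() and torch.backends.mps.is_macos13_or_newer():
    print("macOS 13 (Ventura) or newer")
```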

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90843
Approved by: https://github.com/kulinseth, https://github.com/albanD
2022-12-14 19:51:00 +00:00
8a21cac3c3 Improve interpolate() speed for channels_last CPU videos (#90302)
This is the exact same PR as https://github.com/pytorch/pytorch/pull/86361, but on Videos (3D) instead of images (2D).

For torchvision training use-cases (num_threads=1), the speed-ups range from 1X to 2X. When num_threads>1, the speed-ups are much higher, up to ~30X.

Benchmarks details:
<details>

```
main branch=c6942dbbfbf836450898aa9a0c08aefe437d0765
input shape            output size      mode            dtype     num_threads  speed-up  main   PR
(1, 3, 8, 256, 256) -> (16, 320, 320)  linear          float32    num_threads=1   1.0X  54.7ms vs 55.7ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest         float32    num_threads=1   1.7X  40.5ms vs 24.4ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest         uint8      num_threads=1   1.4X  33.1ms vs 23.7ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest-exact   float32    num_threads=1   2.0X  47.5ms vs 24.3ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest-exact   uint8      num_threads=1   1.7X  39.9ms vs 23.7ms

(1, 3, 8, 256, 256) -> (16, 320, 320)  linear          float32    num_threads=2   2.2X  54.6ms vs 25.1ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest         float32    num_threads=2   2.3X  21.2ms vs 9.3ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest         uint8      num_threads=2   1.4X  16.5ms vs 12.0ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest-exact   float32    num_threads=2   2.6X  24.3ms vs 9.3ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest-exact   uint8      num_threads=2   1.7X  19.9ms vs 12.0ms

(1, 3, 8, 256, 256) -> (16, 320, 320)  linear          float32    num_threads=12  10X   54.3ms vs 5.4ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest         float32    num_threads=12  2.5X  4.1ms vs 1.6ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest         uint8      num_threads=12  1.4X  2.9ms vs 2.1ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest-exact   float32    num_threads=12  1.7X  4.8ms vs 2.8ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest-exact   uint8      num_threads=12  1.7X  3.5ms vs 2.1ms

(1, 3, 8, 256, 256) -> (16, 320, 320)  linear          float32    num_threads=32  20X   54.2ms vs 2.7ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest         float32    num_threads=32  1.5X  2.2ms vs 1.5ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest         uint8      num_threads=32  1.6X  1.3ms vs 0.8ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest-exact   float32    num_threads=32  1.3X  1.8ms vs 1.4ms
(1, 3, 8, 256, 256) -> (16, 320, 320)  nearest-exact   uint8      num_threads=32  1.7X  1.3ms vs 0.8ms

(1, 3, 16, 320, 320) -> (8, 256, 256)  linear          float32    num_threads=1   1.0X  15.4ms vs 16.0ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest         float32    num_threads=1   2.0X  12.3ms vs 6.0ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest         uint8      num_threads=1   1.6X  12.0ms vs 7.7ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest-exact   float32    num_threads=1   2.2X  13.1ms vs 6.0ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest-exact   uint8      num_threads=1   1.7X  12.8ms vs 7.6ms

(1, 3, 16, 320, 320) -> (8, 256, 256)  linear          float32    num_threads=2   1.9X  15.5ms vs 8.2ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest         float32    num_threads=2   2.0X  6.1ms vs 3.1ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest         uint8      num_threads=2   1.5X  6.0ms vs 3.9ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest-exact   float32    num_threads=2   2.2X  6.6ms vs 3.0ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest-exact   uint8      num_threads=2   1.7X  6.5ms vs 3.9ms

(1, 3, 16, 320, 320) -> (8, 256, 256)  linear          float32    num_threads=12  11X   15.5ms vs 1.4ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest         float32    num_threads=12  2.0X  1.1ms vs 0.5ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest         uint8      num_threads=12  1.6X  1.1ms vs 0.7ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest-exact   float32    num_threads=12  2.1X  1.2ms vs 0.5ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest-exact   uint8      num_threads=12  1.5X  1.1ms vs 0.8ms

(1, 3, 16, 320, 320) -> (8, 256, 256)  linear          float32    num_threads=32  15X   15.4ms vs 1.0ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest         float32    num_threads=32  1.7X  0.7ms vs 0.4ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest         uint8      num_threads=32  1.3X  0.7ms vs 0.5ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest-exact   float32    num_threads=32  3X    0.7ms vs 0.2ms
(1, 3, 16, 320, 320) -> (8, 256, 256)  nearest-exact   uint8      num_threads=32  2.6X  0.7ms vs 0.3ms

(1, 3, 16, 320, 320) -> (32, 512, 512)  linear          float32    num_threads=1   1.0X  295.6ms vs 304.3ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest         float32    num_threads=1   1.5X  223.2ms vs 144.3ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest         uint8      num_threads=1   1.5X  177.7ms vs 121.0ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest-exact   float32    num_threads=1   1.8X  258.6ms vs 145.3ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest-exact   uint8      num_threads=1   1.6X  203.9ms vs 128.6ms

(1, 3, 16, 320, 320) -> (32, 512, 512)  linear          float32    num_threads=2   1.8X  295.4ms vs 160.4ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest         float32    num_threads=2   1.5X  119.0ms vs 80.2ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest         uint8      num_threads=2   1.4X  84.8ms vs 60.6ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest-exact   float32    num_threads=2   1.7X  136.1ms vs 80.1ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest-exact   uint8      num_threads=2   1.7X  102.2ms vs 60.5ms

(1, 3, 16, 320, 320) -> (32, 512, 512)  linear          float32    num_threads=12  9X    295.3ms vs 32.3ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest         float32    num_threads=12  1.4X  25.2ms vs 18.7ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest         uint8      num_threads=12  1.4X  16.5ms vs 11.9ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest-exact   float32    num_threads=12  1.5X  28.1ms vs 18.8ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest-exact   uint8      num_threads=12  1.7X  19.4ms vs 11.5ms

(1, 3, 16, 320, 320) -> (32, 512, 512)  linear          float32    num_threads=32  18X   294.7ms vs 16.2ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest         float32    num_threads=32  1.2X  14.4ms vs 12.5ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest         uint8      num_threads=32  1.2X  5.9ms vs 4.8ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest-exact   float32    num_threads=32  1.2X  14.5ms vs 12.5ms
(1, 3, 16, 320, 320) -> (32, 512, 512)  nearest-exact   uint8      num_threads=32  1.4X  6.9ms vs 4.8ms

(1, 3, 32, 512, 512) -> (16, 320, 320)  linear          float32    num_threads=1   0.9X  48.6ms vs 55.1ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest         float32    num_threads=1   2.0X  38.8ms vs 19.2ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest         uint8      num_threads=1   1.6X  37.6ms vs 23.8ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest-exact   float32    num_threads=1   2.1X  41.2ms vs 19.2ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest-exact   uint8      num_threads=1   1.7X  39.9ms vs 23.8ms

(1, 3, 32, 512, 512) -> (16, 320, 320)  linear          float32    num_threads=2   1.9X  48.8ms vs 25.3ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest         float32    num_threads=2   2.0X  19.2ms vs 9.5ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest         uint8      num_threads=2   1.6X  18.8ms vs 12.0ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest-exact   float32    num_threads=2   2.2X  20.5ms vs 9.5ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest-exact   uint8      num_threads=2   1.7X  20.0ms vs 12.0ms

(1, 3, 32, 512, 512) -> (16, 320, 320)  linear          float32    num_threads=12  11X   48.6ms vs 4.6ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest         float32    num_threads=12  2.0X  3.4ms vs 1.7ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest         uint8      num_threads=12  1.6X  3.3ms vs 2.1ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest-exact   float32    num_threads=12  2.1X  3.6ms vs 1.7ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest-exact   uint8      num_threads=12  1.7X  3.5ms vs 2.1ms

(1, 3, 32, 512, 512) -> (16, 320, 320)  linear          float32    num_threads=32  27X   48.3ms vs 1.8ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest         float32    num_threads=32  1.1X  2.2ms vs 2.0ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest         uint8      num_threads=32  2.6X  2.1ms vs 0.8ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest-exact   float32    num_threads=32  2.4X  2.3ms vs 0.9ms
(1, 3, 32, 512, 512) -> (16, 320, 320)  nearest-exact   uint8      num_threads=32  2.6X  2.2ms vs 0.8ms

```

</details>

Code:

<details>

```py
import operator_benchmark as op_bench
import torch

"""Microbenchmarks for interpolate operator."""

class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last=False, mode='linear', dtype=torch.float):

        input_image = torch.randint(0, 256, size=input_size, dtype=dtype, device='cpu',
                                    requires_grad=self.auto_set())
        if channels_last:
            if input_image.ndim == 4:
                input_image = input_image.contiguous(memory_format=torch.channels_last)
            elif input_image.ndim == 5:
                input_image = input_image.contiguous(memory_format=torch.channels_last_3d)
            else:
                raise ValueError(
                    f"Can not set channels_last to the input of {input_image.ndim} dims"
                )

        align_corners = None if "nearest" in mode else False

        if mode == "linear":
            mode = {
                3: 'linear',
                4: 'bilinear',
                5: 'trilinear',
            }[input_image.ndim]

        self.inputs = {
            "input_image": input_image,
            "output_size": output_size,
            "mode": mode,
            "align_corners": align_corners,
        }

        self.set_module_name("interpolate")

    def forward(self, input_image, output_size, mode, align_corners):
        return torch.nn.functional.interpolate(input_image, size=output_size, mode=mode,
                                               align_corners=align_corners)

def make_config():
    sizes = (
        ((16, 320, 320), (8, 256, 256)),
        ((16, 320, 320), (32, 512, 512)),
    )

    attrs = []
    for (DHW1, DHW2) in sizes:
        attrs.append([(1, 3, *DHW1), DHW2])
        attrs.append([(1, 3, *DHW2), DHW1])

    config = op_bench.config_list(
        attr_names=["input_size", "output_size"],
        attrs=attrs,
        cross_product_configs={
            'channels_last': [True],
            'mode': ["linear", "nearest", "nearest-exact"],
            'dtype': [torch.float, torch.uint8]
        },
        tags=["short"],
    )

    # Need to remove instances with both torch.int and linear
    # Note: this is naaaasty
    def get_mode(l):
        for d in l:
            if "mode" in d:
                return d["mode"]
    def get_dtype(l):
        for d in l:
            if "dtype" in d:
                return d["dtype"]
    config = [l for l in config if not(get_mode(l) == "linear" and get_dtype(l) == torch.uint8)]
    return config

config = make_config()
op_bench.generate_pt_test(config, InterpolateBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```

```py
import re
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("f3", nargs="?", default="main")
parser.add_argument("f2", nargs="?", default="new")
args = parser.parse_args()

with open(args.f1) as f:
    main = f.readlines()
with open(args.f2) as f:
    new = f.readlines()

out = []

for main_line, new_line in zip(main, new):
    # num_threads=1  # TODO: remove
    if main_line.startswith("num_threads="):
        num_threads = int(main_line.split("=")[-1])
    if main_line.startswith("# Input"):
        deets = f"{main_line.strip()}, {num_threads=}"
    if main_line.startswith("Forward"):
        main_time = float(main_line.split()[-1])
        new_time = float(new_line.split()[-1])
        ratio = main_time / new_time
        fmt = ".1f" if ratio < 3 else ".0f"
        improv = f"{ratio:{fmt}}X"
        time_fmt = ",.3f" if new_time < 100 else ",.1f"
        deets = deets.strip().replace("# Input: ", "")
        deets = deets.replace(": ", "=")
        deets = deets.replace("input_size=", "")
        deets = deets.replace(", output_size=", " -> ")
        deets = deets.replace("dtype=torch.", "")
        deets = deets.replace("mode=", "")
        deets = deets.replace("channels_last=True, ", "")
        split = deets.split(",")
        size = ','.join(split[:-3])
        mode, dtype, threads = split[-3:]
        deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}"

        l = f"{deets}  {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms"
        out.append(l)

def key(s):
    # s = ''.join(s.split()[1:]) # remove "N.nX" part
    num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),)

    input_shape, output_shape = re.findall(r"\(.*?\)", s)
    input_shape = input_shape[1:-1]  # remove parenthesis
    input_HW = tuple(int(x) for x in input_shape.split(",")[-2:])
    input_C = (-int(input_shape.split(",")[1]),)

    output_HW = tuple(int(x) for x in output_shape[1:-1].split(","))
    is_downsample = (output_HW[0] < input_HW[0],)
    if "linear" in s:
        mode = "linear"
    elif "nearest-exact" in s:
        mode = "nearest-exact"
    else:
        assert "nearest" in s
        mode = "nearest"
    mode = (mode,)
    return is_downsample + input_HW + output_HW + num_threads + input_C + mode

for i, l in enumerate(sorted(out, key=key)):
    if i % 5 == 0:
        print()
    # if i % 10 == 0 and i % 40 != 0:
    #     print()
    # if i % 40 == 0:
    #     print("-" * 100)
    print(l)
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90302
Approved by: https://github.com/vfdev-5, https://github.com/fmassa
2022-12-14 19:21:02 +00:00
0cd69d7cda Revert "[functorch] Refactor life handle storage (#90317)"
This reverts commit 4d494986af5201a0c487a9b7f3c68cfa6c4e28d0.

Reverted https://github.com/pytorch/pytorch/pull/90317 on behalf of https://github.com/osalpekar due to Causing contbuilds to fail when pytorch is built with -Wsign-compare internally - details in [D42019543](https://www.internalfb.com/diff/D42019543)
2022-12-14 19:08:33 +00:00
3c637e8007 fix aot autograd for None fw inputs (#89975)
Hot fix: confirmed this fixes an internal model that had None as one of its inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89975
Approved by: https://github.com/aazzolini
2022-12-14 18:44:08 +00:00
e9dc8cc19b Add torch.compile support to minifier (#90308)
Initial fix for https://github.com/pytorch/torchdynamo/issues/1964.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90308
Approved by: https://github.com/mlazos
2022-12-14 18:24:42 +00:00
fde5646f3d Inductor cpp wrapper: support bmm, mm, addmm extern call (#88667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88667
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-14 18:19:27 +00:00
51c6c5e156 [SDPA] Standardizes the return shape for dense tensor of SDPA regardless of fused kernel called (#90776)
# Summary
Continues to fix up the meta output story of SDPA to be more correct

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90776
Approved by: https://github.com/cpuhrsch
2022-12-14 18:08:02 +00:00
caa05e6f87 Give linting steps a unique prefix (#90705)
Give a unique prefix to all steps in lint.yml which catch valid linter errors. This will let retrybot identify lint.yml steps which should not be retried.

This is a prelude to https://github.com/pytorch/test-infra/pull/1275 which extends the retry-on-failure behavior to all PRs in addition to trunk.

This hadn't been an issue previously since we would only retry linter failures on `master`, where retrying is always safe because legitimate linter failures there are virtually non-existent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90705
Approved by: https://github.com/huydhn, https://github.com/malfet
2022-12-14 17:38:14 +00:00
f21cb7d77e [pyfunctorch] Generate a more meaningful name for _SingleLevelAutogradFunction (#90418)
The API to do this is not pretty, but at least it works.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90418
Approved by: https://github.com/soulitzer
2022-12-14 16:20:57 +00:00
da42eab48b Fix circular import in torch/autograd/function.py (#90415)
It turns out it is possible to break cycles by not directly importing a
module:
- there's a problem that torch.jit imports torch._ops and torch._ops
imports torch.jit
- there's another problem that torch.autograd.function imports
custom_function_call but torch._functorch.autograd_function imports
torch.autograd.function

The "better" way to handle all of this is to do some large refactoring so
that torch._functorch.autograd_function imports some file that has
_SingleLevelAutogradFunction and then have torch.autograd.function
depend on torch.functorch.autograd_function... (and ditto for torch.jit
vs torch._ops), but I'm scared to move code around too much for BC
reasons and the fix in this PR works well.
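
The fix here is essentially the deferred (function-local) import pattern; a generic sketch with hypothetical module names:
```py
# a.py (hypothetical)
import b                  # module-level import: a -> b

def run():
    return b.helper()

# b.py (hypothetical)
def helper():
    # Importing `a` lazily, inside the function, breaks the a -> b -> a cycle
    # at module-import time while still letting b use a at call time.
    import a
    return a.__name__
```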

Test Plan:
- import torch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90415
Approved by: https://github.com/albanD, https://github.com/soulitzer
2022-12-14 16:20:57 +00:00
4809e838c1 functorch.jvp support for autograd.Function (#90077)
This PR adds functorch.jvp support for autograd.Function. It does so by
adding a jvp rule for custom_function_call.

For a regular PyTorch operation (like at::sin), the VariableType kernel:
- re-dispatches to at::sin
- calls the jvp rule for at::sin

The jvp rule for custom_function_call does just that. It constructs a
new autograd.Function (because the above logic already exists). Inside
the forward, it re-dispatches to custom_function_call. In the jvp rule,
it just calls whatever the jvp rule is supposed to be.

Since this logic is really close to the custom_function_call_grad, I
just put them together.

Test Plan:
- added jvp rules to the autograd.Function in autograd_function_db
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90077
Approved by: https://github.com/albanD, https://github.com/soulitzer
2022-12-14 16:20:53 +00:00
dcb73aa291 Run inductor benchmark test for every PR (#90773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90773
Approved by: https://github.com/huydhn
2022-12-14 14:43:14 +00:00
cc4131a815 Inductor cpp wrapper: support more dtypes of input (#88666)
Previously only float32 was supported as an input type for the cpp wrapper.
This PR extends the cpp wrapper to support the built-in types: float32, float64, int64, int32, int16, int8, uint8, and bool.
BFloat16 and Float16 will be covered later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88666
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-14 14:30:13 +00:00
ba77afbce1 Move _test_inductor_realize into python (#90517)
Addresses https://github.com/pytorch/pytorch/pull/90014/files#r1043625932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90517
Approved by: https://github.com/ngimel
2022-12-14 12:40:00 +00:00
d35aa2f65a Inductor cpp wrapper: support Reduction (#88561)
For reductions, the code string in the codegen stage and in the execution stage differ due to the `\` line continuations.

- The code string gotten from `code.getvalue()` (`code` is an `IndentedBuffer`) in codegen stage:
  ```
  #pragma omp declare reduction(argmax : struct IndexValue_1 :\
                  omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,\
                  omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)\
                  initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
  ```

- The code string loaded during the execution (`\` will be escaped):
  ```
  #pragma omp declare reduction(argmax : struct IndexValue_1 :                omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,                omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)                  initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
  ```

Thus we can't get the same hash value for these two pieces of code.
This PR adds a function to make the transformation escape the backslash in the codegen stage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88561
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2022-12-14 12:29:50 +00:00
7963dbf3db symbolic-shapes: -anjali411, +jbschlosser (#90816)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90816
Approved by: https://github.com/SherlockNoMad
2022-12-14 10:14:46 +00:00
1aab755320 Fakify params and weights under private config (#90417)
Previously, we planned to lift the parameters and weights while exporting and implement our own transformer to "unlift" the lifted weights and params back to the graph as attributes. But this is a bit challenging because:

- We need to maintain correct ordering for weights and parameters that are passed as inputs so that we know how to map them back.
- Some weights are unused in the graph, so our transformer needs to be aware of which weights and parameters are not used in the graph. And we need to distinguish which are real user input and which are parameters.
- There can be more edge cases we haven't seen in other models yet.

I am aware that @Chillee  and @bdhirsh mentioned that functionalization won't work with fake-tensor attributes but this is fine for the short term as we don't expect users to be modifying weights and params in inference mode. In fact, we explicitly disable attribute mutation in torchdynamo export mode right now.

Given the above conditions, it might be OK to just fakify params when we need to. I use a flag to guard this change.

Differential Revision: [D41891201](https://our.internmc.facebook.com/intern/diff/D41891201)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90417
Approved by: https://github.com/eellison
2022-12-14 09:33:18 +00:00
3870a9e28d to_sparse_XXX: backward support (#90281)
As per title. Fixes https://github.com/pytorch/pytorch/issues/85226
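
A quick hedged sketch (not the PR's test) of the round trip this enables, assuming the backward support added here:
```py
import torch

x = torch.randn(4, 4, requires_grad=True)
s = x.to_sparse_csr()         # the layout conversion itself is now differentiable
loss = s.to_dense().sum()
loss.backward()
print(x.grad)                 # an all-ones gradient flows back through to_sparse_csr
```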

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90281
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2022-12-14 09:05:17 +00:00
708108a9d3 Optimized vertical flip using memcpy (#89414)
## Description

- Use memcpy for vertical flip
- Added bool type support for horizontal flip
  - channels-last input with horizontal flip also goes into cpu_vflip_memcpy and gets a speed-up

Previous PRs:
- https://github.com/pytorch/pytorch/pull/90013
- https://github.com/pytorch/pytorch/pull/88989

## Results

### Horizontal flip

- AVX2 (only cases with speed-up or same perfs for channels last input)
```
[------------------------------------------------------------------------- Horizontal flip -------------------------------------------------------------------------]
                                                                      |  torch (1.14.0a0+giteb3e189) PR  |    Pillow (9.3.0)   |  torch (1.14.0a0+gitb0bd5c4) nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------
      channels=3, size=256, dtype=torch.int64, mf=channels_last       |        204.813 (+-1.018)         |                     |           308.070 (+-1.573)
      channels=3, size=520, dtype=torch.int64, mf=channels_last       |        844.523 (+-2.302)         |                     |           1226.801 (+-5.069)
      channels=3, size=712, dtype=torch.int64, mf=channels_last       |        2246.512 (+-8.935)        |                     |          2689.692 (+-22.654)

      channels=1, size=256, dtype=torch.int32, mf=channels_last       |         21.024 (+-0.083)         |   44.196 (+-0.131)  |            22.564 (+-0.066)
      channels=1, size=520, dtype=torch.int32, mf=channels_last       |         71.806 (+-0.150)         |  166.653 (+-0.789)  |            72.660 (+-0.160)
      channels=1, size=712, dtype=torch.int32, mf=channels_last       |        129.354 (+-0.385)         |  306.998 (+-0.819)  |           130.094 (+-0.274)

      channels=3, size=256, dtype=torch.uint8, mf=channels_last       |        177.250 (+-0.485)         |   44.232 (+-0.465)  |           289.201 (+-2.837)
      channels=3, size=520, dtype=torch.uint8, mf=channels_last       |        699.055 (+-1.940)         |  166.540 (+-0.903)  |           1172.747 (+-3.645)
      channels=3, size=712, dtype=torch.uint8, mf=channels_last       |        1302.968 (+-5.390)        |  307.210 (+-0.852)  |          2149.396 (+-23.570)

      channels=1, size=256, dtype=torch.int16, mf=channels_last       |         11.943 (+-0.079)         |                     |            12.451 (+-0.033)
      channels=1, size=520, dtype=torch.int16, mf=channels_last       |         39.830 (+-0.093)         |                     |            40.583 (+-0.070)
      channels=1, size=712, dtype=torch.int16, mf=channels_last       |         69.001 (+-0.078)         |                     |            69.590 (+-0.162)

      channels=3, size=256, dtype=torch.int8, mf=channels_last        |        177.378 (+-0.507)         |                     |           283.461 (+-2.957)
      channels=3, size=520, dtype=torch.int8, mf=channels_last        |        698.915 (+-1.840)         |                     |          1061.208 (+-10.449)
      channels=3, size=712, dtype=torch.int8, mf=channels_last        |        1299.365 (+-3.919)        |                     |          1957.424 (+-13.149)

      channels=3, size=256, dtype=torch.int8, mf=channels_first       |         17.955 (+-0.077)         |                     |            89.456 (+-0.285)
      channels=3, size=520, dtype=torch.int8, mf=channels_first       |         56.901 (+-0.081)         |                     |           339.802 (+-0.879)
      channels=3, size=712, dtype=torch.int8, mf=channels_first       |        103.629 (+-0.256)         |                     |           627.845 (+-1.185)

      channels=1, size=256, dtype=torch.float32, mf=channels_last     |         21.179 (+-0.077)         |   44.146 (+-0.260)  |            22.957 (+-0.138)
      channels=1, size=520, dtype=torch.float32, mf=channels_last     |         71.685 (+-0.155)         |  166.666 (+-0.730)  |            72.606 (+-0.124)
      channels=1, size=712, dtype=torch.float32, mf=channels_last     |        129.168 (+-0.288)         |  307.094 (+-1.571)  |           130.156 (+-0.453)

      channels=1, size=256, dtype=torch.float16, mf=channels_last     |         33.049 (+-0.089)         |                     |            33.056 (+-0.477)
      channels=1, size=520, dtype=torch.float16, mf=channels_last     |        116.635 (+-0.299)         |                     |           113.433 (+-0.891)
      channels=1, size=712, dtype=torch.float16, mf=channels_last     |        212.134 (+-0.413)         |                     |           204.394 (+-0.822)

      channels=3, size=256, dtype=torch.float64, mf=channels_last     |        207.214 (+-0.586)         |                     |           302.370 (+-0.670)
      channels=3, size=520, dtype=torch.float64, mf=channels_last     |        846.553 (+-2.301)         |                     |           1223.851 (+-5.280)
      channels=3, size=712, dtype=torch.float64, mf=channels_last     |        2251.687 (+-6.513)        |                     |          2711.557 (+-14.011)

      channels=1, size=256, dtype=torch.bfloat16, mf=channels_last    |         33.237 (+-0.072)         |                     |            33.101 (+-0.070)
      channels=1, size=520, dtype=torch.bfloat16, mf=channels_last    |        113.605 (+-0.337)         |                     |           117.067 (+-0.547)
      channels=1, size=712, dtype=torch.bfloat16, mf=channels_last    |        204.632 (+-0.487)         |                     |           212.590 (+-0.848)

      channels=1, size=256, dtype=torch.bool, mf=channels_last        |         7.950 (+-0.030)          |                     |            37.757 (+-0.080)
      channels=1, size=520, dtype=torch.bool, mf=channels_last        |         23.799 (+-0.080)         |                     |           136.571 (+-0.441)
      channels=1, size=712, dtype=torch.bool, mf=channels_last        |         37.970 (+-0.075)         |                     |           246.894 (+-0.926)

      channels=1, size=256, dtype=torch.bool, mf=channels_first       |         8.009 (+-0.077)          |                     |            37.800 (+-0.100)
      channels=1, size=520, dtype=torch.bool, mf=channels_first       |         23.861 (+-0.099)         |                     |           136.553 (+-0.519)
      channels=1, size=712, dtype=torch.bool, mf=channels_first       |         38.211 (+-0.104)         |                     |           246.939 (+-0.692)

Times are in microseconds (us).
```
[Source](https://gist.github.com/vfdev-5/c2ca615b522aeb1c4636dc8d948fec74#file-20221209-100405-pr_vs_nightly-md)

- AVX512 (only cases with speed-up or same perfs for channels last input)
```
[---------------------------------------------------------------------------- Horizontal flip ----------------------------------------------------------------------------]
                                                                      |  torch (1.14.0a0+giteb3e189) PR  |    Pillow (9.3.0)    |  torch (1.14.0.dev20221208+cu116) nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------
      channels=3, size=256, dtype=torch.int64, mf=channels_last       |        194.708 (+-9.566)         |                      |             372.067 (+-12.430)
      channels=3, size=520, dtype=torch.int64, mf=channels_last       |        765.151 (+-10.098)        |                      |            1524.231 (+-111.283)
      channels=3, size=712, dtype=torch.int64, mf=channels_last       |       1587.229 (+-88.117)        |                      |            2950.081 (+-92.322)

      channels=1, size=256, dtype=torch.int32, mf=channels_last       |         13.328 (+-0.375)         |   49.693 (+-1.193)   |              10.323 (+-0.333)
      channels=1, size=520, dtype=torch.int32, mf=channels_last       |         90.580 (+-0.812)         |  191.936 (+-4.369)   |              92.269 (+-0.980)
      channels=1, size=712, dtype=torch.int32, mf=channels_last       |        163.821 (+-3.174)         |  352.053 (+-10.909)  |             165.661 (+-4.436)

      channels=3, size=256, dtype=torch.uint8, mf=channels_last       |        206.862 (+-4.417)         |   49.336 (+-1.492)   |             287.373 (+-7.266)
      channels=3, size=520, dtype=torch.uint8, mf=channels_last       |        829.736 (+-15.857)        |  191.489 (+-5.645)   |            1166.126 (+-45.667)
      channels=3, size=712, dtype=torch.uint8, mf=channels_last       |       1540.953 (+-28.269)        |  352.171 (+-8.784)   |            2171.570 (+-82.740)

      channels=1, size=256, dtype=torch.int16, mf=channels_last       |         7.856 (+-0.131)          |                      |              7.943 (+-0.148)
      channels=1, size=520, dtype=torch.int16, mf=channels_last       |         34.750 (+-1.195)         |                      |              36.309 (+-0.716)
      channels=1, size=712, dtype=torch.int16, mf=channels_last       |         85.858 (+-0.729)         |                      |              87.306 (+-0.981)

      channels=3, size=256, dtype=torch.int8, mf=channels_last        |        206.896 (+-5.716)         |                      |             262.551 (+-6.598)
      channels=3, size=520, dtype=torch.int8, mf=channels_last        |        828.212 (+-13.441)        |                      |            1077.916 (+-28.810)
      channels=3, size=712, dtype=torch.int8, mf=channels_last        |       1542.748 (+-31.379)        |                      |            2003.661 (+-71.614)

      channels=3, size=256, dtype=torch.int8, mf=channels_first       |         11.038 (+-0.271)         |                      |             126.867 (+-5.590)
      channels=3, size=520, dtype=torch.int8, mf=channels_first       |         90.190 (+-1.185)         |                      |             501.446 (+-13.498)
      channels=3, size=712, dtype=torch.int8, mf=channels_first       |        165.797 (+-3.016)         |                      |             921.131 (+-20.500)

      channels=1, size=256, dtype=torch.float32, mf=channels_last     |         13.516 (+-0.578)         |   49.678 (+-1.966)   |              10.360 (+-0.256)
      channels=1, size=520, dtype=torch.float32, mf=channels_last     |         91.195 (+-0.830)         |  191.778 (+-4.742)   |              91.117 (+-0.855)
      channels=1, size=712, dtype=torch.float32, mf=channels_last     |        168.551 (+-3.352)         |  351.585 (+-8.230)   |             164.199 (+-3.725)

      channels=1, size=256, dtype=torch.float16, mf=channels_last     |         35.832 (+-0.840)         |                      |              35.087 (+-0.972)
      channels=1, size=520, dtype=torch.float16, mf=channels_last     |        133.624 (+-5.293)         |                      |             131.423 (+-6.002)
      channels=1, size=712, dtype=torch.float16, mf=channels_last     |        240.702 (+-5.213)         |                      |             236.876 (+-7.867)

      channels=3, size=256, dtype=torch.float64, mf=channels_last     |        192.351 (+-6.740)         |                      |             313.999 (+-12.141)
      channels=3, size=520, dtype=torch.float64, mf=channels_last     |        766.553 (+-16.669)        |                      |            1270.797 (+-49.828)
      channels=3, size=712, dtype=torch.float64, mf=channels_last     |       1501.700 (+-69.499)        |                      |            2427.303 (+-126.694)

      channels=1, size=256, dtype=torch.bfloat16, mf=channels_last    |         35.386 (+-0.801)         |                      |              34.539 (+-0.844)
      channels=1, size=520, dtype=torch.bfloat16, mf=channels_last    |        132.369 (+-4.107)         |                      |             130.926 (+-3.597)
      channels=1, size=712, dtype=torch.bfloat16, mf=channels_last    |        237.722 (+-6.680)         |                      |             237.072 (+-5.027)

      channels=1, size=256, dtype=torch.bool, mf=channels_last        |         6.796 (+-0.132)          |                      |              44.727 (+-0.905)
      channels=1, size=520, dtype=torch.bool, mf=channels_last        |         24.827 (+-0.669)         |                      |             166.758 (+-5.141)
      channels=1, size=712, dtype=torch.bool, mf=channels_last        |         42.392 (+-0.980)         |                      |             310.830 (+-6.130)

      channels=1, size=256, dtype=torch.bool, mf=channels_first       |         8.114 (+-0.141)          |                      |              44.776 (+-0.707)
      channels=1, size=520, dtype=torch.bool, mf=channels_first       |         24.787 (+-0.787)         |                      |             167.766 (+-5.004)
      channels=1, size=712, dtype=torch.bool, mf=channels_first       |         42.545 (+-0.636)         |                      |             313.715 (+-7.603)

Times are in microseconds (us).
```
[Source](https://gist.github.com/vfdev-5/c2ca615b522aeb1c4636dc8d948fec74#file-20221209-105633-pr_vs_nightly-avx512-md)

### Vertical flip

- AVX2 (all tested cases showing speed-up or same perfs)
```
[-------------------------------------------------------------------------- Vertical flip --------------------------------------------------------------------------]
                                                                      |  torch (1.14.0a0+giteb3e189) PR  |    Pillow (9.3.0)   |  torch (1.14.0a0+gitb0bd5c4) nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------
      channels=3, size=256, dtype=torch.int64, mf=channels_last       |         93.125 (+-3.022)         |                     |           101.064 (+-0.436)
      channels=3, size=520, dtype=torch.int64, mf=channels_last       |        412.942 (+-57.066)        |                     |           461.463 (+-2.098)
      channels=3, size=712, dtype=torch.int64, mf=channels_last       |        1533.265 (+-4.071)        |                     |          1829.713 (+-14.311)

      channels=3, size=256, dtype=torch.int64, mf=channels_first      |        101.134 (+-0.924)         |                     |           102.858 (+-0.319)
      channels=3, size=520, dtype=torch.int64, mf=channels_first      |        421.679 (+-1.101)         |                     |           477.413 (+-1.809)
      channels=3, size=712, dtype=torch.int64, mf=channels_first      |        1550.418 (+-3.647)        |                     |           1877.143 (+-6.622)

      channels=1, size=256, dtype=torch.int32, mf=channels_last       |         20.961 (+-0.063)         |   19.515 (+-0.302)  |            21.980 (+-0.070)
      channels=1, size=520, dtype=torch.int32, mf=channels_last       |         71.199 (+-0.173)         |   70.199 (+-0.332)  |            95.262 (+-0.109)
      channels=1, size=712, dtype=torch.int32, mf=channels_last       |        128.532 (+-0.318)         |  127.325 (+-0.328)  |           167.190 (+-0.370)

      channels=1, size=256, dtype=torch.int32, mf=channels_first      |         21.206 (+-0.059)         |   19.471 (+-0.128)  |            21.469 (+-0.064)
      channels=1, size=520, dtype=torch.int32, mf=channels_first      |         71.284 (+-0.163)         |   70.124 (+-0.388)  |            94.988 (+-0.239)
      channels=1, size=712, dtype=torch.int32, mf=channels_first      |        129.017 (+-0.286)         |  128.088 (+-0.461)  |           167.115 (+-1.075)

      channels=3, size=256, dtype=torch.uint8, mf=channels_last       |         16.909 (+-0.057)         |   19.570 (+-0.353)  |            17.981 (+-0.072)
      channels=3, size=520, dtype=torch.uint8, mf=channels_last       |         55.163 (+-0.138)         |   70.218 (+-0.275)  |           107.938 (+-0.620)
      channels=3, size=712, dtype=torch.uint8, mf=channels_last       |         98.518 (+-0.121)         |  127.737 (+-0.486)  |           170.965 (+-0.436)

      channels=3, size=256, dtype=torch.uint8, mf=channels_first      |         18.150 (+-0.084)         |   19.758 (+-0.221)  |            18.122 (+-0.088)
      channels=3, size=520, dtype=torch.uint8, mf=channels_first      |         56.693 (+-0.200)         |   70.278 (+-0.386)  |            89.018 (+-0.206)
      channels=3, size=712, dtype=torch.uint8, mf=channels_first      |        100.409 (+-0.235)         |  127.772 (+-0.457)  |           168.072 (+-0.436)

      channels=1, size=256, dtype=torch.int16, mf=channels_last       |         12.817 (+-0.041)         |                     |            12.818 (+-0.049)
      channels=1, size=520, dtype=torch.int16, mf=channels_last       |         38.359 (+-0.081)         |                     |            63.378 (+-0.165)
      channels=1, size=712, dtype=torch.int16, mf=channels_last       |         68.246 (+-0.090)         |                     |           116.637 (+-0.583)

      channels=1, size=256, dtype=torch.int16, mf=channels_first      |         12.899 (+-0.054)         |                     |            12.649 (+-0.060)
      channels=1, size=520, dtype=torch.int16, mf=channels_first      |         38.404 (+-0.069)         |                     |            63.448 (+-0.108)
      channels=1, size=712, dtype=torch.int16, mf=channels_first      |         68.378 (+-0.104)         |                     |           116.415 (+-0.332)

      channels=3, size=256, dtype=torch.int8, mf=channels_last        |         17.071 (+-0.044)         |                     |            17.792 (+-0.050)
      channels=3, size=520, dtype=torch.int8, mf=channels_last        |         55.163 (+-0.100)         |                     |           108.539 (+-0.466)
      channels=3, size=712, dtype=torch.int8, mf=channels_last        |         98.537 (+-0.091)         |                     |           171.675 (+-0.553)

      channels=3, size=256, dtype=torch.int8, mf=channels_first       |         17.837 (+-0.071)         |                     |            18.355 (+-0.067)
      channels=3, size=520, dtype=torch.int8, mf=channels_first       |         56.051 (+-0.087)         |                     |            88.261 (+-0.129)
      channels=3, size=712, dtype=torch.int8, mf=channels_first       |        100.603 (+-0.245)         |                     |           169.067 (+-0.430)

      channels=1, size=256, dtype=torch.float32, mf=channels_last     |         21.204 (+-0.063)         |   19.607 (+-0.140)  |            22.202 (+-0.094)
      channels=1, size=520, dtype=torch.float32, mf=channels_last     |         71.356 (+-0.211)         |   69.844 (+-0.343)  |            94.614 (+-0.167)
      channels=1, size=712, dtype=torch.float32, mf=channels_last     |        129.087 (+-0.290)         |  127.065 (+-0.319)  |           166.513 (+-0.444)

      channels=1, size=256, dtype=torch.float32, mf=channels_first    |         21.196 (+-0.065)         |   19.156 (+-0.132)  |            21.516 (+-0.073)
      channels=1, size=520, dtype=torch.float32, mf=channels_first    |         71.422 (+-0.180)         |   70.296 (+-0.136)  |            94.913 (+-0.095)
      channels=1, size=712, dtype=torch.float32, mf=channels_first    |        129.045 (+-0.312)         |  128.023 (+-0.585)  |           166.089 (+-0.409)

      channels=1, size=256, dtype=torch.float16, mf=channels_last     |         12.770 (+-0.045)         |                     |            34.853 (+-0.089)
      channels=1, size=520, dtype=torch.float16, mf=channels_last     |         38.363 (+-0.064)         |                     |           131.969 (+-0.577)
      channels=1, size=712, dtype=torch.float16, mf=channels_last     |         67.954 (+-0.107)         |                     |           239.507 (+-0.835)

      channels=1, size=256, dtype=torch.float16, mf=channels_first    |         12.855 (+-0.067)         |                     |            35.124 (+-0.109)
      channels=1, size=520, dtype=torch.float16, mf=channels_first    |         38.725 (+-0.079)         |                     |           131.708 (+-0.586)
      channels=1, size=712, dtype=torch.float16, mf=channels_first    |         68.931 (+-0.086)         |                     |           239.022 (+-0.914)

      channels=3, size=256, dtype=torch.float64, mf=channels_last     |         90.277 (+-0.083)         |                     |           101.512 (+-0.285)
      channels=3, size=520, dtype=torch.float64, mf=channels_last     |        421.277 (+-1.030)         |                     |           471.913 (+-3.654)
      channels=3, size=712, dtype=torch.float64, mf=channels_last     |        1534.394 (+-7.572)        |                     |          1833.262 (+-12.185)

      channels=3, size=256, dtype=torch.float64, mf=channels_first    |        100.809 (+-0.328)         |                     |           103.166 (+-0.335)
      channels=3, size=520, dtype=torch.float64, mf=channels_first    |        425.535 (+-0.926)         |                     |           482.606 (+-1.450)
      channels=3, size=712, dtype=torch.float64, mf=channels_first    |        1550.832 (+-3.547)        |                     |           1859.098 (+-6.517)

      channels=1, size=256, dtype=torch.bfloat16, mf=channels_last    |         12.954 (+-0.051)         |                     |            12.744 (+-0.046)
      channels=1, size=520, dtype=torch.bfloat16, mf=channels_last    |         41.180 (+-0.064)         |                     |            63.362 (+-0.139)
      channels=1, size=712, dtype=torch.bfloat16, mf=channels_last    |         68.136 (+-0.142)         |                     |           117.009 (+-0.292)

      channels=1, size=256, dtype=torch.bfloat16, mf=channels_first   |         13.049 (+-0.052)         |                     |            12.792 (+-0.076)
      channels=1, size=520, dtype=torch.bfloat16, mf=channels_first   |         38.488 (+-0.092)         |                     |            63.451 (+-0.096)
      channels=1, size=712, dtype=torch.bfloat16, mf=channels_first   |         68.103 (+-0.091)         |                     |           116.693 (+-0.290)

      channels=1, size=256, dtype=torch.bool, mf=channels_last        |         7.572 (+-0.029)          |                     |            8.017 (+-0.071)
      channels=1, size=520, dtype=torch.bool, mf=channels_last        |         22.121 (+-0.061)         |                     |            23.614 (+-0.074)
      channels=1, size=712, dtype=torch.bool, mf=channels_last        |         36.896 (+-0.094)         |                     |            39.460 (+-0.084)

      channels=1, size=256, dtype=torch.bool, mf=channels_first       |         7.671 (+-0.028)          |                     |            8.034 (+-0.058)
      channels=1, size=520, dtype=torch.bool, mf=channels_first       |         21.989 (+-0.053)         |                     |            23.645 (+-0.063)
      channels=1, size=712, dtype=torch.bool, mf=channels_first       |         37.252 (+-0.072)         |                     |            39.477 (+-0.100)

      channels=1, size=256, dtype=torch.complex64, mf=channels_last   |         37.129 (+-0.052)         |                     |            37.801 (+-0.101)
      channels=1, size=520, dtype=torch.complex64, mf=channels_last   |        122.646 (+-0.230)         |                     |           139.074 (+-0.467)
      channels=1, size=712, dtype=torch.complex64, mf=channels_last   |        228.946 (+-0.736)         |                     |           257.589 (+-0.545)

      channels=1, size=256, dtype=torch.complex64, mf=channels_first  |         37.088 (+-0.070)         |                     |            37.894 (+-0.078)
      channels=1, size=520, dtype=torch.complex64, mf=channels_first  |        122.695 (+-0.268)         |                     |           138.933 (+-0.336)
      channels=1, size=712, dtype=torch.complex64, mf=channels_first  |        234.655 (+-0.454)         |                     |           255.787 (+-0.530)

Times are in microseconds (us).
```
[Source](https://gist.github.com/vfdev-5/c2ca615b522aeb1c4636dc8d948fec74#file-20221209-100440-pr_vs_nightly-md)

- AVX512 (all tested cases showing speed-up or same perfs)

```
[---------------------------------------------------------------------------- Vertical flip -----------------------------------------------------------------------------]
                                                                      |  torch (1.14.0a0+giteb3e189) PR  |    Pillow (9.3.0)   |  torch (1.14.0.dev20221208+cu116) nightly
1 threads: ---------------------------------------------------------------------------------------------------------------------------------------------------------------
      channels=3, size=256, dtype=torch.int64, mf=channels_last       |        122.544 (+-1.962)         |                     |             129.161 (+-1.809)
      channels=3, size=520, dtype=torch.int64, mf=channels_last       |        508.274 (+-4.790)         |                     |             533.872 (+-7.457)
      channels=3, size=712, dtype=torch.int64, mf=channels_last       |        951.176 (+-29.534)        |                     |            1073.603 (+-44.676)

      channels=3, size=256, dtype=torch.int64, mf=channels_first      |        127.872 (+-2.700)         |                     |             127.326 (+-2.666)
      channels=3, size=520, dtype=torch.int64, mf=channels_first      |        518.019 (+-4.157)         |                     |             538.094 (+-6.600)
      channels=3, size=712, dtype=torch.int64, mf=channels_first      |       1002.176 (+-42.545)        |                     |            1033.989 (+-42.137)

      channels=1, size=256, dtype=torch.int32, mf=channels_last       |         10.025 (+-0.135)         |   10.054 (+-0.369)  |              10.155 (+-0.285)
      channels=1, size=520, dtype=torch.int32, mf=channels_last       |         89.867 (+-0.994)         |   88.712 (+-0.622)  |             103.029 (+-2.254)
      channels=1, size=712, dtype=torch.int32, mf=channels_last       |        161.787 (+-2.080)         |  161.370 (+-1.801)  |             182.608 (+-7.031)

      channels=1, size=256, dtype=torch.int32, mf=channels_first      |         10.005 (+-0.277)         |   9.965 (+-0.338)   |              10.604 (+-0.334)
      channels=1, size=520, dtype=torch.int32, mf=channels_first      |         89.116 (+-0.996)         |   88.840 (+-0.608)  |             102.103 (+-2.111)
      channels=1, size=712, dtype=torch.int32, mf=channels_first      |        164.328 (+-3.284)         |  161.538 (+-2.739)  |             181.702 (+-3.770)

      channels=3, size=256, dtype=torch.uint8, mf=channels_last       |         8.853 (+-0.148)          |   10.292 (+-0.494)  |              8.961 (+-0.190)
      channels=3, size=520, dtype=torch.uint8, mf=channels_last       |         68.368 (+-1.158)         |   90.068 (+-1.780)  |              81.155 (+-0.945)
      channels=3, size=712, dtype=torch.uint8, mf=channels_last       |        125.458 (+-2.511)         |  163.150 (+-2.532)  |             147.039 (+-4.264)

      channels=3, size=256, dtype=torch.uint8, mf=channels_first      |         10.409 (+-0.435)         |   10.406 (+-0.351)  |              10.263 (+-0.252)
      channels=3, size=520, dtype=torch.uint8, mf=channels_first      |         69.077 (+-1.062)         |   90.057 (+-0.992)  |              79.910 (+-0.884)
      channels=3, size=712, dtype=torch.uint8, mf=channels_first      |        127.286 (+-2.789)         |  162.862 (+-2.953)  |             142.821 (+-2.119)

      channels=1, size=256, dtype=torch.int16, mf=channels_last       |         7.513 (+-0.143)          |                     |              7.364 (+-0.154)
      channels=1, size=520, dtype=torch.int16, mf=channels_last       |         33.140 (+-0.779)         |                     |              42.141 (+-0.820)
      channels=1, size=712, dtype=torch.int16, mf=channels_last       |         86.235 (+-1.187)         |                     |             104.205 (+-2.205)

      channels=1, size=256, dtype=torch.int16, mf=channels_first      |         7.410 (+-0.162)          |                     |              7.075 (+-0.126)
      channels=1, size=520, dtype=torch.int16, mf=channels_first      |         33.656 (+-0.914)         |                     |              40.991 (+-0.893)
      channels=1, size=712, dtype=torch.int16, mf=channels_first      |         86.087 (+-1.191)         |                     |             105.419 (+-1.801)

      channels=3, size=256, dtype=torch.int8, mf=channels_last        |         8.802 (+-0.196)          |                     |              8.627 (+-0.202)
      channels=3, size=520, dtype=torch.int8, mf=channels_last        |         66.348 (+-0.775)         |                     |              80.631 (+-1.832)
      channels=3, size=712, dtype=torch.int8, mf=channels_last        |        126.275 (+-2.318)         |                     |             144.597 (+-4.242)

      channels=3, size=256, dtype=torch.int8, mf=channels_first       |         10.255 (+-0.383)         |                     |              10.101 (+-0.335)
      channels=3, size=520, dtype=torch.int8, mf=channels_first       |         68.124 (+-0.849)         |                     |              79.286 (+-0.748)
      channels=3, size=712, dtype=torch.int8, mf=channels_first       |        127.118 (+-2.225)         |                     |             142.029 (+-2.507)

      channels=1, size=256, dtype=torch.float32, mf=channels_last     |         9.850 (+-0.453)          |   9.299 (+-0.253)   |              10.030 (+-0.234)
      channels=1, size=520, dtype=torch.float32, mf=channels_last     |         91.506 (+-1.319)         |   90.265 (+-0.824)  |             107.570 (+-2.093)
      channels=1, size=712, dtype=torch.float32, mf=channels_last     |        167.820 (+-3.883)         |  162.871 (+-2.397)  |             180.046 (+-8.952)

      channels=1, size=256, dtype=torch.float32, mf=channels_first    |         10.118 (+-0.359)         |   10.433 (+-0.479)  |              10.204 (+-0.344)
      channels=1, size=520, dtype=torch.float32, mf=channels_first    |         90.862 (+-1.486)         |   90.138 (+-0.969)  |             107.011 (+-1.801)
      channels=1, size=712, dtype=torch.float32, mf=channels_first    |        163.931 (+-3.653)         |  163.155 (+-2.673)  |             186.707 (+-2.248)

      channels=1, size=256, dtype=torch.float16, mf=channels_last     |         7.304 (+-0.134)          |                     |              24.141 (+-0.444)
      channels=1, size=520, dtype=torch.float16, mf=channels_last     |         35.186 (+-0.656)         |                     |             101.523 (+-1.465)
      channels=1, size=712, dtype=torch.float16, mf=channels_last     |         85.707 (+-0.841)         |                     |             192.640 (+-4.942)

      channels=1, size=256, dtype=torch.float16, mf=channels_first    |         7.286 (+-0.142)          |                     |              24.155 (+-0.555)
      channels=1, size=520, dtype=torch.float16, mf=channels_first    |         33.819 (+-1.009)         |                     |             101.620 (+-3.034)
      channels=1, size=712, dtype=torch.float16, mf=channels_first    |         84.811 (+-0.993)         |                     |             192.286 (+-4.707)

      channels=3, size=256, dtype=torch.float64, mf=channels_last     |        126.273 (+-2.519)         |                     |             128.831 (+-1.975)
      channels=3, size=520, dtype=torch.float64, mf=channels_last     |        551.861 (+-4.159)         |                     |             517.343 (+-4.501)
      channels=3, size=712, dtype=torch.float64, mf=channels_last     |       1102.465 (+-66.427)        |                     |            1224.532 (+-55.656)

      channels=3, size=256, dtype=torch.float64, mf=channels_first    |        129.965 (+-2.083)         |                     |             130.709 (+-2.261)
      channels=3, size=520, dtype=torch.float64, mf=channels_first    |        526.332 (+-5.354)         |                     |             515.399 (+-4.320)
      channels=3, size=712, dtype=torch.float64, mf=channels_first    |       1169.215 (+-78.889)        |                     |            1102.536 (+-51.178)

      channels=1, size=256, dtype=torch.bfloat16, mf=channels_last    |         7.478 (+-0.147)          |                     |              7.154 (+-0.162)
      channels=1, size=520, dtype=torch.bfloat16, mf=channels_last    |         33.836 (+-1.022)         |                     |              38.854 (+-0.648)
      channels=1, size=712, dtype=torch.bfloat16, mf=channels_last    |         85.483 (+-0.582)         |                     |              99.190 (+-2.202)

      channels=1, size=256, dtype=torch.bfloat16, mf=channels_first   |         7.416 (+-0.125)          |                     |              7.169 (+-0.121)
      channels=1, size=520, dtype=torch.bfloat16, mf=channels_first   |         34.958 (+-0.717)         |                     |              40.136 (+-0.784)
      channels=1, size=712, dtype=torch.bfloat16, mf=channels_first   |         85.505 (+-1.207)         |                     |              99.793 (+-2.065)

      channels=1, size=256, dtype=torch.bool, mf=channels_last        |         5.856 (+-0.178)          |                     |              5.824 (+-0.118)
      channels=1, size=520, dtype=torch.bool, mf=channels_last        |         12.030 (+-0.330)         |                     |              14.478 (+-0.554)
      channels=1, size=712, dtype=torch.bool, mf=channels_last        |         30.116 (+-0.639)         |                     |              31.163 (+-0.873)

      channels=1, size=256, dtype=torch.bool, mf=channels_first       |         5.804 (+-0.113)          |                     |              5.825 (+-0.102)
      channels=1, size=520, dtype=torch.bool, mf=channels_first       |         12.043 (+-0.363)         |                     |              14.240 (+-0.341)
      channels=1, size=712, dtype=torch.bool, mf=channels_first       |         30.001 (+-1.001)         |                     |              33.199 (+-0.430)

      channels=1, size=256, dtype=torch.complex64, mf=channels_last   |         29.941 (+-0.861)         |                     |              28.229 (+-0.904)
      channels=1, size=520, dtype=torch.complex64, mf=channels_last   |        173.244 (+-2.577)         |                     |             173.173 (+-2.260)
      channels=1, size=712, dtype=torch.complex64, mf=channels_last   |        323.548 (+-3.338)         |                     |             318.318 (+-2.764)

      channels=1, size=256, dtype=torch.complex64, mf=channels_first  |         29.001 (+-1.029)         |                     |              28.565 (+-2.074)
      channels=1, size=520, dtype=torch.complex64, mf=channels_first  |        173.078 (+-1.993)         |                     |             170.664 (+-1.722)
      channels=1, size=712, dtype=torch.complex64, mf=channels_first  |        324.782 (+-3.759)         |                     |             315.745 (+-2.600)

Times are in microseconds (us).
```

[Source](https://gist.github.com/vfdev-5/c2ca615b522aeb1c4636dc8d948fec74#file-20221209-105707-pr_vs_nightly-avx512-md)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89414
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2022-12-14 08:19:07 +00:00
e54c6c2870 Fix non-existing parameters in docstrings in torch/onnx (#90593)
This is a continuation of https://github.com/pytorch/pytorch/pull/90505

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90593
Approved by: https://github.com/justinchuby
2022-12-14 07:49:14 +00:00
37cd96a6fe inductor: using pre-existing fake mode to fallback kernels (#90814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90814
Approved by: https://github.com/ngimel, https://github.com/jgong5
2022-12-14 07:42:43 +00:00
6c8ef6a4c2 Add tracing context, Integrate dynamo guards into torch._guards (#90647)
As defined here: https://docs.google.com/document/d/1oniZEgAaHE1IMByPRWRKbUHeaW06E2HMfCTCQyMRLek/edit#

This PR creates a new structure, a TracingContext, whose lifecycle matches that of the traced frame. It carries on it a GuardsContext, and eventually, a FakeTensorMode. It is the source of truth of all accumulated guards.

In this PR, we create the structure, and integrate it into dynamo. We do so by mapping OutputGraph's guards structure to its guard structure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90647
Approved by: https://github.com/ezyang
2022-12-14 07:35:32 +00:00
f4099af1e9 Fix gradcheck for BSR and BSC inputs. (#90719)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90719
Approved by: https://github.com/soulitzer, https://github.com/cpuhrsch
2022-12-14 05:37:05 +00:00
a60d712010 Support (non-batch) BSR/BSC to COO sparse tensor conversions (#90718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90718
Approved by: https://github.com/cpuhrsch
2022-12-14 05:37:05 +00:00
cc504ce292 Restore test_warn_types (#90810)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90810
Approved by: https://github.com/ngimel
2022-12-14 05:15:32 +00:00
e8e591b72f Upgrade CI to ROCm5.3 (#88297)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88297
Approved by: https://github.com/malfet
2022-12-14 05:09:56 +00:00
258860fa3a [ao][fx] fixing public v private for pattern_utils.py (#88397)
Summary: made _DEFAULT_FUSION_PATTERNS,
_register_fusion_pattern,
_DEFAULT_QUANTIZATION_PATTERNS,
_DEFAULT_OUTPUT_FAKE_QUANTIZE_MAP,
_DEFAULT_OUTPUT_OBSERVER_MAP,
_register_quant_pattern,
_sorted_patterns_dict private

Test Plan: python test/test_public_bindings.py

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D41015537](https://our.internmc.facebook.com/intern/diff/D41015537)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88397
Approved by: https://github.com/jcaip
2022-12-14 03:40:02 +00:00
769392178a [vision hash update] update the pinned vision hash (#90727)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90727
Approved by: https://github.com/pytorchbot
2022-12-14 03:31:44 +00:00
e87370133c Include dispatch key in wrapper symbol name (#90674)
When looking at gdb traces, this makes it easier to tell that
you're looking at the CPU wrapper vs CUDA wrapper, etc.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90674
Approved by: https://github.com/ngimel
2022-12-14 03:09:22 +00:00
6c605e9c3d [FSDP] Skip param check for pure FP16 (#90785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90785
Approved by: https://github.com/rohan-varma, https://github.com/fegin
2022-12-14 02:35:16 +00:00
e2e4a80cdb Inductor cpp wrapper: support None as output (#88560)
Map `None` to `at::Tensor()` in the cpp wrapper

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88560
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2022-12-14 02:28:22 +00:00
93aee0cdc9 [FSDP][Easy] ufmt files (#90548)
```
ufmt format torch/distributed/fsdp
ufmt format test/distributed/fsdp
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90548
Approved by: https://github.com/mrshenli, https://github.com/rohan-varma
2022-12-14 02:02:53 +00:00
e90169d174 Fix missing return statement for test_it_returns_empty_list_when_model_contains_supported_inplace_ops in #89299 (#90797)
Follow-up to #89299 where the return statement is missing in the test case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90797
Approved by: https://github.com/malfet
2022-12-14 01:45:31 +00:00
510339c07b [FSDP][2/N] Refactor state dict hook registration (#90777)
This PR includes some follow-ups from the previous PR to clean up the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90777
Approved by: https://github.com/rohan-varma
2022-12-14 01:13:19 +00:00
ed050e7a18 Small fixes for better channels last performance (#89616)
1) don't codegen maxpool backward, since it's exceedingly slow
2) better determine reduction variables for more accurate hints
3) deterministic iteration order for reduction arguments: take into account all full-size reduction arguments, and for hints break ties toward the outer reduction

fixes #1653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89616
Approved by: https://github.com/jansel, https://github.com/Chillee
2022-12-14 00:52:35 +00:00
dbe85265a8 Automated submodule update: kineto (#89846)
This is an automated pull request to update the first-party submodule for [pytorch/kineto](https://github.com/pytorch/kineto).

New submodule commit: 72fa713ba6

Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89846
Approved by: https://github.com/aaronenyeshi
2022-12-14 00:25:44 +00:00
d52f121dba [Composable API]Common _State parent class for composable and wrapper FSDP (#89147)
**Why this PR?**

For the composable APIs implementation, sometimes the internal APIs may not have the application (FSDP, DDP) root module but only a local module. One example is the state_dict/optimizer_state_dict implementation of FSDP. These APIs are designed to start from the root module of the model, so it is tricky for them to tell whether an arbitrary submodule is managed by DDP or FSDP.

It will be useful to have APIs like:
`_get_module_state(module)`: return the composable state if this module is managed by composable API.
`_get_module_fsdp_state(module)`: return the FSDP state if this module is managed by FSDP.

**What does this PR propose?**
1. Make `_State` out of `_composable` module so that `FullyShardedDataParallel` can inherit from it.
2. A global `_module_state_mapping: Dict[nn.Module, _State]` that keeps the mapping of all submodules (not just root module) to the state.
3. Create `_get_module_state(module)` to look up `_module_state_mapping`.
4. Create `_get_module_fsdp_state(module)` that uses `_get_module_state(module)` to get the state then verifies if the state is `_FSDPState`.
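
A minimal sketch of items 2–4 above, with placeholder classes standing in for the real `_State`/`_FSDPState`; names and the registration point are illustrative assumptions, not the actual FSDP source:

```py
from typing import Dict, Optional

import torch.nn as nn

class _State:  # placeholder for the shared composable state base class
    pass

class _FSDPState(_State):  # placeholder for the FSDP-specific state
    pass

_module_state_mapping: Dict[nn.Module, _State] = {}

def _insert_module_state(module: nn.Module, state: _State) -> None:
    # Composable APIs would register every managed submodule here, not just the root.
    _module_state_mapping[module] = state

def _get_module_state(module: nn.Module) -> Optional[_State]:
    # Returns the composable state if this module is managed by a composable API.
    return _module_state_mapping.get(module)

def _get_module_fsdp_state(module: nn.Module) -> Optional[_FSDPState]:
    # Returns the FSDP state only if the registered state is an _FSDPState.
    state = _get_module_state(module)
    return state if isinstance(state, _FSDPState) else None
```
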

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89147
Approved by: https://github.com/awgu
2022-12-13 23:58:01 +00:00
b66cedd906 [FSDP] Fix use_orig_params=True + no_sync() (#90546)
`no_sync()` introduces a separate case where a `FlatParameter` maintains an _unsharded_ gradient, instead of a _sharded_ one. This PR fixes `no_sync()` with `use_orig_params=True` by dealing with this separate case.

The existing `use_orig_params=False` already bypasses the built-in parameter/gradient size check, where the `flat_param` is sharded, while the `flat_param.grad` is unsharded. For `use_orig_params=True`, we need to use the same `.data` hack to side step the size check that we used to side step the dtype check for `keep_low_precision_grads=True`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90546
Approved by: https://github.com/rohan-varma
2022-12-13 23:40:04 +00:00
6d425a7ce9 Fix forward AD custom Function non-differentiable outputs (#90787)
Fixes https://github.com/pytorch/pytorch/issues/90067

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90787
Approved by: https://github.com/albanD
2022-12-13 23:13:44 +00:00
9575f2ca83 [LTC] Make some LazyTensor interfaces virtual (#90686)
Summary:
Make some LazyTensor interfaces virtual such that XLA can adopt them. It's related to https://github.com/pytorch/xla/pull/4317.

Test Plan:
CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90686
Approved by: https://github.com/antoniojkim, https://github.com/JackCaoG
2022-12-13 21:38:07 +00:00
bf2668a899 Add support for kineto in memory viz (#90567)
This is just rudimentary initial support that does the same stuff as the trace profile. Follow-ups will add category encodings to the tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90567
Approved by: https://github.com/robieta
2022-12-13 21:31:16 +00:00
b4b8a56589 Doc for Canonical Aten and Prims IR (#90644)
as title.

Sample output: https://docs-preview.pytorch.org/90644/ir.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90644
Approved by: https://github.com/ezyang
2022-12-13 21:30:47 +00:00
65e762acc8 [FSDP][optim_state_dict][5/N] Remove optim_inputs for sharded state_dict. (#89981)
The `optim_inputs` argument is being deprecated, and the sharded optimizer state_dict APIs do not use it, so it is safe to remove it from them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89981
Approved by: https://github.com/awgu
2022-12-13 21:05:04 +00:00
4a2d64994c [FSDP][optim_state_dict][4/N] Remove the unused _get_flat_param_to_fsdp_module API (#89980)
This is an easy PR, just remove an unused internal API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89980
Approved by: https://github.com/awgu
2022-12-13 21:01:46 +00:00
7cd900eb97 [fix] adaptive_{avg, max}_pool variants : cuda & cpu (#88906)
Fixes #78868

#### TODO
- [x] add tests
- [x] adaptive_avg_pool2d
- [x] adaptive_avg_pool3d
- [x] adaptive_max_pool2d
- [x] fix adaptive_max_pool3d_cuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88906
Approved by: https://github.com/mruberry
2022-12-13 20:57:00 +00:00
043de8d1b1 [FSDP][optim_state_dict][3/N] Support use_orig_param optim_state_dict (non-broadcast version) (#89900)
**What:**
This PR adds optim state_dict support for `use_orig_params` when rank0_only is False; rank0_only support will be added in a following PR. The design of this PR focuses on simplicity and may not have good performance, especially for optim state_dict loading. Since optim state_dict loading is only called once at the beginning of training, performance is not the major concern.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89900
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2022-12-13 20:45:21 +00:00
0a4e4de525 [ROCm] add case for FP32MatMulPattern skip property (#84077)
TF32 is not supported on ROCm, so the torch/profiler/_pattern_matcher.py FP32MatMulPattern should return False for ROCm instead of checking the results of torch.cuda.get_arch_list(). Otherwise, depending on the gfx arch running the test, test_profiler.py's test_profiler_fp32_matmul_pattern (__main__.TestExperimentalUtils) will fail.
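
A hedged sketch of the kind of skip check described above; the actual property on FP32MatMulPattern may be implemented differently:

```py
import torch

def _should_skip_fp32_matmul_pattern() -> bool:
    # On ROCm there is no TF32, so the pattern should not run at all.
    if torch.version.hip is not None:
        return True
    # On CUDA, TF32 only exists on Ampere (sm_80) and newer architectures.
    return not any(arch.startswith(("sm_8", "sm_9")) for arch in torch.cuda.get_arch_list())
```
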

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84077
Approved by: https://github.com/jeffdaily, https://github.com/kit1980
2022-12-13 20:27:35 +00:00
79156c11c3 [ao][fx] fixing public v private match_utils.py (#88396)
Summary: made _is_match, _find_matches, _MatchResult private; also added
__all__ to lower_to_qnnpack.py

Test Plan: python test/test_public_bindings.py

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D41015540](https://our.internmc.facebook.com/intern/diff/D41015540)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88396
Approved by: https://github.com/jcaip
2022-12-13 20:16:55 +00:00
a856557b3a [ao][fx] public v private convert.py (#88394)
Summary: made _restore_state,
_has_none_qconfig,
_run_weight_observers,
_maybe_recursive_remove_dequantize,
_get_module_path_and_prefix,
_insert_dequantize_node,
_maybe_get_observer_for_node,
_remove_previous_dequantize_in_custom_module private

Test Plan: python test/test_public_bindings.py

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D41015547](https://our.internmc.facebook.com/intern/diff/D41015547)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88394
Approved by: https://github.com/jcaip
2022-12-13 20:10:12 +00:00
b3d49c2fb8 [FSDP][1/N] fully_shard state dict (#90767)
Co-authored with @rohan-varma.

**Overview**
This adds preliminary `state_dict()` support for `fully_shard`.
- The only explicit branching between composable and wrapper code paths happens in the state dict hook registration, which is inevitable.
- We introduce a `_comm_module_prefix` to match the FQNs between the two code paths. This is needed since for composable, the FQNs are prefixed from the local FSDP root, whereas for state dict purposes, we want them to be prefixed from the comm. module. Thus, we need this `_comm_module_prefix` to be stripped during state dict.
    - In my understanding, the alternative of not using the `prefix` argument in `state_dict()` does not support the case when `fully_shard` is applied to a submodule (i.e. not the global root module), since we still need _part_ of `prefix` then.
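
A minimal sketch of the prefix stripping described above; the helper name is hypothetical:

```py
def _strip_comm_module_prefix(fqn: str, comm_module_prefix: str) -> str:
    # FQNs are generated relative to the local FSDP root; for state_dict we want
    # them relative to the communication module, so drop the extra prefix.
    if fqn.startswith(comm_module_prefix):
        return fqn[len(comm_module_prefix):]
    return fqn

print(_strip_comm_module_prefix("block.mlp.weight", "block."))  # -> mlp.weight
```
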

**Follow-Ups**
- We can retire the `functools.partial` usage once @fegin's PR lands.
- We should add more thorough testing (e.g. sharded state dict, save and load together etc.).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90767
Approved by: https://github.com/rohan-varma, https://github.com/fegin
2022-12-13 20:05:40 +00:00
ad4189c8db [reland][inductor] Update TIMM skip list (#90762)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90762
Approved by: https://github.com/eellison
2022-12-13 19:56:23 +00:00
5c133c5744 [Dynamo] Supports two torch.distributed.* functions (#90683)
Fixes Meta internal user cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90683
Approved by: https://github.com/jansel
2022-12-13 19:06:38 +00:00
21fc28285e [stateless] fix functional call docs (#90476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90476
Approved by: https://github.com/zou3519
2022-12-13 18:23:22 +00:00
4a5f4416d0 Make at::outer SymInt-aware (#90714)
Fixes matmul and related ops with meta; no more xfails needed. The non-working case for matmul was the matrix-vector case, which dispatches to `outer`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90714
Approved by: https://github.com/lezcano
2022-12-13 18:16:09 +00:00
3f14c70576 Make functional inverse for squeeze_copy SymInt-aware (#90697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90697
Approved by: https://github.com/ezyang
2022-12-13 18:15:37 +00:00
1119d2fa54 Revert "Reland "Add heirachical module names to torchFX graph.node" (#90205)"
This reverts commit 6b7efac3c9ea5c9fbfb18069abd254ad7d9a103e.

Reverted https://github.com/pytorch/pytorch/pull/90205 on behalf of https://github.com/seemethere due to Reverting since this caused failures in internal systems, see https://fb.workplace.com/groups/802176577445480/posts/894284641568006 for discussion
2022-12-13 17:47:07 +00:00
1439ebd899 Enable inductor perf test on GCP A100 (#90322)
This PR tries to enable inductor performance nightly testing on the A100 runner provided by GCP. Currently these GCP runners are created and maintained using scripts in https://github.com/fairinternal/pytorch-gha-infra/pull/82.
For some reason the artifacts cannot (and do not need to) be uploaded to S3, so a use-gha parameter is added to _linux-test.yml to avoid creating a new but mostly identical _linux-test.yml.

Workflow test results: https://github.com/pytorch/pytorch/actions/runs/3642340544/jobs/6149691109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90322
Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/desertfire
2022-12-13 17:47:01 +00:00
544756ae5e Fix mps constant pad (#89864)
Support arbitrary dimensions for constant padding on MPS

Fixes #89624
Fixes #87277
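
A hedged usage example of the fixed behavior (requires an MPS-enabled build); the shapes are illustrative only:

```py
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 4, 5, device="mps")
# Constant padding over the last three dimensions, which previously failed on MPS.
y = F.pad(x, pad=(1, 1, 2, 2, 0, 1), mode="constant", value=0.0)
print(y.shape)  # torch.Size([2, 4, 8, 7])
```
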

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89864
Approved by: https://github.com/kulinseth, https://github.com/malfet
2022-12-13 17:28:54 +00:00
7035bcdd0f [inductor] Enable test_torch (#90518)
Summary: Skipping failures in those tests so that CI can guard other
passing cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90518
Approved by: https://github.com/jansel
2022-12-13 16:21:35 +00:00
0d5c849d48 Update cuSPARSE usage for CUDA 12.0 (#90765)
cuSPARSE v12.0 has started to use const pointers for the descriptors, from `cusparse.h` (documentation is incorrect):
```cpp
typedef struct cusparseSpVecDescr const* cusparseConstSpVecDescr_t;
typedef struct cusparseDnVecDescr const* cusparseConstDnVecDescr_t;
typedef struct cusparseSpMatDescr const* cusparseConstSpMatDescr_t;
typedef struct cusparseDnMatDescr const* cusparseConstDnMatDescr_t;
```
Changing also the function signature for the corresponding destructors to accept a const pointer. This PR adds `ConstCuSparseDescriptorDeleter` working with `cusparseStatus_t (*destructor)(const T*)`.

Some algorithm enums were deprecated during CUDA 11 and removed in CUDA 12; I replaced the following occurrences:
```
CUSPARSE_CSRMM_ALG1 -> CUSPARSE_SPMM_CSR_ALG1
CUSPARSE_COOMM_ALG1 -> CUSPARSE_SPMM_COO_ALG1
CUSPARSE_COOMM_ALG2 -> CUSPARSE_SPMM_COO_ALG2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90765
Approved by: https://github.com/cpuhrsch
2022-12-13 15:55:56 +00:00
d4dda519c9 Fix FSDP checkpoint tests (#90745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90745
Approved by: https://github.com/awgu
2022-12-13 15:34:25 +00:00
a76032d8f4 [inductor] Pattern match cat->view*->pointwise and hoist pointwise (#90743)
Summary:
Inductor can't fuse pointwise into the output of concat, but it can
fuse into the inputs, and that's the same thing.  So we hoist pointwise through
a concat (followed by an optional series of views).

Test Plan: New unit test

Differential Revision: D41901656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90743
Approved by: https://github.com/jiawenliu64, https://github.com/jansel
2022-12-13 15:18:01 +00:00
da8f539e84 [Fix]: Add missing std::vector reserve in aten and torch/csrc (#90627)
Applies some clang-tidy static analysis fixes in places where a std::vector could call .reserve() first to allocate the appropriate amount of space.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90627
Approved by: https://github.com/ezyang
2022-12-13 14:46:27 +00:00
4d494986af [functorch] Refactor life handle storage (#90317)
A "life handle" is a pointer-to-boolean that says whether or not a
TensorWrapper is alive. A TensorWrapper is alive if we are currently
inside of its corresponding transform. An Interpreter is alive if we are
currently inside of its corresponding transform. I.e., for vmap(f)(x),
the BatchedTensor(x, level=1) is alive inside of the execution of f; and
the corresponding VmapInterpreter is alive inside of f.

Previously, there was a global map of level to life handle. It is
possible to get into a state where we have multiple levels that refer to
different Interpreters (if the implementation of an operator calls into
functorch) and that messes up the global map.

This PR changes it so that
- every Interpreter holds a life handle that says if it is alive
- to construct a TensorWrapper, one must either (a) directly pass it a life
handle, or (b) one must create the TensorWrapper when the corresponding
Interpreter is on the stack (and we will automatically grab the life
handle by indexing into the DynamicLayerStack with the level)

(a) is more robust so I changed most of our C++ callsites to do that.
(b) feels a bit hacky to me, but it seems fine for now:
- It'll raise a nice error message if the interpreter isn't on the stack
- all of our Python callsites already follow this convention (we construct
TensorWrappers after pushing the Interpreter onto the stack).

The alternative to (b) is that we always do (a), which we can do in the
future if (b) runs us into any problems.

Test Plan:
- all functorch tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90317
Approved by: https://github.com/samdow
2022-12-13 14:45:18 +00:00
24c3ad7851 Move private forward grad mode helpers to torch.autograd.forward_ad (#90240)
Motivation
- These were previously defined in functorch. They are not
functorch-specific, so I'm moving them to torch.autograd.forward_ad and
the autograd python bindings.
- I need this to avoid some of my cyclic import problems.

Should these be public APIs? Probably. Though this needs discussion, so
punting it to the future.

Test Plan:
- moved the tests of these from test/functorch/test_eager_transforms.py
to test/test_autograd.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90240
Approved by: https://github.com/soulitzer
2022-12-13 14:14:02 +00:00
3049d99027 autograd.Function supports vmap staticmethod (#90037)
This PR adds a `vmap` staticmethod to autograd.Function and a
corresponding vmap kernel for custom_function_call. These two items mean
that autograd.Function with a vmap staticmethod can be used with vmap.

```py
class NumpyMul(torch.autograd.Function):
    @staticmethod
    def forward(x, y):
        # to_numpy is assumed to convert a Tensor to a NumPy array,
        # e.g. lambda t: t.detach().cpu().numpy()
        return torch.tensor(to_numpy(x) * to_numpy(y), device=x.device)

    @staticmethod
    def setup_context(ctx, outputs, x, y):
        ctx.save_for_backward(x, y)

    @staticmethod
    def backward(ctx, grad_output):
        x, y = ctx.saved_tensors
        gx = None
        if isinstance(x, torch.Tensor) and x.requires_grad:
            gx = NumpyMul.apply(grad_output, y)
        gy = None
        if isinstance(y, torch.Tensor) and y.requires_grad:
            gy = NumpyMul.apply(grad_output, x)
        return gx, gy

    @staticmethod
    def vmap(info, in_dims, x, y):
        x_bdim, y_bdim = in_dims
        x = x.movedim(x_bdim, -1) if x_bdim is not None else x.unsqueeze(-1)
        y = y.movedim(y_bdim, -1) if y_bdim is not None else y.unsqueeze(-1)
        result = NumpyMul.apply(x, y)
        result = result.movedim(-1, 0)
        return result, 0
```

API Spec
- the staticmethod takes two arguments (info, in_dims) as well as the
unexpanded inputs (x, y).
- If we think about it as `vmap(info, in_dims, *args)`, `in_dims` is a
pytree with the same tree structure as args. It has None if the arg is
not being vmapped over and an integer vmapped dimension index if it is.
- `info` is an object with metadata about the vmap. It currently has one
field, `info.batch_size`. In the future we can extend this by adding
things like the randomness information.
- If there is a single vmap going on, (x, y) are NOT BatchedTensors,
they've already been unpacked.
- We expect the user to return a `(outputs, out_dims)` tuple. `out_dims`
must "broadcast" to the same pytree structure as `outputs`.

Semantics
- vmap(NumpyMul.apply)(x) will apply the vmap staticmethod if there is
one and will never actually run NumpyMul.forward.
- In order for the autograd.Function to support nested vmap (e.g.,
`vmap(vmap(NumpyMul.apply))(x)`, then the vmap staticmethod must call
into operations that vmap understands (i.e. PyTorch operators or more
autograd.Function).
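
For concreteness, a hedged usage sketch of the example above; it assumes a build that includes this PR's support, functorch-style `vmap`, and a small `to_numpy` helper:

```py
import torch
from functorch import vmap  # torch.func.vmap in later releases

def to_numpy(t):
    # assumed helper used by NumpyMul.forward above
    return t.detach().cpu().numpy()

x = torch.randn(4, 3)
y = torch.randn(4, 3)

# Dispatches to NumpyMul.vmap with in_dims=(0, 0) instead of tracing
# NumpyMul.forward with batched tensors.
out = vmap(NumpyMul.apply)(x, y)
print(out.shape)  # torch.Size([4, 3])
```
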

At a high level, this PR:
- adds a vmap rule for custom_function_call

Testing
- Added some tests for in_dims and info
- Added vmap staticmethod to most of the autograd.Function in
autograd_function_db and sent them through functorch's vmap-related
OpInfo tests

Future
- Better error messages if the user gets the return contract wrong. I
didn't include them in this PR because it might involve a refactor of
some of the existing code in functorch/_src/vmap.py that will add
~200LOC to the PR, but LMK if you'd prefer it here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90037
Approved by: https://github.com/samdow, https://github.com/soulitzer
2022-12-13 14:14:02 +00:00
4dc7d87421 [LTC] Make LazyGraphExecutor::RunPostOrder() virtual (#90680)
Summary:
This patch makes LazyGraphExecutor::RunPostOrder() virtual such that XLA can reuse it.

It's related to https://github.com/pytorch/xla/pull/4315.

Test Plan:
CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90680
Approved by: https://github.com/antoniojkim, https://github.com/JackCaoG
2022-12-13 13:39:23 +00:00
af4735d3ad Revert "Upgrade CI to ROCm5.3 (#88297)"
This reverts commit 181a82ffd26d85bb8dda1b2551dffab2bc04452d.

Reverted https://github.com/pytorch/pytorch/pull/88297 on behalf of https://github.com/IvanYashchuk due to Tests are unnecessarily skipped on all platforms
2022-12-13 12:23:44 +00:00
96a36c9a3b Fix: Apply clang-tidy to c10/core (#90699)
Enables clang-tidy on 'c10/core'. Requested by @ezyang to extend clang-tidy coverage for better performance linting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90699
Approved by: https://github.com/ezyang
2022-12-13 12:07:36 +00:00
ff1bbc2773 Revert "[reland][dynamo] use optimizers correctly in benchmarking (#87492)" (#90746)
This reverts commit d91d7a322172da4d92672301f3cfa3344d544a9e.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90746
Approved by: https://github.com/anijain2305
2022-12-13 11:37:16 +00:00
eae0f3f5e3 Add mkl implementation for exponential on CPU (#69967)
### Description
Add mkl implementation for exponential on CPU to improve the performance of exponential.

### Testing
data type: float32
single socket (28cores):
```
before: torch.Size([10, 128, 10, 124])  0.065 s
        torch.Size([10, 128, 20, 124])  0.130 s

after:  torch.Size([10, 128, 10, 124])  5.9e-05 s
        torch.Size([10, 128, 20, 124])  0.000113 s
```
single core:
```
before: torch.Size([10, 128, 10, 124])  0.065 s
        torch.Size([10, 128, 20, 124])  0.130 s

after:  torch.Size([10, 128, 10, 124])  0.00117 s
        torch.Size([10, 128, 20, 124])  0.002347 s
```
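
A hedged sketch of how the timings above could be measured; the exact methodology of the original benchmark is not reproduced:

```py
import time

import torch

x = torch.empty(10, 128, 20, 124, dtype=torch.float32)

start = time.perf_counter()
x.exponential_()  # in-place exponential sampling, the op this PR speeds up on CPU
elapsed = time.perf_counter() - start
print(f"exponential_ on {tuple(x.shape)} took {elapsed:.6f} s")
```
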

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69967
Approved by: https://github.com/frank-wei, https://github.com/jgong5
2022-12-13 09:51:24 +00:00
a50fe978f8 [LTC] Make even more LazyGraphExecutor interfaces virtual (#90650)
Summary:
This patch makes the following interfaces virtual for XLA to adopt:
1. LazyGraphExecutor::Async.
2. TensorCollectionBarrier
3. SyncLiveTensorsGraph

It's related to https://github.com/pytorch/xla/pull/4314.

Test Plan:
CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90650
Approved by: https://github.com/wconstab
2022-12-13 09:03:28 +00:00
fc429512d5 [FSDP] Clean up FlatParamHandle dtypes, post-backward hook (#90660)
This PR reworks the internal handling of parameter and gradient reduction mixed precision, cleans up the post-backward hook logic, and adds some minor changes to the communication hooks.

**Overview**
This PR addresses everything in https://github.com/pytorch/pytorch/issues/90657 except renaming `keep_low_precision_grads` to `keep_grads_in_reduce_dtype`, since that is BC breaking. I recommend reading the issue before proceeding.

For `MixedPrecision(param_dtype, reduce_dtype, ...)`, the exact rule for parameter and gradient reduction mixed precision that we are following is:
> If `param_dtype is not None` and `reduce_dtype is None`, then we infer `reduce_dtype = param_dtype`. Otherwise, we take `param_dtype` and `reduce_dtype` as is.

This PR enforces that, at the `FlatParamHandle` level, `handle._config.fwd_bwd_param_dtype` and `handle._config.reduce_dtype` are never `None`. The way to check if mixed precision is enabled is to compare against the original parameter dtype, which is now stored in `handle._orig_param_dtype`; checking against `None` is no longer the right test.
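
A minimal sketch of the inference rule above; `_infer_reduce_dtype` is a hypothetical helper name, not the actual FSDP code:

```py
from typing import Optional

import torch

def _infer_reduce_dtype(
    param_dtype: Optional[torch.dtype], reduce_dtype: Optional[torch.dtype]
) -> Optional[torch.dtype]:
    # Only infer when param_dtype is given and reduce_dtype is not; otherwise
    # both values from MixedPrecision(...) are taken as is.
    if param_dtype is not None and reduce_dtype is None:
        return param_dtype
    return reduce_dtype

assert _infer_reduce_dtype(torch.float16, None) == torch.float16
assert _infer_reduce_dtype(torch.float16, torch.float32) == torch.float32
assert _infer_reduce_dtype(None, None) is None
```
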

This avoids ambiguous cases such as when the user passes `MixedPrecision(param_dtype=torch.float32)`. In that case, our existing implementation mistakenly thinks that parameter mixed precision is enabled and either relies on no-ops silently or errors (such as one case reported by MosaicML).

**Additional Details**
- We remove `FullyShardedDataParallel._mixed_precision_enabled_for_params`, `FullyShardedDataParallel._mixed_precision_enabled_for_reduce`, and `FullyShardedDataParallel._mixed_precision_keep_low_precision_grads` since they are not used.
- The unit test `test_meta_device_with_mixed_precision()` exercises a tricky edge case with meta device initialization, `apply()` (calling into `summon_full_params()`), and `param_dtype=torch.float32` for a nested wrapping case, where each nested instance has parameters.
- We include some minor fixes/improvements to the communication hook implementation.

**Follow-Ups**
- We should get rid of `HandleConfig` and store its fields as attributes on `FlatParamHandle` directly.
- Rename `keep_low_precision_grads` to `keep_grads_in_reduce_dtype`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90660
Approved by: https://github.com/zhaojuanmao
2022-12-13 07:34:59 +00:00
ffa89033c5 TorchDynamo: always convert tensor to fake tensor at fake_mode path for ShapeProp (#90685)
This PR will fix https://github.com/pytorch/torchdynamo/issues/1978. For HF models, a ShapeProp error was always reported; the root cause is that we use fake tensor mode to do the ShapeProp, but **torch.ones** always produces a non-fake tensor, which introduces an operation mixing non-fake tensors with fake tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90685
Approved by: https://github.com/ezyang, https://github.com/jansel
2022-12-13 06:59:43 +00:00
7a7f29704f Remove hard numpy dep introduced by _inductor/utils.py (#90716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90716
Approved by: https://github.com/cpuhrsch
2022-12-13 04:58:26 +00:00
181a82ffd2 Upgrade CI to ROCm5.3 (#88297)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88297
Approved by: https://github.com/malfet
2022-12-13 04:50:06 +00:00
7498e23bd5 Re-enabled 2 Metaprogramming tests on Windows (#87284)
With C++17 these tests are not failing

Fixes #25161

Depends on https://github.com/pytorch/pytorch/pull/85969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87284
Approved by: https://github.com/soulitzer
2022-12-13 04:34:26 +00:00
dc4d18d47d Remove hack to hard code test times (#90720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90720
Approved by: https://github.com/janeyx99
2022-12-13 04:28:01 +00:00
1f86a1447b [c10d] remove some outdated bc checks for c10d op (#90681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90681
Approved by: https://github.com/H-Huang
2022-12-13 04:21:45 +00:00
7da504508d [c10d] update alltoall signature to be more consistent (#90569)
The alltoall signature is updated to be more consistent with its arguments; this is a BC-breaking change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90569
Approved by: https://github.com/mrshenli
2022-12-13 04:18:02 +00:00
f30694c700 Add allgather_into_tensor to CommTensor (#90565)
This PR adds _all_gather_base_ to CommTensor to support allgather_base
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90565
Approved by: https://github.com/mrshenli
2022-12-13 04:18:02 +00:00
b782927ed4 Add reduce_scatter_tensor to CommTensor (#90564)
This PR adds reduce_scatter_base to the CommTensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90564
Approved by: https://github.com/mrshenli
2022-12-13 04:18:02 +00:00
3ba9e4cd55 Add alltoall_ to CommTensor (#90512)
This PR adds alltoall_ to the CommTensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90512
Approved by: https://github.com/mrshenli
2022-12-13 04:18:02 +00:00
6165a1807d [LTC] Make DeviceContextArena protected (#90531)
Summary:
This patch makes DeviceContextArena protected such that XLAGraphExecutor can reuse it. In addition, it makes all methods that utilize DeviceContextArena virtual such that XLAGraphExecutor can override them to provide its own DeviceContextArena.

P.S. This patch depends on pytorch/xla#4307 too.

Test Plan:
CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90531
Approved by: https://github.com/antoniojkim, https://github.com/JackCaoG
2022-12-13 04:17:41 +00:00
b8f35ec6a5 Guard Symbol and ShapeGuardPrinter behind HAS_SYMPY (#90704)
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>

Follow up to https://github.com/pytorch/pytorch/pull/90528

Fixes https://github.com/pytorch/pytorch/issues/90696
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90704
Approved by: https://github.com/weiwangmeta, https://github.com/atalman, https://github.com/malfet
2022-12-13 03:56:56 +00:00
ea64c8c6ad Revert "[torchgen] Let native function declaration generation logic take a callable (#90590)"
This reverts commit de6beca838a4ff8f08ec2f51934f8c35cf5260ce.

Reverted https://github.com/pytorch/pytorch/pull/90590 on behalf of https://github.com/seemethere due to Causes internal failures, see https://www.internalfb.com/intern/sandcastle/job/4503600464398605/insights
2022-12-13 03:41:04 +00:00
b3e6a6dc0b Revert "[torchgen] Introduce Executorch types and signatures (#90591)"
This reverts commit ddf00c803b2a99f4eec8a040b53ee18f62800fdd.

Reverted https://github.com/pytorch/pytorch/pull/90591 on behalf of https://github.com/seemethere due to Part of a stack that causes internal failures, see https://www.internalfb.com/intern/sandcastle/job/4503600464398605/insights
2022-12-13 03:36:31 +00:00
42a5f6ee5d Create stub function for doing SDPA cpp and cuda dispatch (#90576)
## Summary
Torch.compile was previously not working for TransformerEncoder because torch.SDPA calls a native function on tensors that returns an int. This PR instead creates a dispatch stub for that function so that a separate fx node is not created for it.
This PR also adds meta functions for the fused kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90576
Approved by: https://github.com/cpuhrsch
2022-12-13 03:19:40 +00:00
df569367ef Fix non-existing parameters in docstrings in torch/fx (#90594)
This is a continuation of https://github.com/pytorch/pytorch/pull/90505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90594
Approved by: https://github.com/clee2000
2022-12-13 01:19:28 +00:00
94b9bb324f [quant] Add example for lowering quantized dynamic linear pattern through delegation (#90640)
Summary: Only the pattern part, will leave the delegation example to Chen

Test Plan: buck run executorch/exir/tests:quant_lowering_custom_backend_pass -- "executorch.exir.tests.test_quant_lowering_custom_backend_pass.TestQuantLoweringCustomBackendPass.test_quantized_linear_dynamic"

Reviewed By: cccclai

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90640
Approved by: https://github.com/cccclai
2022-12-13 00:57:33 +00:00
b6f114c208 Fix a minor typo in documentation (#90667)
This change fixes a typo in function's documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90667
Approved by: https://github.com/kit1980
2022-12-13 00:41:25 +00:00
98a9235dce Fix prelu ref when a.ndim < 2 (#89809)
Fixes https://github.com/pytorch/pytorch/issues/89560

Previously the test case for "input is 1-D or scalar + weight is not scalar" did not exist; adding it introduced some failures:
- forward AD (fixed in this PR)
- vmap (filed https://github.com/pytorch/pytorch/issues/89895)
- ref/meta (fixed in this PR, though this also regresses nvFuser support)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89809
Approved by: https://github.com/ngimel
2022-12-12 23:55:31 +00:00
34dc34e8a0 Add comment to output_code in dynamo config (#90333)
Title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90333
Approved by: https://github.com/mlazos
2022-12-12 23:36:01 +00:00
7bb97c4ca4 move TypedStorage handling to assertEqual (#89557)
#85303 added a patch to `torch.testing.assert_close` to handle `torch.storage.TypedStorage`'s. This change is not reflected in the docs and is not intended for the public API. This PR removes the patch once again and moves the behavior to `TestCase.assertEqual` instead. Meaning, `TypedStorage`'s are again not supported by the public API, but the behavior is the same for all internal use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89557
Approved by: https://github.com/kurtamohler, https://github.com/mruberry
2022-12-12 23:26:00 +00:00
17941b12e0 Fix a typo in some torch.load error message. (#90662)
Very cosmetic change: only fixes a small typo in an error message that torch.load could raise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90662
Approved by: https://github.com/kit1980
2022-12-12 22:34:57 +00:00
e2674aafed [Dynamo] Supports calling parent class‘s non classmethod from child class (#90682)
Fixes https://github.com/pytorch/pytorch/issues/90558
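A hedged example of the pattern this enables (an illustrative module, not taken from the PR or the issue):
```
import torch

class Parent(torch.nn.Module):
    def scale(self, x):  # a plain method, not a classmethod or staticmethod
        return x * 2

class Child(Parent):
    def forward(self, x):
        return Parent.scale(self, x)  # explicit call through the parent class

compiled = torch.compile(Child())  # tracing happens on the first call with real inputs
```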

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90682
Approved by: https://github.com/jansel
2022-12-12 22:33:46 +00:00
e11650887e [ao] fix incorrect integer cast on histogram observer bounds (#90355)
Summary: A cast to int was added in
https://github.com/pytorch/pytorch/pull/45630 to make mypy not complain.
However this leads to unexpected behavior where the histogram doesn't
actually capture the full range of activation values.
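A toy illustration of the truncation (not the observer's actual code): casting the bounds to int silently shrinks the covered range.
```
min_val, max_val = -2.7, 3.9
print(int(min_val), int(max_val))  # -2 3 -> histogram would cover [-2, 3] instead of [-2.7, 3.9]
```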

note1: the test_histogram_observer_against_reference test was secretly
broken on master. The random parameters that normally get used apparently don't
cause a test failure, but running the test repeatedly in a loop would
eventually make it fail. This was due to, in some cases,
sum(<tensor>)!=torch.sum(<tensor>).item(). I was not able to reproduce
this with a toy example, but running this test in a loop and editing
either observer to print the calculation for 'total' would break the
test and show different behaviors. Fixing this test was necessary to
land this PR since the changed histogram bounds changed things enough
that this test would error.

note2: updating histogram observer breaks some BC tests unless I regenerate the
model using the HistogramObserver from this PR

Test Plan: python test/test_quantization.py TestHistogramObserver.test_histogram_observer_correct_numel

python test/test_quantization -k histogram

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90355
Approved by: https://github.com/vkuzo
2022-12-12 20:30:44 +00:00
60e196c241 Better url in trymerge (#90583)
example:
old: 453b510b2d/checks
new: https://github.com/pytorch/pytorch/actions/runs/3644518486
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90583
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi
2022-12-12 19:19:56 +00:00
f258753799 [ONNX] Add repro export from GraphInfo (#89947)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89947
Approved by: https://github.com/justinchuby
2022-12-12 19:13:39 +00:00
525c33c09f [ONNX] Verification tool to find mismatch in model export (#89946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89946
Approved by: https://github.com/justinchuby
2022-12-12 17:56:48 +00:00
4ed175bfb7 fix with statement in test_fsdp_hybrid_shard.py (#90580)
Fixes PR #89915.  The following syntax was not permitted until Python 3.10:

```
with (
    patch_allreduce(patched_allreduce),
    patch_reduce_scatter(patched_reduce_scatter),
):
```
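Two pre-3.10-compatible rewrites, sketched with the helpers from the snippet above (the exact fix in the PR may differ):
```
import contextlib

# Option 1: a single unparenthesized with-statement
with patch_allreduce(patched_allreduce), patch_reduce_scatter(patched_reduce_scatter):
    ...

# Option 2: ExitStack, useful when the list of context managers gets long
with contextlib.ExitStack() as stack:
    stack.enter_context(patch_allreduce(patched_allreduce))
    stack.enter_context(patch_reduce_scatter(patched_reduce_scatter))
    ...
```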

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90580
Approved by: https://github.com/awgu
2022-12-12 17:43:30 +00:00
06326a7721 [optim] skip .item calls in all optimizers when compiling with dynamo (#88173)
@mlazos: skips `item()` calls when compiling with dynamo by defining a helper function `_get_value`, which returns either the result of `.item()` or, under dynamo, the scalar CPU tensor itself. This was done because removing `item()` calls outright significantly regresses eager perf. Additionally, `_dispatch_sqrt` calls the appropriate sqrt function (math.sqrt, or torch.sqrt).
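A simplified sketch of the two helpers (illustrative; the actual signatures and the dynamo predicate used may differ):
```
import math
import torch

def _get_value(x):
    # Keep the scalar CPU tensor while dynamo is tracing (avoids a graph break);
    # use the faster .item() in eager mode.
    if torch._dynamo.is_compiling():
        return x
    return x.item()

def _dispatch_sqrt(x):
    return torch.sqrt(x) if isinstance(x, torch.Tensor) else math.sqrt(x)
```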

Fixes https://github.com/pytorch/torchdynamo/issues/1083

This PR will no longer be needed once symint support is default.

This PR closes all remaining graph breaks in the optimizers (!!)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88173
Approved by: https://github.com/albanD
2022-12-12 17:32:35 +00:00
7541c9f8be [Fix]: remove unnecessary copies in aten, c10, and torch bindings (#90629)
Applies various automated fixes that reduce the number of spurious copies in torch, aten, and c10. I also inlined any default dtors that would have made the type trivially destructible.

Follow up to #89000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90629
Approved by: https://github.com/ezyang
2022-12-12 17:05:52 +00:00
27932ff8c9 [Inductor] Add note that stride_vars result may be inaccurate (#90184)
Strides are determined by substituting 1 and 0 for different indices, which
will fail for any expression that doesn't match the expected stride calculation.
So, let's add a note to make this clear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90184
Approved by: https://github.com/jansel, https://github.com/ngimel
2022-12-12 16:49:12 +00:00
dcce5677fd Adding test when registering a batching rule for a CompositeImplicitAutograd operation (#89465)
This is a follow-on from https://github.com/pytorch/pytorch/pull/88771, which should close out https://github.com/pytorch/functorch/issues/1009. I've got another PR where I'm moving some operators over: https://github.com/pytorch/pytorch/pull/89762

You can see that the new test file is being picked up in [this run](https://github.com/pytorch/pytorch/actions/runs/3617298059/jobs/6096218583#step:10:472)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89465
Approved by: https://github.com/zou3519
2022-12-12 16:21:07 +00:00
e37c8c8436 Revert "[inductor] Update TIMM skip list (#90188)"
This reverts commit fd3f5d7bf7247be662fcb47156bcbe4c6fa04903.

Reverted https://github.com/pytorch/pytorch/pull/90188 on behalf of https://github.com/desertfire due to flaky accuracy failure
2022-12-12 15:31:50 +00:00
0b3316ad2c Don't enable debug_fake_crossref for TORCH_COMPILE_DEBUG (#90666)
It is kind of flaky, it doesn't work with dynamic shapes, and I think the debug interpreter is a better way to detect if you've had a size/stride propagation accident.

Fixes https://github.com/pytorch/pytorch/issues/90652

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90666
Approved by: https://github.com/voznesenskym
2022-12-12 14:20:40 +00:00
f7365eca90 Add unbacked symints support; item works now (#90624)
The big idea is to add `create_unbacked_symfloat` and `create_unbacked_symint` to ShapeEnv, allowing you to allocate symbolic floats/ints corresponding to data you don't know about at compile time. Then, instead of immediately erroring out when you try to call local_scalar_dense on a FakeTensor, we instead create a fresh symint/symfloat and return that.

There a bunch of odds and ends that need to be handled:

* A number of `numel` calls converted to `sym_numel`
* When we finally return from item(), we need to ensure we actually produce a SymInt/SymFloat when appropriate. The previous binding code assumed that you would have to get a normal Python item. I add a pybind11 binding for Scalar (to PyObject only) and refactor the code to use that. There is some trickiness where you are NOT allowed to go through c10::SymInt if there isn't actually any SymInt involved. See comment.
* One of our unit tests tripped an implicit data dependent access which occurs when you pass a Tensor as an argument to a sizes parameter. This is also converted to support symbolic shapes
* We now support tracking bare SymInt/SymFloat returns in proxy tensor mode (this was already in symbolic-shapes branch)
* Whenever we allocate an unbacked symint, we record the stack trace it was allocated at. These get printed when you attempt data dependent access on the symint (e.g., you try to guard on it)
* Subtlety: unbacked symints are not necessarily > 1. I added a test for this.

These unbacked symints are not very useful right now as you will almost always immediately raise an error later when you try to guard on them. The next logical step is adding an assertion refinement system that lets ShapeEnv learn facts about unbacked symints so it can do a better job eliding guards that are unnecessary.
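Illustrative only (not from the PR): the user-visible effect is that a data-dependent scalar can now flow through tracing instead of erroring at the `item()` call, though guarding on it may still fail.
```
import torch

def f(x):
    n = x.sum().item()  # under fake tensors this now yields an unbacked symbolic scalar
    return torch.full((4,), n)
```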

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90624
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
2022-12-12 13:33:07 +00:00
6702345416 [xla hash update] update the pinned xla hash (#90161)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90161
Approved by: https://github.com/pytorchbot
2022-12-12 10:27:03 +00:00
5adc18dcbc Shape guard structure (#90679)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90679
Approved by: https://github.com/ezyang
2022-12-12 09:50:00 +00:00
2e0ce24890 [Dynamo] Support access nn.Module keys (#90502)
Fixes https://github.com/pytorch/torchdynamo/issues/1973
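A hedged guess at the kind of pattern this supports (an illustrative module, not taken from the issue):
```
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.ModuleDict({"a": torch.nn.ReLU(), "b": torch.nn.Tanh()})

    def forward(self, x):
        for name in self.layers.keys():  # iterating the module's keys inside forward
            x = self.layers[name](x)
        return x
```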

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90502
Approved by: https://github.com/jansel
2022-12-12 09:15:42 +00:00
c8ed84ad06 Fix a static initialization order fiasco in c10d (#90149)
The `TORCH_LIBRARY_IMPL` registrations in `OpsImpl.cpp` need to happen after `ProcessGroup` is registered as a torch class -- which happens in `Ops.cpp`. However, the order of the registrations is undefined between the two files.

If the registration in `OpsImpl.cpp` runs before `Ops.cpp`, we get a crash at program launch similar to #83255 . This happens in our internal build.

This PR moves `OpsImpl.cpp` to the end of `Ops.cpp`, because according to the omniscient lord of chatGPT:
<img width="600" alt="2022-12-04_19-25" src="https://user-images.githubusercontent.com/1381301/205542847-3535b319-3c2a-4e8e-bc11-27913f6afb39.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90149
Approved by: https://github.com/kwen2501, https://github.com/H-Huang, https://github.com/soumith
2022-12-12 08:21:54 +00:00
4ca2fc485c inductor(CPU): add Conv+binary+unary fusion filter (#90259)
For Conv+binary+unary fusion, we only support conv+add+relu; this PR adds such a check to fix failing TIMM models.
TODO: enable more Conv+binary+unary fusion to improve TIMM models' performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90259
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
2022-12-12 06:04:55 +00:00
c318de4274 [dynamo] Get GPU names without calling nvidia-smi (#90474)
Believe it or not, inductor can sometimes be used on machines that
have CUDA GPUs but no nvidia-smi.  Let's use torch APIs instead of subprocess.
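A minimal sketch of the torch-API approach (illustrative, not the PR's exact code):
```
import torch

def gpu_names():
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
```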

Differential Revision: [D41841930](https://our.internmc.facebook.com/intern/diff/D41841930/)

Differential Revision: [D41841930](https://our.internmc.facebook.com/intern/diff/D41841930)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90474
Approved by: https://github.com/voznesenskym, https://github.com/anijain2305
2022-12-12 05:31:50 +00:00
b95ea4f149 [pt2] Reset dynamo log level when exiting inductor debug context (#90473)
When entering an inductor debug context we increase the log level of
dynamo; I guess this makes sense, since if we're debugging inductor, and
inductor calls into dynamo, we probably want visibility into what dynamo is
doing.

But when we exit that context, we probably want to go back to whatever level of
dynamo-specific logging was in place before.  Dynamo generates lots of debug
info (guards, bytecode), and it's a lot to sift through if you're not
specifically interested in it.

Differential Revision: [D41841879](https://our.internmc.facebook.com/intern/diff/D41841879/)

Differential Revision: [D41841879](https://our.internmc.facebook.com/intern/diff/D41841879)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90473
Approved by: https://github.com/mlazos, https://github.com/jansel
2022-12-12 04:39:37 +00:00
d3d85e1c3b Emit torch.cuda.synchronize() after every kernel call in inductor (#90472)
Debugging illegal memory access is hard; even CUDA_LAUNCH_BLOCKING=1
and using C10_CUDA_KERNEL_LAUNCH_CHECK doesn't necessarily guarantee a stack
trace pointing to the right kernel.  This diff adds a config option to force a
CUDA synchronize after every kernel call in inductor, for debugging those tricky cases.
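The debugging idea in miniature (illustrative): synchronizing after every launch makes an asynchronous CUDA error surface at the offending call, rather than at some later, unrelated point.
```
import torch

def run_kernels(kernels, *args):
    for kernel in kernels:
        kernel(*args)
        torch.cuda.synchronize()  # an illegal memory access now raises here, right after the bad kernel
```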

Differential Revision: [D41744967](https://our.internmc.facebook.com/intern/diff/D41744967/)

Differential Revision: [D41744967](https://our.internmc.facebook.com/intern/diff/D41744967)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90472
Approved by: https://github.com/jansel
2022-12-12 04:35:10 +00:00
8fd31ac4da Preserve original GraphArgs for shape guard codegen (#90665)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90665
Approved by: https://github.com/voznesenskym
2022-12-12 02:35:23 +00:00
9447005ae3 Improve dynamo debug logging (#90664)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90664
Approved by: https://github.com/voznesenskym
2022-12-12 02:35:23 +00:00
450bd282e0 Slightly improve error messages on sympy failure (#90655)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90655
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
2022-12-12 01:58:34 +00:00
8127724c3b Skip some unittests (#90609)
* Skip a unittest that needs FFT if not built with FFT
* Mark a test with "slow": `python test/test_ops.py -k TestCompositeComplianceCUDA.test_forward_ad_svd_lowrank_cuda_float32` took >5min on my machine.
* Skip a flaky test that's marked "expectedFailure", similar to #90233
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90609
Approved by: https://github.com/soumith
2022-12-11 23:53:05 +00:00
11442accc6 Make torch._guards, shuffle structures around for migration (#90636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90636
Approved by: https://github.com/ezyang
2022-12-11 23:16:07 +00:00
e1ed5ad5a5 Add a timeout to benchmark script (#90634)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90634
Approved by: https://github.com/voznesenskym
2022-12-11 23:12:29 +00:00
5d8618dfbd Some memory saving in large unittests (#90148)
Two tests, test_large_cumsum and test_large_cumprod, use a lot of memory. This PR:
* Reduces their memory usage by avoiding `self.assertEqual` and a temporary Python variable
* Marks their memory requirement with a decorator.

related to https://github.com/pytorch/pytorch/issues/84944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90148
Approved by: https://github.com/soumith
2022-12-11 21:04:38 +00:00
995d39c221 [Fix]: Add some missing moves in 90442 (#90661)
@ezyang Noticed a couple of missing std::move for all the symints from #90442. Also I noticed a couple of helper functions didn't seem like they needed to take ownership.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90661
Approved by: https://github.com/ezyang
2022-12-11 20:23:40 +00:00
e33f1eeeb7 SymIntify resize_ and deduplicate memory format logic (#90442)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90442
Approved by: https://github.com/bdhirsh
2022-12-11 14:38:38 +00:00
181d37475d Simple fix: add missing positional arg in init_optimizer() call (#90641)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90641
Approved by: https://github.com/kit1980
2022-12-11 13:18:05 +00:00
15a4c60383 Revert "Make torch._guards, shuffle structures around for migration (#90636)"
This reverts commit 933b6c4eed675d33274d0bc1dfcb9d8446f412d8.

Reverted https://github.com/pytorch/pytorch/pull/90636 on behalf of https://github.com/huydhn due to Breaking lint on master. Please rebase and run lintrunner -a before re-merging the PR
2022-12-11 10:15:47 +00:00
7ec1cb8553 [FSDP] Fix _pre_forward type annotation (#90621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90621
Approved by: https://github.com/awgu, https://github.com/Skylion007
2022-12-11 06:39:38 +00:00
80542add73 [FSDP] Allow MixedPrecision to skip inputs (#90620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90620
Approved by: https://github.com/rohan-varma, https://github.com/awgu
2022-12-11 06:39:38 +00:00
31351c61dd [FSDP] Tighten post-bwd cast to reduce_dtype (#90615)
This lowers the `reduce_dtype` retrieval to the `handle` instead of the `state` in preparation for `fully_shard`, and this adds a guard to avoid a no-op `to()` call.

Note that this change pretty much gets overridden in following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90615
Approved by: https://github.com/rohan-varma
2022-12-11 06:39:34 +00:00
933b6c4eed Make torch._guards, shuffle structures around for migration (#90636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90636
Approved by: https://github.com/ezyang
2022-12-11 06:04:17 +00:00
c7d2fb7f86 Adopt state_dict_pre_hook in FSDP (#90436)
Use register_state_dict_pre_hook in FSDP to simplify state_dict implementations & remove hacks. This removes `def state_dict` entirely and paves the path for composable API as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90436
Approved by: https://github.com/fegin
2022-12-11 03:54:26 +00:00
746c773d7c [FSDP][Easy] Move to _storage() in test file (#90622)
This is to silence some deprecation warnings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90622
Approved by: https://github.com/rohan-varma
2022-12-11 03:50:30 +00:00
6845598617 [FSDP] Uncomment test for use_orig_params=True (#90610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90610
Approved by: https://github.com/rohan-varma
2022-12-11 03:50:23 +00:00
e7efeb5282 [FSDP] Save _stream_to_name for debugging (#90611)
This saves a data structure `_stream_to_name: Dict[torch.cuda.Stream, str]` that maps each FSDP stream to its name. This can help in debugging by checking `_stream_to_name[torch.cuda.current_stream()]` to see if it is `"default"` or `"unshard"` in the post-backward hook for example.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90611
Approved by: https://github.com/rohan-varma
2022-12-11 03:46:18 +00:00
184f6b5787 Fix perf bug in #90528 (#90630)
Fixes a minor perf bug I noticed in #90528; also a follow up to #89000. @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90630
Approved by: https://github.com/ezyang
2022-12-11 01:00:05 +00:00
9eccfedca2 [Reland][FSDP] Another fix for DTensor, use_orig_params=True (#90562)
This is a reland of https://github.com/pytorch/pytorch/pull/89845 with nothing changed. This should avoid the internal breakage now that `DTensor` does not import `torchgen` (https://github.com/pytorch/pytorch/pull/90106).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90562
Approved by: https://github.com/fduwjj
2022-12-10 22:50:30 +00:00
a69cdd9cf8 Add global registry to composable API contract (#90579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90579
Approved by: https://github.com/awgu, https://github.com/yhcharles
2022-12-10 22:41:10 +00:00
12671fe620 Reserve space for std::vector output in extract_tensors for nccl python bindings (#88203)
Optimizes the nccl python bindings to reserve space when converting PyObject* into Tensors. This should reduce the number of unnecessary allocations in the nccl bindings as the std::vector grows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88203
Approved by: https://github.com/ezyang
2022-12-10 20:28:19 +00:00
583d216c1a Fix: [ATen] add more missing moves - part 2 (#89000)
Applies some more missing std::move found by static analysis. This should improve performance and reduce unnecessary copies. This PR only targets ATen for now.

And before you ask about the edits, std::move is optimal in a ternary operator since copy elision cannot happen in one. The best thing would probably be rewriting it as an if/else, but ultimately this should be performant enough.
Followup to #88512 and #88514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89000
Approved by: https://github.com/ezyang
2022-12-10 20:13:45 +00:00
9ef1d55e6b Fix non-existing parameters in docstrings in torch/nn (#90596)
This is a continuation of https://github.com/pytorch/pytorch/pull/90505

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90596
Approved by: https://github.com/lezcano
2022-12-10 14:37:31 +00:00
45109ec30a Completely redo how ShapeEnv guards are generated (#90528)
Instead of inferring shape mappings from a bunch of data structures that were plumbed in InstructionTranslator, we instead work out mappings by just iterating over the GraphArgs and mapping symbols to arguments as they show up. If multiple argument sizes/strides/offset map to the same symbol, this means they are duck sized, so we also generate extra equality tests that they must be equal. Finally, we generate 0/1 specialization guards. The resulting code is much shorter, and I think also easier to understand.

TODO: Delete all the tensor ref tracking code, it's unnecessary

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90528
Approved by: https://github.com/voznesenskym
2022-12-10 13:35:04 +00:00
49c674e155 Revert guaranteed symint allocation (#90381)
So, uh, I have a new strategy for generating dupe guards, one where I don't actually need to allocate symints for every tensor that is fakeified. So I'm reverting the changes I made from earlier PRs in this one.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90381
Approved by: https://github.com/voznesenskym
2022-12-10 13:17:34 +00:00
b68dead20c Keep track of source name on all allocated SymInts (#90295)
Wow, I had to sweat so much to get this PR out lol.

This PR enforces the invariant that whenever we allocate SymInts as part of fakeification, the SymInt is associated with a Source, and in fact we store the string source name on SymbolWithSourceName. We use 'sname' as the shorthand for source name, as 'name' is already used by sympy to name symbols.

In order to store source names, we have to plumb source names from Dynamo to PyTorch. This made doing this PR a bit bone crushing, because there are many points in the Dynamo codebase where we are improperly converting intermediate tensors into fake tensors, where there is no source (and there cannot be, because it's a frickin' intermediate tensor). I've fixed all of the really awful cases in earlier PRs in the stack. This PR is just plumbing in source names from places where we do have it.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90295
Approved by: https://github.com/voznesenskym
2022-12-10 13:17:34 +00:00
f9aa099074 [Inductor] fix issue: redeclaration of float g_tmp_buffer_xxx (#90270)
This PR fixes the issue: redeclaration of 'float g_tmp_buffer_in_ptr1[16] = {0};'.
If a bool or uint8 tensor is used by multiple ops, this tensor will be loaded multiple times. On each load, the declaration of this variable is written, i.e., `self.loads.writeline(f"float {g_tmp_buf}[{nelements}] = {{0}};")`, which introduces a redeclaration error.

![image](https://user-images.githubusercontent.com/69951214/205869956-5c325761-dc09-4aa8-a9ed-fad7f4c85917.png)
![image](https://user-images.githubusercontent.com/69951214/205870695-ee252f17-8f54-484f-9b0a-3a424c479327.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90270
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2022-12-10 12:59:30 +00:00
5a665a39d1 [LTC] Make some LazyGraphExecutor private data structures protected (#90598)
Summary:
This pull request makes some LazyGraphExecutor private data structures protected such that XLAGraphExecutor can reuse them.

Here is the list:
1. DeviceLocker.
2. DeviceLockerArena.
3. DataCacheArena. In addition, it also introduces LazyGraphExecutor::ResetTrimCounter() such that XLAGraphExecutor can reuse the trim counter.

Test Plan:
CI.

P.S. This is to re-land #90457.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90598
Approved by: https://github.com/JackCaoG
2022-12-10 08:19:12 +00:00
ddf00c803b [torchgen] Introduce Executorch types and signatures (#90591)
Retry of #89595. Accidentally closed.

## Forked `BaseCppType`

Created a module for Executorch: `torchgen.executorch`.

In `torchgen.executorch.api.types.types`:
* Define `BaseCppType` with `torch::executor` namespace.

In `torchgen.executorch.api.et_cpp`:
* Help generate `NamedCType` for `ExecutorchCppSignature` arguments.

In `torchgen.executorch.api.types.signatures`:
* Define the signature using these types. (`ExecutorchCppSignature`)

In `torchgen.executorch.api.types.__init__`:
* Suppress flake8 error for `import *`.

Differential Revision: [D41501836](https://our.internmc.facebook.com/intern/diff/D41501836/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90591
Approved by: https://github.com/iseeyuan
2022-12-10 04:34:02 +00:00
de6beca838 [torchgen] Let native function declaration generation logic take a callable (#90590)
Retry of #89594. Accidentally closed.

This PR allows `get_native_function_declarations` API to take a function as argument. This function should take `NativeFunction` as input and emit code for native function declaration. By default it is `dest.compute_native_function_declaration`.

Differential Revision: [D41501838](https://our.internmc.facebook.com/intern/diff/D41501838/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90590
Approved by: https://github.com/iseeyuan
2022-12-10 04:34:02 +00:00
453ff96029 [torchgen] Refactor types (#90589)
A retry of #89487. Accidentally closed.

## Split `torchgen.api.types` into `types_base`, `types` and `signatures`.

In `types_base`:
* Created base class `CType`. `BaseCType` and `ConstRefCType` etc are inheriting `CType`.
* Only keep abstract type model definitions, such as `BaseCppType`.

In `types`:
* Define `BaseCppType` with `at` and `c10` namespaces.
* All the signatures using these types.

In `signatures`:
* Define all the signatures.

In `__init__`:
* `from ... import *`, suppress flake8 error.

Differential Revision: [D41455634](https://our.internmc.facebook.com/intern/diff/D41455634/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41455634/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90589
Approved by: https://github.com/iseeyuan
2022-12-10 04:34:00 +00:00
0457020d2c [dims] Fix large array inputs (#88596)
Variable length arguments can overflow the arena being used to keep overhead
low for torch dims. If we hit this case, we know the amount of work being done
is already relatively big, so we just fall back to standard memory allocation.

Fixes #88586
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88596
Approved by: https://github.com/ezyang
2022-12-10 03:49:16 +00:00
bb9fc32fe0 [vision hash update] update the pinned vision hash (#90586)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90586
Approved by: https://github.com/pytorchbot
2022-12-10 03:22:35 +00:00
d3a3604581 [pthreadpool] Don't recreate threadpool if the counts are same (#90478)
Summary: Don't do anything if the incoming count and the current threadpool size are the same

Test Plan: CI

Reviewed By: salilsdesai

Differential Revision: D41628132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90478
Approved by: https://github.com/salilsdesai
2022-12-10 03:17:08 +00:00
3b3ed25109 Add a way to visualize memory snapshot traces (#90348)
This adds a d3-based interactive visualization for exploring the memory
allocation traces that the caching allocator can capture. This visualization
code can also be attached to kineto trace information in the future to also
provide visualization for the memory events captured there, which come with
additional information about the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90348
Approved by: https://github.com/robieta
2022-12-10 02:45:11 +00:00
2bac4d1fae [reland] add save and load stats in memory_tracker (#90510)
reland https://github.com/pytorch/pytorch/pull/90144, this PR removed temporary path "memory.trace" in the unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90510
Approved by: https://github.com/rohan-varma
2022-12-10 01:39:22 +00:00
1b2c59ad24 [ONNX] Introduce ONNX reference evaluator for verification (#89808)
Reference evaluator requires ONNX >= 1.13. Running in CI is blocked by unable to bump onnx submodule version, like in #83201. Local tests pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89808
Approved by: https://github.com/justinchuby
2022-12-10 01:29:12 +00:00
7afba50508 [dtensor] delete unused torch_function (#90449)
torch_function is not actually being used today; delete it
first and we can revisit once we really need it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90449
Approved by: https://github.com/fduwjj
2022-12-10 01:29:02 +00:00
45b64e8c61 Populate Canonical Aten Ops (Batch 2) (#90456)
acos
argmax
argmin
acosh
asinh
atanh
asin
atan
logical_not
logical_and
logical_or
cos
cosh
empty_strided
full
isnan
sin
sinh
scatter_reduce.two
bitwise_xor.Tensor
sign
fmod.Tensor
remainder.Tensor
pow.Tensor_Tensor
is_inf
ne.Scalar
ne.Tensor
eq.Tensor
ge.Tensor
le.Tensor
gt.Tensor
lt.Tensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90456
Approved by: https://github.com/ezyang
2022-12-10 00:27:37 +00:00
79f9672249 [ONNX] Use VerificationOptions to wrap option arguments (#89807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89807
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2022-12-09 23:49:51 +00:00
6de216a2e8 [fx] Have replace_pattern return replaced nodes (#90244)
Summary: Modified replace_pattern in the subgraph rewriter to return a list of pairs of matches along with their corresponding replacement nodes in the modified graph (`List[Tuple[Match, List[Node]]]`). This allows us to easily modify the replaced nodes, including setting the metadata.

Test Plan: CI

Differential Revision: D41737056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90244
Approved by: https://github.com/SherlockNoMad
2022-12-09 23:43:16 +00:00
4a1633ca69 [Inductor] GEMM Shape Padding Optimization (#90425)
Summary:
Optimize the shape padding in the following respects:
- Add BFloat16 support for AMP training and Float16 support for inference
- Optimize the microbenchmark to avoid peak memory issues, and include profiling of memory ops to make a more accurate decision
- Add a flag to turn off/on padding dims N and M in `torch.bmm` due to expensive memory copy of `.contiguous` to avoid peak memory issues in internal models

Test Plan: CI

Differential Revision: D41724868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90425
Approved by: https://github.com/jianyuh
2022-12-09 22:48:02 +00:00
b7dfbf876f Revert "[LTC] Make some LazyGraphExecutor private data structures protected (#90457)"
This reverts commit 93aa6e3e36c022a01076d84047acd58b59244348.

Reverted https://github.com/pytorch/pytorch/pull/90457 on behalf of https://github.com/clee2000 due to broke xla somehow 93aa6e3e36 https://github.com/pytorch/pytorch/actions/runs/3659842773/jobs/6186552659
2022-12-09 22:28:24 +00:00
02eb0bdbc1 [fx] Added better tests to pass infra (#90432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90432
Approved by: https://github.com/SherlockNoMad
2022-12-09 21:43:18 +00:00
f51f6aa387 Fix non-existing parameters in docstrings (#90505)
Continuation after https://github.com/pytorch/pytorch/pull/90163.

Here is a script I used to find all the non-existing arguments in the docstrings (the script can give false positives in presence of *args/**kwargs or decorators):

_Edit:_
I've realized that the indentation is wrong for the last `break` in the script, so the script only gives output for a function if the first docstring argument is wrong. I'll create a separate PR if I find more issues with the corrected script.

``` python
import ast
import os
import docstring_parser

for root, dirs, files in os.walk('.'):
    for name in files:
        if root.startswith("./.git/") or root.startswith("./third_party/"):
            continue
        if name.endswith(".py"):
            full_name = os.path.join(root, name)
            with open(full_name, "r") as source:
                tree = ast.parse(source.read())
                for node in ast.walk(tree):
                    if isinstance(node, ast.FunctionDef):
                        all_node_args = node.args.args
                        if node.args.vararg is not None:
                            all_node_args.append(node.args.vararg)
                        if node.args.kwarg is not None:
                            all_node_args.append(node.args.kwarg)
                        if node.args.posonlyargs is not None:
                            all_node_args.extend(node.args.posonlyargs)
                        if node.args.kwonlyargs is not None:
                            all_node_args.extend(node.args.kwonlyargs)
                        args = [a.arg for a in all_node_args]
                        docstring = docstring_parser.parse(ast.get_docstring(node))
                        doc_args = [a.arg_name for a in docstring.params]
                        clean_doc_args = []
                        for a in doc_args:
                            clean_a = ""
                            for c in a.split()[0]:
                                if c.isalnum() or c == '_':
                                    clean_a += c
                            if clean_a:
                                clean_doc_args.append(clean_a)
                        doc_args = clean_doc_args
                        for a in doc_args:
                            if a not in args:
                                print(full_name, node.lineno, args, doc_args)
                            break

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90505
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2022-12-09 21:43:09 +00:00
fd3f5d7bf7 [inductor] Update TIMM skip list (#90188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90188
Approved by: https://github.com/anijain2305
2022-12-09 21:30:23 +00:00
1a735a8094 [FSDP] Subtest CPUOffload for test_fsdp_grad_acc.py (#90545)
In preparation for the next PR, I wanted to reduce the time to run these gradient accumulation tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90545
Approved by: https://github.com/mrshenli
2022-12-09 21:28:27 +00:00
912748e3b7 [SDP] Fix alignment check for efficient_attention (#90413)
Fixes a bug found using head_dim_size==100 on an A100 GPU. This PR contains stricter guards on the input shape. These constraints are taken from xformers: https://github.com/facebookresearch/xformers/blob/gh/danthe3rd/60/orig/xformers/ops/fmha/cutlass.py#L23
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90413
Approved by: https://github.com/mikekgfb
2022-12-09 21:09:25 +00:00
669f7461ac Use some if constexpr in the code (#90483)
As PyTorch is a C++17 project now, replace `c10::guts::if_constexpr` with `if constexpr`

Deliberately delaying changes in headers until at least one nightly
cycle is complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90483
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2022-12-09 20:41:50 +00:00
d91d7a3221 [reland][dynamo] use optimizers correctly in benchmarking (#87492)
Reland https://github.com/pytorch/pytorch/pull/87311

mlazos: updated to use SGD to not add a bunch of additional memory allocations (like Adam)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87492
Approved by: https://github.com/desertfire
2022-12-09 20:32:53 +00:00
9c4189f82d [dynamo] Add is_compiling for dynamo (#90329)
`is_compiling` returns True during dynamo tracing and False when run in eager mode
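An illustrative use of the predicate (assuming it is exposed as `torch._dynamo.is_compiling`):
```
import torch

def maybe_log(x):
    if not torch._dynamo.is_compiling():
        print("eager value:", x.item())  # eager-only side effect, skipped during tracing
    return x
```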

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90329
Approved by: https://github.com/jansel
2022-12-09 20:19:41 +00:00
082450609c [FSDP] Allow nested FSDP wrapper to use different mixed precision (#90523)
The main change is to move `args` and `kwargs` dtype conversion
from `_root_pre_forward` to `_pre_forward`, so that every
FSDP has a chance to apply its own precision.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90523
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2022-12-09 20:06:05 +00:00
eedf7a4989 Log1p complex for CUDA (#90422)
Another pull request in the direction of solving #89205: log1p for complex numbers in CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90422
Approved by: https://github.com/lezcano
2022-12-09 19:53:22 +00:00
b2795d3c4e Revert "[inductor] New approach for computing triton load/store masks (#89566)"
This reverts commit c6c2de586d7f6ecd6a3eb5139870824f33a1f916.

Reverted https://github.com/pytorch/pytorch/pull/89566 on behalf of https://github.com/clee2000 due to broke test_invalid_operand_issue1_cuda in inductor/test_torchinductor on https://github.com/pytorch/pytorch/actions/runs/3657444733/jobs/6181700572
2022-12-09 19:36:25 +00:00
4e1881b8b7 use proper temp directories in test_tensorboard.py (#89826)
The old `temp_dir` is created under `PWD`. But `PWD` may not be writable and in general is not a good place to create temporary directories. Use the standard `tempfile` instead.
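The standard-library alternative in a nutshell (a sketch of the idea, not the test's exact code):
```
import tempfile

with tempfile.TemporaryDirectory() as temp_dir:
    ...  # write TensorBoard event files under temp_dir; it is removed automatically on exit
```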
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89826
Approved by: https://github.com/soumith
2022-12-09 19:33:03 +00:00
09ccda0d94 Fix: Make __len__ of datapipes dynamic (#88302)
Fixes #88074

Several datapipes have their lengths cached on being executed for the first time. However, source datapipes might change in length (most prominently, whenever `apply_sharding` is called). The behaviour is counter-intuitive because we do not expect `__len__` to have side-effects.

This PR makes `__len__` dynamically computed.

Changes:
- Add note to the `datapipes` README that `__len__` should be dynamic and why.
- Remove caching of length computations in `ConcaterIterDataPipe`, `MultiplexerIterDataPipe`, `ZipperIterDataPipe`, `BatcherIterDataPipe`, `ConcaterMapDataPipe`, and `BatcherMapDataPipe`.
- This required removal of the `length` attribute in setstate/getstate of `MultiplexerIterDataPipe`. I am unsure whether to remove this completely and risk breaking saved checkpoints (as I did) or whether to just ignore the `length` of the loaded `state`.
- This also means the classes above no longer have a `length` attribute. I have found no uses of this, though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88302
Approved by: https://github.com/NivekT
2022-12-09 19:15:53 +00:00
93aa6e3e36 [LTC] Make some LazyGraphExecutor private data structures protected (#90457)
Summary:
This pull request makes some LazyGraphExecutor private data structures protected such that XLAGraphExecutor can reuse them.

Here is the list:
1. DeviceLocker.
2. DeviceLockerArena.
3. DataCacheArena.

In addition, it also introduces LazyGraphExecutor::ResetTrimCounter() such that XLAGraphExecutor can reuse the trim counter.

Test Plan:
CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90457
Approved by: https://github.com/JackCaoG
2022-12-09 18:28:13 +00:00
bcf7036be5 Disable BUILD_CAFFE2 from ONNX builds (#90475)
Fixes https://github.com/microsoft/onnx-converters-private/issues/132

@kit1980 and @malfet agreed to disable ONNX tests for Caffe2 builds.
With this change, exporting models with `operator+export_type=ONNX_ATEN_FALLBACK` will properly test non-caffe2 builds, which is the only scenario for aten fallback after caffe2 deprecation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90475
Approved by: https://github.com/kit1980, https://github.com/BowenBao
2022-12-09 18:02:48 +00:00
730e44bbc7 Add logging for aot autograd and unified debug flag (#88987)
- Adds `log_level` to aot's config
- Outputs log to `<graph_name>_<log_level>.log` in aot_torchinductor subfolder of the debug directory
- Modifies the Inductor debug context to use the graph name when naming the folder instead of the os pid
- Adds `TORCH_COMPILE_DEBUG` flag to enable it, (as well as separate dynamo and inductor logs)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88987
Approved by: https://github.com/Chillee
2022-12-09 17:28:10 +00:00
983d4f6fbb [Vulkan] Enable QInt8 weights and test quantized convolution with QInt8 weights and QInt32 bias (#90441)
Summary:
- Enable convolution with QInt8 weights
- Modify test_quantized_conv2d function to allow testing with QInt8 weights and QInt32 bias.
- Added multiple tests for regular, depthwise and pointwise convolution with QInt8 weights and QInt32 bias.

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Reviewed By: kimishpatel

Differential Revision: D41562053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90441
Approved by: https://github.com/kimishpatel
2022-12-09 17:08:48 +00:00
282dfe8ba4 [inductor][Reland] Use decomposition for _to_copy (#90494)
Summary: also contains a fix for https://github.com/pytorch/pytorch/issues/89633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90494
Approved by: https://github.com/ngimel
2022-12-09 16:51:50 +00:00
6581063583 Revert "Dynamo, FX, Inductor Progress Bars (#88384)"
This reverts commit db0ce4acf3c84d54e468154ead6d773539a2b597.

Reverted https://github.com/pytorch/pytorch/pull/88384 on behalf of https://github.com/malfet due to Broke test_public_bindings across the board
2022-12-09 16:32:25 +00:00
eeb3f8aa54 Add missing infer_size_symdimvector implementation. (#90405)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90405
Approved by: https://github.com/voznesenskym
2022-12-09 14:02:53 +00:00
c6c2de586d [inductor] New approach for computing triton load/store masks (#89566)
This PR changes the way masks for loads/stores are computed in triton backend of inductor.

New approach is to iterate over all variables used in indexing expression and add the corresponding mask variables to the set that will be used. For indexing variables like `x0`, `y1` and  `r3` it adds `xmask`, `ymask` and `rmask` respectively.
For indexing variables like `tmp5` (i.e., indirect indexing), it uses the new `mask_vars` attribute of the corresponding `TritonCSEVariable` object, which is populated when variable is created.

I started working on this with the aim of fixing https://github.com/pytorch/torchdynamo/issues/1654, which meanwhile was fixed by #89524 with a different approach, making this change less necessary. However note that #89524 fixes the issue by broadcasting the indices that are being loaded to a larger size, while this approach fixes it by making the mask have only the necessary terms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89566
Approved by: https://github.com/jansel, https://github.com/ngimel
2022-12-09 12:43:19 +00:00
c8954a8907 simplify implementation of c10::isIntegralType (#90193)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/90193).
* __->__ #90193

simplify implementation of c10::isIntegralType

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90193
Approved by: https://github.com/ezyang
2022-12-09 12:22:06 +00:00
6b7efac3c9 Reland "Add heirachical module names to torchFX graph.node" (#90205)
Fixes #87659

Reland of PR #87742

Resolves errors that caused the changes to be backed out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90205
Approved by: https://github.com/jerryzh168
2022-12-09 06:20:31 +00:00
0a00858095 Implement checks for vmap escaped errors (#89585)
Follow on to https://github.com/pytorch/pytorch/pull/89077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89585
Approved by: https://github.com/zou3519
2022-12-09 05:58:07 +00:00
c71b12851d [ao] public vs private for ao.quantization._X (#88392)
Summary: added all for these modules without altering names since they
tend to be experimental

Test Plan: python test/test_public_bindings.py

Differential Revision: [D41015543](https://our.internmc.facebook.com/intern/diff/D41015543)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88392
Approved by: https://github.com/jcaip
2022-12-09 05:39:29 +00:00
6050a7a3d9 [ao] backend_config moving all to top (#88391)
Summary: moved __all__ to top of functions, removed private funcitons
from all

Test Plan: python test/test_public_bindings.py

Differential Revision: [D41015538](https://our.internmc.facebook.com/intern/diff/D41015538)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88391
Approved by: https://github.com/jcaip
2022-12-09 05:39:29 +00:00
3759777edc [threaded PG] fix long hang issue in testing (#90515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90515
Approved by: https://github.com/wanchaol
2022-12-09 05:24:08 +00:00
db0ce4acf3 Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars each gated behind their own config, all off by default for now
1. Dynamo: Macro level config for dynamo, AOT, inductor
2. FX: Progress bar for each pass, with their names
3. Inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos
2022-12-09 04:32:31 +00:00
b4c27c86b7 [vision hash update] update the pinned vision hash (#90513)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90513
Approved by: https://github.com/pytorchbot
2022-12-09 03:46:40 +00:00
aacafd2cba Fixed a couple of mistakes in type annotations in optim package (#90216)
Doing some tests with all Optimizer and LRScheduler classes in optim package, I noticed a couple of mistakes in type annotations, so created a pull request to fix them.

- In Optimizer class, incorrectly named parameter `default` instead of `defaults` in pyi file
- In SGD class, type for `maximize` and `differentiable` not available in either py or pyi files

I don't know if there is a plan to move all types from pyi to py files, so wasn't too sure where to fix what.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90216
Approved by: https://github.com/janeyx99
2022-12-09 03:20:21 +00:00
78da18345e [ONNX] Extend PR approver list (#90490)
Extending the list of ONNX exporter-related PR approvers. All have a long track record of contributions to PyTorch/ONNX.

@justinchuby - https://github.com/pytorch/pytorch/pulls?q=author%3Ajustinchuby
@shubhambhokare1 - https://github.com/pytorch/pytorch/pulls?q=author%3Ashubhambhokare1
@thiagocrepaldi - https://github.com/pytorch/pytorch/pulls?q=author%3Athiagocrepaldi
@titaiwangms - https://github.com/pytorch/pytorch/pulls?q=author%3Atitaiwangms
@wschin - https://github.com/pytorch/pytorch/pulls?q=author%3Awschin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90490
Approved by: https://github.com/thiagocrepaldi, https://github.com/malfet
2022-12-09 03:08:15 +00:00
797544f1c4 [dynamo][ez] Change module type to str for easier downstream parsing (#90429)
Summary:
As titled.

Test Plan:
N/A

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90429
Approved by: https://github.com/SherlockNoMad
2022-12-09 02:00:18 +00:00
f978a8b026 [quant][be] Remove special casing for getitem in prepare (#90393)
Summary:
This PR cleans up previous special casing for getitem, it should be configured through BackendConfig

Test Plan:
python test/test_quantization.py TestQuantizeFx

Differential Revision: [D41846185](https://our.internmc.facebook.com/intern/diff/D41846185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90393
Approved by: https://github.com/andrewor14
2022-12-09 01:59:02 +00:00
6fb79b7004 Bump version: 1.14.0->2.0.0 (#90491)
Besides the usual location, the version had to be updated in one of the ONNX expect patterns, namely here: 43660051d8/test/onnx/expect/TestOperators.test_avg_pool2d.expect (L3)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90491
Approved by: https://github.com/jansel, https://github.com/albanD
2022-12-09 01:08:08 +00:00
ff5a3592e7 Fix static initialization issue for static build (#90133)
Fixes #83255

Code comes from #83258 after fixing merge conflicts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90133
Approved by: https://github.com/soumith, https://github.com/malfet
2022-12-09 01:01:15 +00:00
c8f5c194ca Fix bug in dynamic shapes multiply (#90336)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90336
Approved by: https://github.com/ezyang
2022-12-09 00:59:50 +00:00
2cf703214b [Composable API][Easy] Fix some follow-ups (#90471)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90471
Approved by: https://github.com/mrshenli
2022-12-09 00:26:38 +00:00
eb5b4c21e1 Deepcopy GraphModule in minifier (#90401)
Fixes https://github.com/pytorch/pytorch/issues/90397. Remove deepcopy calls in minifier tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90401
Approved by: https://github.com/anijain2305, https://github.com/mlazos
2022-12-08 23:59:05 +00:00
80150788bc [21/N] Add alltoall_base custom op with CPU/CUDA implementations (#89813)
Differential Revision: [D41812670](https://our.internmc.facebook.com/intern/diff/D41812670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89813
Approved by: https://github.com/kwen2501
2022-12-08 23:39:26 +00:00
e65ee3975f [20/N] Add recv_any_source custom op with CPU/CUDA implementations (#89505)
Differential Revision: [D41812671](https://our.internmc.facebook.com/intern/diff/D41812671)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89505
Approved by: https://github.com/kwen2501
2022-12-08 23:39:26 +00:00
43660051d8 [Ez] Omit HSDP Z2 from doc (#90503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90503
Approved by: https://github.com/awgu
2022-12-08 23:05:49 +00:00
912a1f7b27 Fix issue 38095 TODOs in test_quantized_tensor.py (#90344)
Fix TODOs related to https://github.com/pytorch/pytorch/issues/38095
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90344
Approved by: https://github.com/malfet
2022-12-08 22:28:15 +00:00
fec39f6310 Don't update vision hash on push (#90498)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90498
Approved by: https://github.com/malfet, https://github.com/seemethere
2022-12-08 22:03:24 +00:00
9bb16cd3ca Track torch.compile calls (#90310)
Title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90310
Approved by: https://github.com/colin2328, https://github.com/anijain2305
2022-12-08 21:41:15 +00:00
76f440f20a [dynamo] Rewrite inplace addcdiv and inplace add (#90330)
Rewrite in-place addcdiv into a div, a mul, and an in-place add to avoid a graph break.
Rewrite in-place add into a mul and an in-place add to avoid a graph break.

Needed to close optimizer graph breaks
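
A minimal sketch of the equivalent arithmetic, assuming an addcdiv-style parameter update (illustrative only, not the actual dynamo variable-tracker code):
```
import torch

p = torch.zeros(4)
grad, denom, value = torch.ones(4), torch.full((4,), 2.0), 0.5

# original form that used to cause a graph break under dynamo
p_ref = p.clone()
p_ref.addcdiv_(grad, denom, value=value)

# rewritten form: an out-of-place div and mul followed by a single in-place add
p.add_(grad / denom * value)

assert torch.allclose(p, p_ref)
```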

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90330
Approved by: https://github.com/jansel
2022-12-08 21:19:23 +00:00
0c972fb5c7 [rfc][pkg] check spec for module source before falling back to file in package exporter (#90258)
Summary: To get source for a particular module, the "correct" thing to do is to check the module's spec and use `get_source` if it's a SourceFileLoader, since subclasses may look elsewhere than the `__file__`, and the spec will give the source of truth. For torch packager, however, we prefer to use linecache, but the loader could still change the file, so we figure out the file for the module using the spec's loader rather than using `module.__file__`, if possible.
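
A rough illustration of the spec-first lookup described above (a simplified sketch; the actual packager code handles more cases and still reads the text through linecache):
```
import importlib
import linecache
from importlib.machinery import SourceFileLoader

def source_file_for(module):
    # prefer the spec's loader: SourceFileLoader subclasses may resolve source
    # somewhere other than module.__file__
    spec = getattr(module, "__spec__", None)
    if spec is not None and isinstance(spec.loader, SourceFileLoader):
        return spec.loader.get_filename(spec.name)
    return getattr(module, "__file__", None)

mod = importlib.import_module("json")
filename = source_file_for(mod)
source = "".join(linecache.getlines(filename)) if filename else None
```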

Test Plan: This code path will get exercised by CI. Also added a test for remapped files.

Differential Revision: D41412983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90258
Approved by: https://github.com/PaliC
2022-12-08 20:24:45 +00:00
e1674d7dc0 avoid fork in torch/__init__.py for deploy/multipy (#90492)
Summary:
We should not fork in deploy when initializing torch.

    Traceback (most recent call last):
    File "<string>", line 38, in <module>
    File "<string>", line 36, in __run
    File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
    File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
    File "/data/users/zyan/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/multipy/runtime/__test_py__/test_py#link-tree/multipy/runtime/test_py.py", line 61, in <module>
        import torch # has to be done serially otherwise things will segfault
    File "/data/users/zyan/fbsource/buck-out/v2/gen/fbcode/104a4d5c3a690252/multipy/runtime/__test_py__/test_py#link-tree/torch/__init__.py", line 158, in <module>
        platform.system() != 'Windows':
    File "/usr/local/fbcode/platform010/lib/python3.8/platform.py", line 891, in system
        return uname().system
    File "/usr/local/fbcode/platform010/lib/python3.8/platform.py", line 857, in uname
        processor = _syscmd_uname('-p', '')
    File "/usr/local/fbcode/platform010/lib/python3.8/platform.py", line 613, in _syscmd_uname
        output = subprocess.check_output(('uname', option),
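
One fork-free way to make the same kind of check, shown only to illustrate the problem (the actual change in this PR may differ):
```
import sys

# platform.system() can end up running `uname` in a subprocess (see the traceback
# above); sys.platform is a plain string and never forks
if sys.platform != "win32":
    pass  # non-Windows-only initialization would go here
```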

Test Plan: override a local script run trigger init and set `subprocess.check_output` to None

Reviewed By: yinghai, houseroad

Differential Revision: D41848592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90492
Approved by: https://github.com/PaliC
2022-12-08 20:22:01 +00:00
b651e06049 Add Pointwise Tag from pointwise set in DTensor, use in aot_autograd partitioner (#90029)
Takes the pointwise op list from [DTensor](https://github.com/pytorch/pytorch/blob/master/torch/distributed/_tensor/ops/pointwise_ops.py#L36) as an initially starting point for pointwise ops, and feeds them to the aot autograd partitioner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90029
Approved by: https://github.com/ezyang
2022-12-08 20:21:17 +00:00
8ca1c910fb Refactor test_inductor_XXX to reduce code duplication (#90443)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90443
Approved by: https://github.com/desertfire
2022-12-08 19:58:58 +00:00
7342251281 functorch.grad support for autograd.Function (#89860)
Happy to split this PR more if it helps.

This PR adds functorch.grad support for autograd.Function. There's a lot
going on; here is the high level picture and there are more details as
comments in the code.

Mechanism (PyOperator)
- Somehow, autograd.Function needs to dispatch with functorch. This is
necessary because every layer of functorch needs to see the
autograd.Function; grad layers need to preserve the backward pass.
- The mechanism for this is via PyOperator. If functorch transforms are
active, then we wrap the autograd.Function in a `custom_function_call`
PyOperator where we are able to define various rules for functorch
transforms.
- `custom_function_call` has a rule for the functorch grad transform.

autograd.Function changes
- I needed to make some changes to autograd.Function to make this work.
- First, this PR splits autograd.Function into a _SingleLevelFunction
(that works with a single level of functorch transform) and
autograd.Function (which works with multiple levels). This is necessary
because functorch's grad rule needs some way of specifying a backward
pass for that level only.
- This PR changes autograd.Function's apply to either call
`custom_function_call` (if functorch is active) or super().apply (if
functorch isn't active).

Testing
- Most of this PR is just testing. It creates an autograd.Function
OpInfo database that then gets passed to the functorch grad-based tests
(grad, vjp, vjpvjp).
- Since functorch transform tests are autogenerated from OpInfo tests,
this is the easiest way to test various autograd.Function with
functorch.

Future
- jvp and vmap support coming next
- better error messages (functorch only supports autograd.Function subclasses that
have the optional setup_context staticmethod)
- documentation to come when we remove the feature flag

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89860
Approved by: https://github.com/soulitzer
2022-12-08 19:31:04 +00:00
eb314f9b1a Add setup_context staticmethod to autograd.Function (#89859)
Adds a setup_context staticmethod to autograd.Function.
If it exists, then the user splits the ctx-specific logic from the
forward() and puts it in the setup_context staticmethod.

Docs will come later when we remove the feature flag.
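
A small sketch of the split, assuming the feature flag is enabled (the exact API may have changed since this commit):
```
import torch

class MyCube(torch.autograd.Function):
    @staticmethod
    def forward(x):
        # no ctx here; forward only computes the output
        return x ** 3

    @staticmethod
    def setup_context(ctx, inputs, output):
        # all ctx-specific logic lives here instead of in forward()
        x, = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return grad_output * 3 * x ** 2

x = torch.randn(3, requires_grad=True)
MyCube.apply(x).sum().backward()
```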

Test Plan:
- some light tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89859
Approved by: https://github.com/soulitzer
2022-12-08 19:31:04 +00:00
103be1f164 Add feature flag for the autograd.Function extension (#89858)
This PR adds a private runtime feature flag for the feature work we're going
to do with extending autograd.Function. The motivation of the feature flag
is:
- to guard the feature against unsuspecting users
- control the release of the feature to when we are ready to release it

We might not even need the feature flag (because we hope to have the
work done in the next month), but it is good practice and it does touch
currently public API (autograd.Function).

Concretely, "autograd.Function extension" refers to:
- adding an optional `setup_context` staticmethod to autograd.Function
- adding an optional `vmap` staticmethod to autograd.Function
- autograd.Function support for functorch

Test Plan:
- new test that the feature flag works
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89858
Approved by: https://github.com/soulitzer
2022-12-08 19:31:01 +00:00
1ba5c55992 skip flaky tests (rather than expectedFailure) (#90233)
They are flaky but don't always fail. So `expectedFailure` is incorrect.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90233
Approved by: https://github.com/mruberry, https://github.com/soumith
2022-12-08 18:29:11 +00:00
e89685b0b5 Revert "[inductor] Use decomposition for _to_copy (#90314)"
This reverts commit 3fdb5f2dda7164f6282e80c39799843527d135e7.

Reverted https://github.com/pytorch/pytorch/pull/90314 on behalf of https://github.com/desertfire due to regresses performance on hf_Bert
2022-12-08 18:29:06 +00:00
b738da8c8e [LTC] Tweak LazyTensor Class for XLATensor (#90363)
Summary:
This pull request makes some tweaks to the LazyTensor class so that it is easier for XLATensor to inherit from.

1. It replaces data_ptr() with data() which now returns a const shared_ptr& type.
2. It adds a temporary ctor to LazyTensor::Data such that XLATensor::Data can easily inherit it.
3. It moves LazyTensor(std::shared_ptr<Data>) and SetTensorData(at::Tensor) to protected for XLATensor to access.

Test Plan:
CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90363
Approved by: https://github.com/JackCaoG
2022-12-08 18:23:17 +00:00
b71c710db1 Add additional tests for view slice tensors (#86282)
Fixes https://github.com/pytorch/pytorch/issues/83995 and https://github.com/pytorch/pytorch/issues/84489

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86282
Approved by: https://github.com/kulinseth
2022-12-08 17:59:55 +00:00
465005c1e0 Revert "Fix issue 38095 TODO in test_multiprocessing.py (#90335)"
This reverts commit cbb2d5af81dcfaf181db7e9083b9c41b29fdb4eb.

Reverted https://github.com/pytorch/pytorch/pull/90335 on behalf of https://github.com/clee2000 due to somehow caused test_multiprocessing to timeout cbb2d5af81 https://github.com/pytorch/pytorch/actions/runs/3645873711/jobs/6159998523
2022-12-08 17:12:10 +00:00
8ea90d926f Add support to foreach torch empty for bfloat16s (#90437)
# Summary
When training a model with SGD(..., foreach=True), we found that a bfloat16 model errored due to missing CUDA support for `empty`.
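
A minimal repro along the lines described above (requires a CUDA device; illustrative only):
```
import torch

model = torch.nn.Linear(8, 8, device="cuda", dtype=torch.bfloat16)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, foreach=True)

out = model(torch.randn(4, 8, device="cuda", dtype=torch.bfloat16))
out.sum().backward()
opt.step()  # previously hit the missing bfloat16 support in the foreach path
```
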
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90437
Approved by: https://github.com/soumith
2022-12-08 17:02:06 +00:00
d2ee94231e [inductor] Fallback for index with None in the middle of indices (#90022)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90022
Approved by: https://github.com/ngimel
2022-12-08 16:18:57 +00:00
b62cfbca84 Remove TORCH_API from inline at::internal::lazy_init_num_thread (#89511)
The function signature in its current state is ambiguous.
It's an inline function that is also declared to be imported from the DLL,
which leaves it subject to the compiler's decision to choose one or the other. Depending on what the compiler/linker chooses, we may get one of two behaviors for the `aten::init_num_threads` call:

1. Once-per-dll-in-a-thread (if it's inlined)
2. Once-per-thread (if it's imported)

I suspect once-per-dll-in-a-thread is already the case currently because it is tagged inline,
so removing the inline will simply make it a little more consistent and clear.

The function exists to avoid repeated calls to aten::init_num_threads.
Being in an "internal" namespace, the function isn't expected to be called by external plugins, which means that the "once-per-dll-in-a-thread" behavior isn't much of a problem anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89511
Approved by: https://github.com/malfet
2022-12-08 16:18:38 +00:00
793a999ce0 Hybrid Sharded Data Parallel (#89915)
Adds 2 new hybrid sharding strategies to FSDP:
1. HYBRID_SHARD: applies ZeRO-3-style sharding within a node and data parallelism across nodes
2. HYBRID_SHARD_ZERO2: applies ZeRO-2-style sharding within a node and data parallelism across nodes

These are useful for medium-sized models and aim to decrease communication volume; tests and benchmarks will be run to understand which workloads are optimal under which sharding strategy.

Hybrid sharding in general works by sharding the model using a process group within a single node and creating inter-node process groups for replication / data parallelism. The user either passes in a tuple of these process groups or None, in which case we generate the process groups appropriately.
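
A usage sketch, assuming torch.distributed is already initialized with one GPU per rank (process-group construction is elided and left to FSDP):
```
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

model = torch.nn.Linear(1024, 1024).cuda()
# with process_group left as None, FSDP derives the intra-node sharding group
# and the inter-node replication group automatically, as described above
fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD)
```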

** Acknowledgements **
- @awgu 's excellent prototype: 5ad3a16d48
- @liangluofb For ideation, feedback, and initial implementation and experimentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89915
Approved by: https://github.com/awgu
2022-12-08 16:18:03 +00:00
454361435c Implement correction argument in torch.masked.{std,var} (#87118)
This makes the signature of `torch.masked.std` and `var` more consistent with the global namespace variant and also updates the sample inputs to repurpose the existing `sample_inputs_std_var` inputs which fully exercise the `correction` argument.
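
For example (a sketch; keyword names mirror torch.std, and the exact masked signature is the one defined by this PR):
```
import torch

x = torch.randn(3, 4)
mask = torch.rand(3, 4) > 0.5

# correction=0 requests the biased (population) estimate over the unmasked elements
out = torch.masked.std(x, dim=1, correction=0, mask=mask)
```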

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87118
Approved by: https://github.com/cpuhrsch
2022-12-08 15:59:09 +00:00
a6593d6622 [Composable API][Easy] Use policy=None since that is supported (#90400)
I believe that @mrshenli used `ModuleWrapPolicy({UnitModule})` when applying `fully_shard` to `UnitModule`s because `policy=None` was not supported. However, he added that support in a previous PR, so this PR simplifies to using `policy=None` to make the intention more clear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90400
Approved by: https://github.com/mrshenli
2022-12-08 15:55:20 +00:00
21a0e809c2 [Composable API] Match fully_shard() comm. schedule with wrapper FSDP (#90387)
- This PR introduces a new concept, the _communication module_ (denoted `comm_module`), that represents the module responsible for the unshard/reshard pair for a `FlatParamHandle`. This is well-defined because the current design assumes that each `FlatParamHandle` only has _one_ unshard/reshard pair for either the forward or backward pass.
    - For the wrapper code path, the `comm_module` is exactly the module already being passed to the `FlatParamHandle` constructor.
    - For the composable code path, the `comm_module` is not necessarily the module already being passed to the `FlatParamHandle`. This is because the module already being passed is always the local FSDP root module to give complete FQNs, instead of local FQNs. Distinguishing the communication module from the local FSDP root module can provide more flexibility for non-recursive wrapping designs in the future.
- This PR adds a unit test `test_unshard_reshard_order` that explicitly checks that `_unshard` and `_reshard` are called in the exactly the same order across the two code paths.
- This PR does not fix `test_checkpoint_fsdp_submodules_use_reentrant`. However, the error message changes, so this PR accommodates that.
    - The error is now the same as if we used the equivalent wrapper FSDP:
    ```
    test_model.u1 = FSDP(test_model.u1, use_orig_params=True)
    test_model.u2 = FSDP(test_model.u2, use_orig_params=True)
    ```
    - The error is also the same as if we used wrapper FSDP with `use_orig_params=False`, so it is not unique to `use_orig_params=True`.

---

**`comm_module` Example**

```
model = Model(
    seq1: nn.Sequential(
        nn.Linear
        nn.ReLU
        nn.Linear
        nn.ReLU
    )
    seq2: nn.Sequential(
        nn.Linear
        nn.ReLU
        nn.Linear
        nn.ReLU
    )
)
policy = ModuleWrapPolicy({nn.Sequential})
fully_shard(model, policy=policy)
FullyShardedDataParallel(model, auto_wrap_policy=policy)
```
- This policy constructs two `FlatParamHandle`s, one for `seq1` and one for `seq2`.
- `FullyShardedDataParallel` will pass `seq1` and `seq2` as the `module` argument to the two `FlatParamHandle`s, respectively.
- `fully_shard()` will pass `model` as the `module` argument to every `FlatParamHandle`.
- `FullyShardedDataParallel` will pass `seq1` and `seq2` as the `comm_module` argument to the two `FlatParamHandle`s, respectively.
- `fully_shard()` will pass `seq1` and `seq2` as the `comm_module` argument to the two `FlatParamHandle`s, respectively.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90387
Approved by: https://github.com/mrshenli
2022-12-08 15:55:20 +00:00
4011597dd4 [Composable API] Refactor test_fully_shard.py to use common models (#90386)
Unlike for FSDP, where we already diverged to using per-test-file models, let us try to use the same set of models for the composable API effort. This can improve debugging efficiency because we know which module structures we support and which we do not _across all of our composable APIs_.

This PR had to perform some surgery for `test_materialize_meta_module`. Writing a correct parameter initialization function for meta device initialization is not easy, and we should revisit this. The old implementation, which followed the style of the previous unit tests--namely, using `module.to_empty()`--is actually incorrect for nested FSDP applications because `module.to_empty()` will re-initialize already materialized parameters and the module materialization proceeds bottom up. The existing unit test in `test_fsdp_meta.py` passes because it sets every parameter to ones (`self.weight.fill_(1)`), which is idempotent to re-initialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90386
Approved by: https://github.com/mrshenli
2022-12-08 15:32:36 +00:00
5ca4e95f6c [Composable API] Move test models to common file (#90385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90385
Approved by: https://github.com/mrshenli
2022-12-08 15:32:36 +00:00
3fdb5f2dda [inductor] Use decomposition for _to_copy (#90314)
Summary: also contains a fix for https://github.com/pytorch/pytorch/issues/89633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90314
Approved by: https://github.com/ngimel
2022-12-08 15:25:44 +00:00
dc40b6d043 Upgrade oneDNN to v2.7.2 (#90051)
This PR is to upgrade oneDNN to v2.7.2.

### oneDNN v2.7.1 & 2.7.2 changes:
Fixes #89104
Updated ITT API version to 3.23.0

### Performance Benchmark
TorchBench tests were run on an ICX machine with 40 cores.
Intel OpenMP and tcmalloc were preloaded.
![image](https://user-images.githubusercontent.com/61222868/205240855-04e2d50f-8b3a-4097-9038-fdd0c0fc93b9.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90051
Approved by: https://github.com/XiaobingSuper, https://github.com/jgong5
2022-12-08 09:41:02 +00:00
b485781440 Add a transform for positive-definite matrices. (#76777)
The `PositiveDefiniteTransform` is required to transform from an unconstrained space to positive definite matrices, e.g. to support testing the Wishart mode in #76690. It is a simple extension of the `LowerCholeskyTransform`.

I've also added a small test that ensures the generated data belong to the domain of the associated transform. Previously, the data generated for the inverse transform of the `LowerCholeskyTransform` wasn't part of the domain, and the test only passed because the comparison uses `equal_nan=True`.
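
A quick usage sketch, assuming the transform is exposed under torch.distributions.transforms:
```
import torch
from torch.distributions.transforms import PositiveDefiniteTransform

t = PositiveDefiniteTransform()
x = torch.randn(3, 3)          # unconstrained square matrix
A = t(x)                       # positive-definite output
torch.linalg.cholesky(A)       # succeeds only for positive-definite matrices
y = t.inv(A)                   # back to the unconstrained space
```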

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76777
Approved by: https://github.com/lezcano, https://github.com/fritzo, https://github.com/soumith
2022-12-08 09:18:44 +00:00
c00b135adf Remove deprecated call to tf.io.gfile.get_filesystem (#89832)
Fixes #30966 . Fixes #47139
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89832
Approved by: https://github.com/soumith
2022-12-08 08:53:27 +00:00
ecd784667c Avoid overflow in tensorboard image summary (#90423)
Fix #90419

Added some code such that the test will update the expect files when `expecttest.ACCEPT` is True.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90423
Approved by: https://github.com/soumith
2022-12-08 08:31:52 +00:00
1978773399 [LTC] Overlap data creation and ir_value setting (#90438)
Summary:
Upstreaming changes from torch_xla to lazy tensor core: https://github.com/pytorch/xla/pull/4011.
It overlaps data creation and ir_value setting with previous executions.

To be noted, this is a clone of https://github.com/pytorch/pytorch/pull/87119, and the author is @aws-rhsoln.

Test Plan:
CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90438
Approved by: https://github.com/JackCaoG
2022-12-08 08:11:01 +00:00
9c80f13692 [Resubmit] state_dict_pre_hook (#90435)
Resubmit of https://github.com/pytorch/pytorch/pull/88541 which got stale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90435
Approved by: https://github.com/fegin
2022-12-08 07:54:14 +00:00
de016b3799 [pruning][core][feature] Implement prune for structured pruning (#89777)
Summary:

This PR implements `prune` in BaseStructuredSparsifier:

`prune` is a function that takes in a model with structured sparsity parametrizations (the result of `prepare`) and will return a resized model with the masked-out weights removed.

`prune` is defined by a mapping from **patterns** to different **pruning functions**.
	- **patterns** are just sequences of operations, for example `(nn.Linear, activation, nn.Linear)`
	- **pruning functions** are functions that take in a matched pattern as args and will resize the appropriate layer sizes and weights.
	  ```
	  def prune_linear_activation_linear(linear1, activation, linear2):
		pass
	  ```
	- This is one line in the pattern config `(nn.Linear, activation, nn.Linear): prune_linear_activation_linear`

At a high level `prune` works by finding instances of the graph that match different patterns and then calling the mapped pruning functions on those matched patterns.
This is unlike the previous code which attempted to do both at the same time.

There may be some gaps in the patterns compared to the previous implementation, but the conversion functionality support should be the same.

Currently we have pruning functions for the following patterns:
    - linear -> linear
    - linear -> activation -> linear
    - conv2d -> conv2d
    - conv2d -> activation -> conv2d
    - conv2d -> activation -> pool -> conv2d
    - conv2d -> pool -> activation -> conv2d
    - conv2d -> adaptive pool -> flatten -> linear

Added in MyPy type hints as well for the prune_functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89777
Approved by: https://github.com/vkuzo
2022-12-08 07:13:24 +00:00
c20d41253f [LTC] Tweak LazyGraphExecutor for XLA (#90420)
Summary:
This patch moves some of the data structures from private to protected such that XLAGraphExecutor can reuse them.

Test Plan:
CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90420
Approved by: https://github.com/JackCaoG
2022-12-08 06:56:23 +00:00
1a48ae96ba [PT-D][Easy] Reformat the optim code within PTD code base (#90399)
Just run two commands:
```
ufmt format torch/distributed/optim/
ufmt format test/distributed/optim/
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90399
Approved by: https://github.com/awgu
2022-12-08 06:38:59 +00:00
cbb2d5af81 Fix issue 38095 TODO in test_multiprocessing.py (#90335)
Fix TODO related to https://github.com/pytorch/pytorch/issues/38095
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90335
Approved by: https://github.com/clee2000
2022-12-08 06:27:08 +00:00
06c98e673f [ONNX] Fix ignored small eps in layer normalization in fp16 (#89869)
Prior to this change, the symbolic_fn `layer_norm` (before ONNX opset 17) always loses precision when eps is smaller than what the Float type can represent, while PyTorch always takes eps as a Double. This PR adds `onnx::Cast` to the eps-related operations to prevent losing precision during the calculation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89869
Approved by: https://github.com/BowenBao
2022-12-08 06:13:09 +00:00
5f3ca208c5 Revert "add save and load stats in memory_tracker (#90144)"
This reverts commit 1f137c1e2f738d9021b5e22fb6e52d41b780a1a8.

Reverted https://github.com/pytorch/pytorch/pull/90144 on behalf of https://github.com/ezyang due to dirty git working copy broke master
2022-12-08 05:16:56 +00:00
22a249e44e Revert "[Inductor] More robust stride and offset extraction from index expressions (#90184)"
This reverts commit 71f27f768839394ec226c37a763bd524d8589f07.

Reverted https://github.com/pytorch/pytorch/pull/90184 on behalf of https://github.com/ngimel due to catastrophically regresses performance
2022-12-08 05:04:15 +00:00
25eb7c3ae3 Clean up dependancy for flatbuffer_loader (#86041)
Test Plan: waitforsandcastle

Differential Revision: D38445936

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86041
Approved by: https://github.com/cccclai
2022-12-08 03:48:04 +00:00
37892041a1 Always compile tiny graphs with AOTAutograd (#89775)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89775
Approved by: https://github.com/anjali411, https://github.com/bdhirsh
2022-12-08 03:41:29 +00:00
b8b7480065 [Checkpoint][2D][6/N] Add optimizer and update default_planner to core distributed (#90212)
This is the last PR for integrating 2D into core distributed.

This PR does the following:
1. Add optimizer.py: this adds the ability to load a state_dict in conjunction with FSDP sharded optimizer state.
2. Update default_planner.py to support 2D checkpoint.
3. Add test_fsdp_optim_state.py as a unit test for No. 1.
4. Fix bug in torch/testing/_internal/distributed/checkpoint_utils.py
5. Rename the filename for the APIs that should be private. Will organize and cleanup further in following PRs. #90328

Docstring and integration test will be added in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90212
Approved by: https://github.com/wanchaol
2022-12-08 02:53:29 +00:00
36ac095ff8 Migrate PyTorch to C++17 (#85969)
With CUDA-10.2 gone we can finally do it!

This PR mostly contains build system related changes, invasive functional ones are to be followed.
Among many expected tweaks to the build system, here are few unexpected ones:
 - Force onnx_proto project to be updated to C++17 to avoid `duplicate symbols` error when compiled by gcc-7.5.0, as storage rule for `constexpr` changed in C++17, but gcc does not seem to follow it
 - Do not use `std::apply` on CUDA but rely on the built-in variant, as it results in test failures when CUDA runtime picks host rather than device function when `std::apply` is invoked from CUDA code.
 - `std::decay_t` -> `::std::decay_t` and `std::move` -> `::std::move`, as VC++ for some reason claims that the `std` symbol is ambiguous
 - Disable use of `std::aligned_alloc` on Android, as its `libc++` does not implement it.

Some prerequisites:
 - https://github.com/pytorch/pytorch/pull/89297
 - https://github.com/pytorch/pytorch/pull/89605
 - https://github.com/pytorch/pytorch/pull/90228
 - https://github.com/pytorch/pytorch/pull/90389
 - https://github.com/pytorch/pytorch/pull/90379
 - https://github.com/pytorch/pytorch/pull/89570
 - https://github.com/facebookincubator/gloo/pull/336
 - https://github.com/facebookincubator/gloo/pull/343
 - 919676fb32

Fixes https://github.com/pytorch/pytorch/issues/56055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85969
Approved by: https://github.com/ezyang, https://github.com/kulinseth
2022-12-08 02:27:48 +00:00
f2d95765e4 [pthreadpool] Set max threadlimit to tsan limit (#89453)
Summary:
This will make sure we don't run into an internal assert for clang tsan which has a cap of 63 on concurrently held lock count.
Seems like it is failing with 64 since the comparison is `<`, so setting it to 63 here.

```
llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:67 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))"
```

Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan:
CI

Sandcastle run

Reviewed By: kimishpatel, salilsdesai

Differential Revision: D41444710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89453
Approved by: https://github.com/mcr229
2022-12-08 02:02:53 +00:00
772b726068 Revert "Disable dynamo tracing torchrec.distributed (#90087)" (#90416)
This reverts commit 7e9a8a1361a090cee86544a3c029b9b4ed622e9c.

This revert fixes a TorchBench dlrm amp crash. The automatic revert failed due to a conflict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90416
Approved by: https://github.com/yanboliang, https://github.com/malfet
2022-12-08 01:50:54 +00:00
00118f5c30 Fix issue 38095 TODO in test_jit_fuser_te.py (#90246)
Fix TODO related to https://github.com/pytorch/pytorch/issues/38095
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90246
Approved by: https://github.com/clee2000
2022-12-08 01:39:26 +00:00
ad188a227e Introduce CUDA Device Assertions Infrastructure (#84609)
Summary:
This diff introduces a set of changes that makes it possible for the host to get assertions from CUDA devices. This includes the introduction of

**`CUDA_KERNEL_ASSERT2`**

A preprocessor macro to be used within a CUDA kernel that, upon an assertion failure, writes the assertion message, file, line number, and possibly other information to UVM (Managed memory). Once this is done, the original assertion is triggered, which places the GPU in a Bad State requiring recovery. In my tests, data written to UVM appears there before the GPU reaches the Bad State and is still accessible from the host after the GPU is in this state.

Messages are written to a multi-message buffer which can, in theory, hold many assertion failures. I've done this as a precaution in case there are several, but I don't actually know whether that is possible and a simpler design which holds only a single message may well be all that is necessary.

**`TORCH_DSA_KERNEL_ARGS`**

This preprocessor macro is added as an _argument_ to a kernel function's signature. It expands to supply the standardized names of all the arguments needed by `C10_CUDA_COMMUNICATING_KERNEL_ASSERTION` to handle device-side assertions. This includes, e.g., the name of the pointer to the UVM memory the assertion would be written to. This macro abstracts the arguments so there is a single point of change if the system needs to be modified.

**`c10::cuda::get_global_cuda_kernel_launch_registry()`**

This host-side function returns a singleton object that manages the host's part of the device-side assertions. Upon allocation, the singleton allocates sufficient UVM (Managed) memory to hold information about several device-side assertion failures. The singleton also provides methods for getting the current traceback (used to identify when a kernel was launched). To avoid consuming all the host's memory the singleton stores launches in a circular buffer; a unique "generation number" is used to ensure that kernel launch failures map to their actual launch points (in the case that the circular buffer wraps before the failure is detected).

**`TORCH_DSA_KERNEL_LAUNCH`**

This host-side preprocessor macro replaces the standard
```
kernel_name<<<blocks, threads, shmem, stream>>>(args)
```
invocation with
```
TORCH_DSA_KERNEL_LAUNCH(blocks, threads, shmem, stream, args);
```
Internally, it fetches the UVM (Managed) pointer and generation number from the singleton and appends these to the standard argument list. It also checks to ensure the kernel launches correctly. This abstraction on kernel launches can be modified to provide additional safety/logging.

**`c10::cuda::c10_retrieve_device_side_assertion_info`**
This host-side function checks, when called, that no kernel assertions have occurred. If one has, it raises an exception with:
1. Information (file, line number) of what kernel was launched.
2. Information (file, line number, message) about the device-side assertion
3. Information (file, line number) about where the failure was detected.

**Checking for device-side assertions**

Device-side assertions are most likely to be noticed by the host when a CUDA API call such as `cudaDeviceSynchronize` is made and fails with a `cudaError_t` indicating
> CUDA error: device-side assert triggered CUDA kernel errors

Therefore, we rewrite `C10_CUDA_CHECK()` to include a call to `c10_retrieve_device_side_assertion_info()`. To make the code cleaner, most of the logic of `C10_CUDA_CHECK()` is now contained within a new function `c10_cuda_check_implementation()` to which `C10_CUDA_CHECK` passes the preprocessor information about filenames, function names, and line numbers. (In C++20 we can use `std::source_location` to eliminate macros entirely!)
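
From Python, a device-side assertion typically surfaces like this (an illustration only; it needs a CUDA device, and the failing kernel leaves the CUDA context unusable afterwards):
```
import torch

x = torch.zeros(4, device="cuda")
bad_index = torch.tensor([10], device="cuda")  # deliberately out of bounds
try:
    y = x[bad_index]
    torch.cuda.synchronize()  # the failure is usually reported at a later sync
except RuntimeError as err:
    # with this infrastructure, the message can also include the launch site and
    # the file/line/message of the device-side assertion itself
    print(err)
```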

# Notes on special cases

* Multiple assertions from the same block are recorded
* Multiple assertions from different blocks are recorded
* Launching kernels from many threads on many streams seems to be handled correctly
* If two process are using the same GPU and one of the processes fails with a device-side assertion the other process continues without issue
* X Multiple assertions from separate kernels on different streams seem to be recorded, but we can't reproduce the test condition
* X Multiple assertions from separate devices should be all be shown upon exit, but we've been unable to generate a test that produces this condition

Differential Revision: D37621532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84609
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-08 01:26:07 +00:00
f99f239531 Fix issue 38095 TODOs in gloo tests (#89985)
Fix TODOs related to https://github.com/pytorch/pytorch/issues/38095
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89985
Approved by: https://github.com/ZainRizvi
2022-12-08 01:12:37 +00:00
1ba94b3882 Support pickle version 4 by adding missing ops (#90223)
Summary:
In this logic, we are traversing the entries to find the module for STACK_GLOBAL entries.

According to 2837241f22/Lib/pickletools.py (L1799) we need to look for GET, BINGET and LONG_BINGET.

So this diff updates that. Also, while testing, I found some cases of empty modules (for ops such as tanh); for these, I added the option to skip processing.
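
A small illustration of the opcode walk involved, using pickletools directly (not the actual scanner in this diff):
```
import pickle
import pickletools

# pickling the same global twice: the first reference emits STACK_GLOBAL,
# the second comes back as a memo lookup (BINGET), which the scanner must follow
payload = pickle.dumps([abs, abs], protocol=4)

for opcode, arg, pos in pickletools.genops(payload):
    if opcode.name in {"GET", "BINGET", "LONG_BINGET", "STACK_GLOBAL"}:
        print(pos, opcode.name, arg)
```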

Test Plan: Tested with f392778829

Differential Revision: D41748595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90223
Approved by: https://github.com/PaliC
2022-12-08 01:06:40 +00:00
d5c6a74699 Rewrite dynamo cond() handling to not recursively call export (#90286)
The original implementation of cond() operator support in dynamo operated by recursively calling export() on the inner subgraph.  This is problematic for a number of reasons:

* My original motivating reason: the original implementation had to play tricks to feed real tensors to the recursive export call, which means that it doesn't work well with tracing with dynamic shapes (where we MUST stay in fake tensors to accurately track dynamic shapes across the cond invocation)
* If there are pending side effects, the recursive export() call won't see those side effects (as they are only tracked by Dynamo, not actually applied to the Python environment.) You can see an example where dynamo cond tracing does the wrong thing at https://github.com/pytorch/pytorch/pull/90208
* If there were side effects inside the true/false branch, these side effects were silently lost (as the export only returns the graph of tensor operations, and not any of the residual Python bytecodes necessary to reapply any side effects.) This could have substantive effects on the export of subsequent parts of the model, as those parts of the models could rely on the side effects.
* It was not possible to track NN module accesses inside the true/false branches, necessitating a hack where the NN module was explicitly passed in as an input to cond https://github.com/pytorch/pytorch/pull/87020#issuecomment-1338842844 which doesn't really make any sense from a backend compilation perspective
* Guards induced from the inside of the true/false branch were not properly propagated to the top level guards; they were just silently dropped (in fact, the original implementation checked that the true/false branch produce the same guards which... is not useful? Like, I don't think that actually is even necessary for correctness)

This PR replaces the old implementation with a new implementation based on graphstate checkpointing. The basic idea is to process a cond(), we checkpoint the state of our interpreter, run the true branch, rollback to our checkpoint, run the false branch, rollback to our checkpoint and then merge the changes from both of the checkpoints. I require the true/false branches to have exactly the same side effects, but union their guards.
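
For reference, a cond() call of the kind being traced looks roughly like this (the experimental import path as of this commit; it may have moved since, and this is a sketch rather than a spec):
```
import torch
from functorch.experimental.control_flow import cond  # experimental at this point

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile
def f(x):
    return cond(x.sum() > 0, true_fn, false_fn, [x])

f(torch.randn(4))
```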

Some of the details:

* Dynamo is too aggressive with tracking side effects when processing closures, c.f. https://github.com/pytorch/torchdynamo/pull/233/files#r1040480078 The basic problem is whenever I define a closure, this immediately counts as a side effect, even if I didn't actually mutate anything. This triggered on the nested cond export example. To prevent this from happening, I optimistically avoid tracking side effects, but if a STORE_DEREF happens, I restart analysis with the relevant Source.name() added to `mutated_closure_cell_contents` so we start tracking on closure allocation. This is enough to fix the relevant test.
* For the most part, I assert that the graph states must be equivalent after applying the true/false branches. During debugging, I found it useful to be able to compare two graph states and give a better description about what the divergence was. You can test this using the `diff()` method I've added to a few structures.
* The implementation now supports NestedUserFunctionVariable, which is nice as it allows the true/false branches to be defined closer to the cond implementation.
* I fixed the naming of the true/false subgraphs; previously they were named `name_0`, `name_1`, now they are named `cond_true_0` and `cond_false_0`
* I added `name_to_input` to the saved graph state. I don't actually know if this is necessary, but it seemed like a good idea.
* I have to play some tricks to get the speculating execution of the true/false branch to record into a subgraph. After a careful read of OutputGraph, I found that what would work is overriding graph with a fresh Graph that we want to write things into, and manually setting up the inputs/outputs. It's a little delicate as you have to make sure you reset the Graph to its original before you restore a checkpoint, as checkpoints don't actually save graph for efficiency, and just undo changes on the graph. This capability may usefully get refactored to OutputGraph but I didn't do it in this PR for simplicity.

There are some further problems with the cond() implementation that I leave for future work. Most of these were preexisting with the original implementation.

* Not a problem per se, but if an NN module is used by both the true/false branch, it will show up in the final graph twice (since it has to be a submodule of the GraphModule that makes use of it.) I hope the export pipeline can deal with this.
* List of tensor output for cond is not supported.
* The true/false return values may not have consistent sizes/dims/etc, and we don't check them for consistency.
* If we modify fake tensors in the true/false branches, we aren't rolling them back, c.f. https://github.com/pytorch/torchdynamo/issues/1840

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90286
Approved by: https://github.com/voznesenskym
2022-12-08 01:05:12 +00:00
54d344b0b7 Type torch._dynamo.side_effects (#90202)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90202
Approved by: https://github.com/voznesenskym
2022-12-08 01:05:12 +00:00
ca5f69ef19 Convert InstructionTranslatorGraphState and OutputGraphState to NamedTuple (#90186)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90186
Approved by: https://github.com/voznesenskym
2022-12-08 01:05:12 +00:00
1119aac485 Type torch._dynamo.symbolic_convert (#90185)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90185
Approved by: https://github.com/voznesenskym
2022-12-08 01:05:12 +00:00
7abd035b2f Add missing mypy-nofollow.ini (#90179)
I'm not sure how lintrunner worked without this lol.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90179
Approved by: https://github.com/albanD, https://github.com/voznesenskym
2022-12-08 01:05:12 +00:00
47071c3d47 [quant] Add support for symmetric quant in executorch (#90304)
Summary:
This PR adds symmetric quant in the backend config for executorch

Test Plan:
NA, will be tested in meta internal flow

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90304
Approved by: https://github.com/cccclai, https://github.com/jcaip, https://github.com/andrewor14
2022-12-08 01:03:00 +00:00
9f7bc7bc24 Revert "[Quant][fx][bc-breaking] Make convert.py smaller (#90189)"
This reverts commit 824641b083860df4d7ffef06a798ea2702bc4bde.

Reverted https://github.com/pytorch/pytorch/pull/90189 on behalf of https://github.com/seemethere due to Fails internal tests due to potential circular import, see https://www.internalfb.com/diff/D41817429?dst_version_fbid=1453307181865235&transaction_fbid=899728221278938
2022-12-08 00:51:13 +00:00
d7c30e11c6 [inductor] Remove .to from lowering (#90280)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90280
Approved by: https://github.com/ngimel
2022-12-08 00:40:41 +00:00
b8b439aede C++17 friendly iterator implementation (#90379)
Get rid of std::iterator inheritance/references for `c10::DictIterator`, `c10::IListRefIterator` and `c10::ListIterator`

Followup after https://github.com/pytorch/pytorch/pull/90174

Fixes deprecation warnings and extension compilation failures with VC++,
which raises the following errors:
```
C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\include\ATen/core/IListRef.h(517): error C4996: 'std::iterator<std::bidirectional_iterator_tag,T,ptrdiff_t,T *,T &>::value_type': warning STL4015: The std::iterator class template (used as a base class to provide typedefs) is deprecated in C++17. (The <iterator> header is NOT deprecated.) The C++ Standard has never required user-defined iterators to derive from std::iterator. To fix this warning, stop deriving from std::iterator and start providing publicly accessible typedefs named iterator_category, value_type, difference_type, pointer, and reference. Note that value_type is required to be non-const, even for constant iterators. You can define _SILENCE_CXX17_ITERATOR_BASE_CLASS_DEPRECATION_WARNING or _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS to acknowledge that you have received this warning.

C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\include\ATen/core/List.h(169): error C4996: 'std::iterator<std::random_access_iterator_tag,T,ptrdiff_t,T *,T &>::difference_type': warning STL4015: The std::iterator class template (used as a base class to provide typedefs) is deprecated in C++17. (The <iterator> header is NOT deprecated.) The C++ Standard has never required user-defined iterators to derive from std::iterator. To fix this warning, stop deriving from std::iterator and start providing publicly accessible typedefs named iterator_category, value_type, difference_type, pointer, and reference. Note that value_type is required to be non-const, even for constant iterators. You can define _SILENCE_CXX17_ITERATOR_BASE_CLASS_DEPRECATION_WARNING or _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS to acknowledge that you have received this warning.

```

Discovered while working on https://github.com/pytorch/pytorch/pull/85969
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90379
Approved by: https://github.com/ezyang, https://github.com/dagitses
2022-12-08 00:30:20 +00:00
5351176caa Kineto activity fix (#89785)
Continuation of https://github.com/pytorch/pytorch/pull/88207

A compile time guard was preventing ActivityType::CUDA from being available on rocm. This caused both the GPU_FALLBACK and CUDA modes to be active at the same time. So operators were being charged gpu time for the hipEventRecord ranges and the actual kernel execution times. This caused incorrect (and often negative) cuda times, in e.g. table().

Previously a cmake variable was not being propagated to a '-D', causing an issue on Windows, which uses cuda but not cupti.
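
The kind of profiling run affected (illustrative; requires a GPU build with Kineto):
```
import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(1024, 1024, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.mm(a, a)

# before this fix, ROCm builds could report incorrect (even negative) cuda times here
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```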

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89785
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2022-12-08 00:24:55 +00:00
79406378ae [primTorch] Add prim and ref for as_strided_scatter (#88426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88426
Approved by: https://github.com/mruberry
2022-12-08 00:17:39 +00:00
1f137c1e2f add save and load stats in memory_tracker (#90144)
Add save and load stats to memory_tracker, so that users can plot the traces elsewhere, rather than only inside the trainer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90144
Approved by: https://github.com/rohan-varma
2022-12-08 00:17:21 +00:00
bc93454e4a correctly set strides for expanded/unsqueezed dimensions (#90341)
Fixes https://github.com/pytorch/torchdynamo/issues/1959, #90260
However, I wasn't able to make existing stride tests fail before the fix, even though I'm comparing all, not just significant strides.
Separately running refs on meta tensors produces wrong strides as shown in #90260, however, it looks like in meta tests some other way of computing meta info is used (I've been running
```
pytest -s -v test/test_meta.py -k test_meta_outplace_expand_cuda_float64
```
and verified that it has sample input that should fail, and that it indeed compares all the strides, but the produced `meta_rs` results somehow still had correct strides).

Edit: @SherlockNoMad helped me figure out how to fail the tests, and now I've set the correct ops for checking. `expand` fails for some test inputs because it special-cases 0-dim input case, correctly modeling it in prims would require a lot of changes, so skipping that for now.
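
For context, the eager stride rules being matched here (a quick illustration):
```
import torch

x = torch.randn(3, 1)
print(x.stride())               # (1, 1)
print(x.expand(3, 4).stride())  # the expanded dim gets stride 0: (1, 0)
print(x.unsqueeze(0).stride())  # the inserted dim gets a non-trivial stride: (3, 1, 1)
```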

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90341
Approved by: https://github.com/SherlockNoMad
2022-12-07 23:38:33 +00:00
50ec416599 Fix C2 Ambiguous namespace (#89534)
Summary: cuda:: is an ambiguous namespace. Make it explicitly c10::cuda

Differential Revision: D41469007
/caffe2/caffe2/core/context_gpu.cu(564): error: "caffe2::cuda" is ambiguous
/caffe2/caffe2/core/context_gpu.cu(564): error: expected a ";"
/caffe2/caffe2/core/context_gpu.cu(568): warning #12-D: parsing restarts here after previous syntax error
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/caffe2/caffe2/core/context_gpu.cu(569): error: "caffe2::cuda" is ambiguous
/caffe2/caffe2/core/context_gpu.cu(628): error: "caffe2::cuda" is ambiguous
4 errors detected in the compilation of "/caffe2/caffe2/core/context_gpu.cu".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89534
Approved by: https://github.com/malfet
2022-12-07 23:36:41 +00:00
56ab94d6e4 [Vulkan][TCC] Add tests for quantized convolution with QUInt8 activation, weights and bias (#90012)
Summary:
- Registered vulkan_prepack::create_qconv2d_context to the QuantizedCPU backend.
- Registered vulkan_prepack::run_qconv2d_context to the Vulkan backend.
- Added function test_quantized_conv2d, in order to test Vulkan Quantized Conv2d with QUInt8 activation, weight and bias (all QUInt8).
- Added multiple tests for vulkan quantized conv2d (regular, depthwise and pointwise). All these tests make use of the test_quantized_conv2d function.

This function tests the correctness of vulkan quantized conv2d, by comparing the following two processes:
(we start with randomly generated float cpu tensors)
- random float cpu tensors -> to vulkan -> quantize them -> apply vulkan conv2d quantized op -> dequantize result -> to cpu
- random float cpu tensors -> quantize them -> dequantize -> apply cpu floating point conv2d op on dequantized tensors -> quantize result -> dequantize

This function takes three boolean flags that modify its behavior:
- prepacking:
  - if false, then we directly call at::native::vulkan::ops::quantized_conv2d
  - if true, then we call vulkan_prepack::create_qconv2d_context and vulkan_prepack::run_qconv2d_context.
- compute_quantization_params & random_quantization_params:
  - if both are false, all quantization params are fixed (given as input)
  - if compute_quantization_params is true, all params are computed
  - if random_quantization_params is true, the input params are random and the output params are computed.
(compute_quantization_params takes precedence over random_quantization_params)

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Reviewed By: SS-JIA

Differential Revision: D41047096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90012
Approved by: https://github.com/salilsdesai
2022-12-07 23:21:57 +00:00
e0f681aa85 Add manual cuda deps search logic (#90411)
If PyTorch is packaged into a wheel alongside [nvidia-cublas-cu11](https://pypi.org/project/nvidia-cublas-cu11/), which is designated as PureLib while the `torch` wheel is not, this can cause a torch_globals loading problem.

Fix that by searching for `nvidia/cublas/lib/libcublas.so.11` and `nvidia/cudnn/lib/libcudnn.so.8` across all `sys.path` folders.
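
Roughly, the lookup amounts to something like this (a simplified sketch, not the exact code added to torch/__init__.py):
```
import ctypes
import os
import sys

def preload_cuda_dep(rel_path):
    # scan every sys.path entry, since the nvidia-* wheels may be installed into
    # purelib while torch itself lands in platlib
    for base in sys.path:
        candidate = os.path.join(base, rel_path)
        if os.path.exists(candidate):
            ctypes.CDLL(candidate, mode=ctypes.RTLD_GLOBAL)
            return True
    return False

preload_cuda_dep("nvidia/cublas/lib/libcublas.so.11")
preload_cuda_dep("nvidia/cudnn/lib/libcudnn.so.8")
```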

Test plan:
```
docker pull amazonlinux:2
docker run --rm -t amazonlinux:2 bash -c 'yum install -y python3 python3-devel python3-distutils patch;python3 -m pip install torch==1.13.0;curl -OL https://patch-diff.githubusercontent.com/raw/pytorch/pytorch/pull/90411.diff; pushd /usr/local/lib64/python3.7/site-packages; patch -p1 </90411.diff; popd; python3 -c "import torch;print(torch.__version__, torch.cuda.is_available())"'
```

Fixes https://github.com/pytorch/pytorch/issues/88869

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90411
Approved by: https://github.com/atalman
2022-12-07 23:06:51 +00:00
3ef4fc2012 Automated submodule update: FBGEMM (#74729)
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: f99e161663

Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74729
Approved by: https://github.com/malfet
2022-12-07 22:36:35 +00:00
ecd418673b [FSDP][Easy] ufmt files (#90384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90384
Approved by: https://github.com/H-Huang
2022-12-07 21:18:23 +00:00
32973651e6 [Vulkan] Enable copying QInt8 and QInt32 tensors from cpu to vulkan. (#90357)
Summary:
Copying QInt8 and QInt32 from cpu to vulkan:
 - Added shader nchw_to_image_int8
 - Added shader nchw_to_image_int32

Copying QInt8 and QInt32 from vulkan to cpu
Note: This functionality is currently disabled until issues on Android are resolved.
- Added shader image_to_nchw_int32
- QInt8 works with the same existing image_to_nchw_quantized shaders

Added multiple tests for each supported dtype:
- cpu_to_vulkan_and_dequantize:
These tests check the correctness of copying quantized cpu tensor to vulkan by comparing the output of the following:
  - cpu float tensor -> quantize -> to vulkan -> dequantize -> to cpu
  - cpu float tensor -> quantize -> dequantize
- cpu_to_vulkan_and_vulkan_to_cpu
(currently disabled until copying vulkan quantized to cpu is enabled):
These tests check the correctness of copying from cpu to vulkan and from vulkan to cpu by creating a random cpu float tensor, quantizing it, then copying it to vulkan, then back to cpu and comparing the output tensor to the original quantized tensor.
- quantize_per_tensor_and_vulkan_to_cpu
(currently disabled until copying vulkan quantized to cpu is enabled):
These tests check the correctness of copying quantized tensor from vulkan to cpu by comparing the output of the following:
  - cpu float tensor -> to vulkan -> quantize -> to cpu
  - cpu float tensor -> quantize

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Reviewed By: kimishpatel

Differential Revision: D41654287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90357
Approved by: https://github.com/SS-JIA
2022-12-07 21:17:35 +00:00
a076bdb357 [fx] Copy codegen in legalize_graph (#90023)
Test Plan: CI

Differential Revision: D41666330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90023
Approved by: https://github.com/SherlockNoMad
2022-12-07 21:09:38 +00:00
6dcc214ac2 Fix AssertionError fake_mode is not None in distributed (#90392)
Fixes https://github.com/pytorch/pytorch/issues/90375

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90392
Approved by: https://github.com/voznesenskym
2022-12-07 20:12:39 +00:00
2ad6ed8ac9 Fix some typed storage is deprecated warnings. (#89867)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89867
Approved by: https://github.com/albanD
2022-12-07 20:09:57 +00:00
1b1301f16a Revert "[pruning][core][feature] Implement prune for structured pruning (#89777)"
This reverts commit 3531e44307fa58460e2488bcaace948678d6cf9f.

Reverted https://github.com/pytorch/pytorch/pull/89777 on behalf of https://github.com/clee2000 due to breaking test_ao_sparcity due to import 3531e44307 https://github.com/pytorch/pytorch/actions/runs/3641476330/jobs/6147830487, probably a landrace with 824641b083860df4d7ffef06a798ea2702bc4bde?
2022-12-07 19:41:15 +00:00
44779d9bc6 [FSDP][optim_state_dict][2/N] Add _get_fqn_to_fsdp_param_info to map from original FQN to flat_param (#89899)
**Motivation:**
Add a helper to map from the FQN to the corresponding flat_param. The helper will directly get flat_param from fsdp_state and flat_handler as flat_param is not registered to the module if `use_orig_params` is True.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89899
Approved by: https://github.com/awgu
2022-12-07 19:40:47 +00:00
f7cdd3a7a0 [inductor] Use a large tolerance for botnet26t_256 (#90383)
Summary: botnet26t_256 shows random tolerance failure on CI. The root
cause of this randomness is still to be investigated, but let's use a
larger tolerance for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90383
Approved by: https://github.com/ezyang
2022-12-07 19:35:06 +00:00
2b0b4bb6fd [Dynamo] Fix llvm target for meta schedule & add torch to tvm ndarray helper func (#90214)
Fixes #90213. Also, a torch.tensor to tvm.nd.array helper function is added to avoid a data copy, using dlpack.

@jansel @Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90214
Approved by: https://github.com/wconstab
2022-12-07 19:23:56 +00:00
6a7659f304 Fix issue 38095 TODO in test_autograd.py (#90031)
Fix TODO related to https://github.com/pytorch/pytorch/issues/38095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90031
Approved by: https://github.com/clee2000
2022-12-07 19:09:43 +00:00
4b1053497c [vmap] Prepend "legacy" to files for old vmap implementation (#90324)
We have an older torch.vmap implementation. It is no longer supported.
It still needs to exist somewhere for the sake of BC with
torch.autograd.functional.

This PR makes it clear what files are meant for implementing the old
vmap implementation. I've seen a couple of PRs recently adding support
for the old vmap implementation, so this will lessen the confusion.

Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90324
Approved by: https://github.com/samdow
2022-12-07 18:46:15 +00:00
94d800ffd1 Make Transformers compilable by C++17 (#90389)
The `register` keyword is removed in C++17, but it is kept there under an ifdef
as I have not measured the perf implications on older compilers, though
there shouldn't be any: all modern compilers are supposed to simply
ignore it.

This code originates from https://github.com/facebookresearch/xformers/pull/375; I will propose a similar PR to remove the register keyword usage in that repo.

Yet another thing discovered while working on https://github.com/pytorch/pytorch/pull/85969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90389
Approved by: https://github.com/drisspg
2022-12-07 18:10:44 +00:00
3531e44307 [pruning][core][feature] Implement prune for structured pruning (#89777)
Summary:

This PR implements `prune` in BaseStructuredSparsifier:

`prune` is a function that takes in a model with structured sparsity parametrizations (the result of `prepare`) and returns a resized model with the masked-out weights removed.

`prune` is defined by a mapping from **patterns** to different **pruning functions**.
	- **patterns** are just sequences of operations, for example `(nn.Linear, activation, nn.Linear)`
	- **pruning functions** are functions that take in a matched pattern as args and will resize the appropriate layer sizes and weights.
	  ```
	  def prune_linear_activation_linear(linear1, activation, linear2):
		pass
	  ```
	- This is one line in the pattern config `(nn.Linear, activation, nn.Linear): prune_linear_activation_linear`

At a high level `prune` works by finding instances of the graph that match different patterns and then calling the mapped pruning functions on those matched patterns.
This is unlike the previous code which attempted to do both at the same time.

There may be some gaps in the patterns compared to the previous implementation, but the conversion functionality support should be the same.

Currently we have pruning functions for the following patterns:
    - linear -> linear
    - linear -> activation -> linear
    - conv2d -> conv2d
    - conv2d -> activation -> conv2d
    - conv2d -> activation -> pool -> conv2d
    - conv2d -> pool -> activation -> conv2d
    - conv2d -> adaptive pool -> flatten -> linear

Added in MyPy type hints as well for the prune_functions.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89777
Approved by: https://github.com/vkuzo
2022-12-07 17:52:01 +00:00
d680ea7e36 [quant]Fix public bindings for DTypeWithConstraint (#90315)
Summary:
Need this to fix `test_public_bindings`.

Test Plan:
`python test/test_public_bindings.py`
Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90315
Approved by: https://github.com/HDCharles
2022-12-07 17:52:01 +00:00
4cdc96fb4f Add hooks structure for passing around user provided hooks, add a new guard_failure_fn (#90371)
This PR introduces a new function we can pass to torch._dynamo.optimize - guard_failure_fn. Usage is in the PR, and the one stacked on top of it, but the gist of it is that it emits failed guard reason strings alongside code. This is useful for tests and debugging, as it gives far finer grained assertions and control than the compile counter alone.

This is a resubmit of https://github.com/pytorch/pytorch/pull/90129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90371
Approved by: https://github.com/ezyang
2022-12-07 17:51:53 +00:00
c92cf6bee3 [BE][CI] Add windows test run instructions (#90388)
Specifies how to activate VisualStudio, Anaconda and set `PYTHONPATH` to run tests in CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90388
Approved by: https://github.com/atalman, https://github.com/ZainRizvi
2022-12-07 17:41:54 +00:00
824641b083 [Quant][fx][bc-breaking] Make convert.py smaller (#90189)
Summary: This commit moves helper functions that are not core
to the convert logic out of convert.py, which was more than
1000 lines. This helps with readability since a new developer
won't have to scroll through hundreds of lines of util functions
to understand the core logic. There should be no change in
functionality in this commit.

BC-breaking notes: The following helper functions that were
previously exposed under the `torch.ao.quantization.fx.convert`
namespace are now made private. Many of these are moved to the
new convert_utils.py
```
convert_custom_module
convert_standalone_module
convert_weighted_module
get_module_path_and_prefix,
has_none_qconfig,
insert_dequantize_node,
is_conversion_supported,
maybe_recursive_remove_dequantize,
replace_observer_or_dequant_stub_with_dequantize_node,
restore_state,
run_weight_observers,
```

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Reviewers: jerryzh168, vkuzo

Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90189
Approved by: https://github.com/jerryzh168
2022-12-07 16:16:25 +00:00
99fb39f508 reland #89243: [Composable API] replicate: add support for DDP args (#90255)
reland https://github.com/pytorch/pytorch/pull/89243
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90255
Approved by: https://github.com/zhaojuanmao
2022-12-07 15:22:33 +00:00
e6a7278753 Give std/var correction overloads proper defaults (#56398)
The correction overloads defaults were left off for forward
compatibility reasons, but this FC window expired well over a year ago
at this point, so the defaults can now be filled in.
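As a usage note, a small illustrative snippet assuming the defaults are in place:
```
import torch

x = torch.randn(100)
torch.var(x, correction=1)  # Bessel-corrected (unbiased); same as the default
torch.var(x, correction=0)  # biased estimator
```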

Differential Revision: [D29625593](https://our.internmc.facebook.com/intern/diff/D29625593)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56398
Approved by: https://github.com/mruberry
2022-12-07 15:15:00 +00:00
b0bd5c4508 [MPS] Fix median_out_mps caching (#90326)
We should cache graph based on input tensor type

Fixes https://github.com/pytorch/pytorch/issues/90311

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90326
Approved by: https://github.com/kulinseth
2022-12-07 07:24:58 +00:00
85ae28b454 Reformat optim import (#90294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90294
Approved by: https://github.com/awgu
2022-12-07 07:11:12 +00:00
15949fc248 [ROCm] Enable few test_prim UTs for ROCm (#88983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88983
Approved by: https://github.com/IvanYashchuk, https://github.com/jeffdaily, https://github.com/malfet
2022-12-07 06:21:31 +00:00
26d1dbc4f8 [inductor] More correct check for fbcode environment (#90312)
Summary:
importing torch.fb seemed like a good idea, but we don't always have
torch.fb inside fbcode.  Testing for torch.version.git_version is more
reliable, since we'll never have a git_version inside fbcode, which is an hg
repo.

Test Plan: `buck2 run mode/dev-nosan //caffe2/test/inductor:smoke`

Reviewed By: soumith, jansel

Differential Revision: D41777058

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90312
Approved by: https://github.com/soumith
2022-12-07 04:50:11 +00:00
351d73b97f Fix exception causes all over the codebase (#90271)
This is the continuation of #90134 and hopefully the final PR in this series.
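For context, "fixing exception causes" refers to the standard Python chaining pattern, sketched below:
```
try:
    value = int("not-a-number")
except ValueError as exc:
    # Chain the original exception instead of discarding its traceback.
    raise RuntimeError("failed to parse the value") from exc
```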

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90271
Approved by: https://github.com/kit1980
2022-12-07 04:29:00 +00:00
8f079b895b [Dynamo+FSDP] Update benchmarks with use_orig_params=True (#90100)
After https://github.com/pytorch/pytorch/pull/89523, we now need to assert use_orig_params=True, even in the non-recursive case where (I think) we wouldn't otherwise need to run with use_orig_params=True.

Tested with `python benchmarks/dynamo/torchbench.py --training --accuracy --only hf_T5 --fsdp`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90100
Approved by: https://github.com/wconstab
2022-12-07 03:33:58 +00:00
898b46d6cc [Dynamo][Easy] capture more exceptions when import skip modules (#90338)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90338
Approved by: https://github.com/williamwen42
2022-12-07 02:05:39 +00:00
71f27f7688 [Inductor] More robust stride and offset extraction from index expressions (#90184)
Currently the stride and offset are determined by substituting 1 and 0 for
different indices, which will fail for any expression that doesn't match the
expected stride calculation. Instead, this uses `sympy.match` and returns `None`
for any variables used in non-standard index calculation, e.g. `torch.roll`
which uses `ModularIndexing`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90184
Approved by: https://github.com/jansel
2022-12-07 01:43:21 +00:00
4f44877983 [Inductor] Add test for Scheduler fusions (#90014)
Currently there is `test_vertical_fusion1` which fuses entirely during
the lowering stage and no buffers are realized. This adds
`test_scheduler_vertical_fusion1` which is the same test but with
several intermediate calculations realized so the scheduler is left
to do the fusion.

To support the test, this PR also adds:
- `metrics.ir_nodes_pre_fusion` which when compared with
`generated_kernel_count` tells us how many nodes were fused.
- `torch._test_inductor_realize` which is an identity operator in
eager, but under inductor also forces the input to be realized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90014
Approved by: https://github.com/jansel
2022-12-07 01:33:25 +00:00
13fcc412be [Quant][fx][bc-breaking] Remove unused functions in fx/utils.py (#90025)
Summary and BC-breaking notes: This commit removes the following
unused functions from both the `torch.quantization` and the
`torch.ao.quantization` namespaces:

```
graph_pretty_str
get_per_tensor_qparams
quantize_node
get_qconv_op
create_qparam_nodes
node_return_type_is_int
is_get_tensor_info_node
```

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestAOMigrationQuantizationFx

Reviewers: jerryzh168, vkuzo

Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90025
Approved by: https://github.com/HDCharles
2022-12-07 01:31:28 +00:00
f28927e9c4 Revert "[MPS] Fix median_out_mps caching (#90326)"
This reverts commit 23c192c3df2fd53a2110d179eabb549ceb7beeef.

Reverted https://github.com/pytorch/pytorch/pull/90326 on behalf of https://github.com/malfet due to Modified wrong key
2022-12-07 00:43:31 +00:00
887249b2bb [quant] Add fused "q - qlinear - dq" operator with skipped quant op for output of linear (#89882)
Summary:
Added two ops:
* torch.ops.quantized.linear_with_input_q_dq_qweight_dq_output_fp32
* torch.ops.quantized.linear_with_input_q_dq_qweight_dq_relu_output_fp32

corresponding pattern for `linear_with_input_q_dq_qweight_dq_output_fp32` would be:
```
input -> q* -> dq* -> linear* ->
           qweight -> dq* /
```
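For readers less familiar with the notation, a rough fp32-reference reading of that pattern is sketched below; this is illustrative Python using standard quantization ops, not the signature of the fused op:
```
import torch
import torch.nn.functional as F

def reference_pattern(x, qweight, bias, scale, zero_point):
    xq = torch.quantize_per_tensor(x, scale, zero_point, torch.quint8)  # q*
    x_dq = xq.dequantize()                                              # dq*
    w_dq = qweight.dequantize()                                         # qweight -> dq*
    return F.linear(x_dq, w_dq, bias)                                   # linear*, fp32 output
```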

Test Plan:
python test/test_quantization.py -k TestQuantizedLinear.test_qlinear_with_input_q_dq_qweight_dq

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89882
Approved by: https://github.com/vkuzo
2022-12-07 00:10:19 +00:00
22e363348c [Vulkan] Partially fix and then disable copying of vulkan quantized tensors to cpu (#90275)
Summary:
Before this diff, copying of vulkan quantized tensors to cpu was broken. This was mainly caused because the shader only works properly with specific global and local work group sizes, and those specific sizes had been modified in earlier refactoring.

As part of this fix, an optimized version of the shader that performs the copying was written, to take advantage of the special case where the plane size (x*y) is a multiple of 4.

After fixing this, and writing comprehensive tests, it was discovered that the copying still has issues on Android for specific input sizes, e.g. [1, 1, 11, 17]. These issues are currently unresolved, so, copying of quantized vulkan tensors to cpu has been disabled.

What is contained in this diff?
- Fix for existing issue
- New optimized shader (image_to_nchw_quantized_mul4)
- New comprehensive tests (which have been disabled)
- Disable the copying of quantized vulkan tensors to cpu until issues on Android are fixed.

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Reviewed By: kimishpatel

Differential Revision: D41047098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90275
Approved by: https://github.com/kimishpatel
2022-12-06 23:33:52 +00:00
23c192c3df [MPS] Fix median_out_mps caching (#90326)
We should cache graph based on input tensor type

Fixes https://github.com/pytorch/pytorch/issues/90311

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90326
Approved by: https://github.com/kulinseth
2022-12-06 23:21:54 +00:00
b769005924 [fx][passes] Implement annotate getitem node FX passes (#90237)
Summary: One common cause of jit unscriptability issues is the loss of node type annotations on local names after one or several FX transforms. One way to improve the type coverage is to eagerly annotate the type for `getitem` nodes from their parent sequence node. This diff introduces an fx pass to do that.

Test Plan:
```
buck2 test //caffe2/test:fx_experimental
```

Reviewed By: xush6528

Differential Revision: D41749744

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90237
Approved by: https://github.com/xush6528
2022-12-06 23:18:55 +00:00
0e182c9441 [quant][fx] Add support for matching constant in the custom matcher code in quantization (#90092)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_pattern_match_constant

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90092
Approved by: https://github.com/jcaip
2022-12-06 22:47:41 +00:00
5caa27a3fd as_strided: Fix default storage_offset for reference implementation (#89513)
This fixes the default storage_offset to take it from the input. This was
previously untested, so I've also added a new OpInfo which includes samples with
non-zero storage_offsets on the input tensor.
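A small illustrative example of the behavior under test (assumed, not taken from the new OpInfo samples):
```
import torch

base = torch.arange(10.0)
x = base[2:]                          # x.storage_offset() == 2
y = torch.as_strided(x, (4,), (1,))   # storage_offset omitted
# With the fix, the reference matches eager: the offset is inherited from x
# rather than silently reset to 0.
assert y.storage_offset() == x.storage_offset()
```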
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89513
Approved by: https://github.com/ezyang, https://github.com/ngimel
2022-12-06 22:39:21 +00:00
3d4b92b171 Ensure that we fakeify tensor subclasses when they are initially tracked (#90009)
The old code didn't actually fakeify traceable tensor subclasses at the
time they are added as a GraphArg to the module; now we do, by ignoring
the subclass during fakeification and relying on Dynamo to simulate
the subclass on top.  See comments for more details.

BTW, this codepath is super broken, see filed issues linked on the
inside.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90009
Approved by: https://github.com/wconstab, https://github.com/voznesenskym
2022-12-06 22:36:32 +00:00
f09e7b5ce7 Replace assertEqualIgnoreType in test_nn.py (#90242)
See https://github.com/pytorch/pytorch/issues/38095.

Also removed some redundant separate `dtype` checks when `dtype` is already checked by the next line's `assertEqual`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90242
Approved by: https://github.com/malfet
2022-12-06 22:34:01 +00:00
6c195881b1 [CI] Relax CMake requirements (#90307)
To `3.22.*` as cmake-3.22.1 is available on conda, but not on
conda-forge see
https://anaconda.org/conda-forge/cmake/files?version=3.22.2 but https://anaconda.org/anaconda/cmake/files?version=3.22.1

Also, for whatever reason we already specify cmake dependency in
acaef1ae39/.github/actions/setup-miniconda/action.yml (L172)
so may be it could be removed from this file already

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90307
Approved by: https://github.com/kit1980
2022-12-06 22:32:50 +00:00
3b9a386d48 Add TORCH_FAKE_TENSOR_DEBUG use it to enable storage of traces on fake tensors at init time (#90215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90215
Approved by: https://github.com/ezyang
2022-12-06 22:28:52 +00:00
d224ac7f77 Remove logging.CODE (#90234)
Fixes https://github.com/pytorch/torchdynamo/issues/1932

Discussed with @mlazos: if we still want to separate streams for code logging and the rest of info, we can use a separate logger object with a unique name.
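A minimal sketch of that alternative (the logger name here is hypothetical):
```
import logging

code_log = logging.getLogger("torch._dynamo.output_code")  # hypothetical name
code_log.setLevel(logging.INFO)
code_log.info("generated code can be routed to its own handler/stream")
```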

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90234
Approved by: https://github.com/ezyang
2022-12-06 22:24:43 +00:00
14894a7311 Remove non-existing parameter from docstring (#90163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90163
Approved by: https://github.com/clee2000
2022-12-06 22:22:17 +00:00
7e9a8a1361 Disable dynamo tracing torchrec.distributed (#90087)
Summary: Context at T138318923

Test Plan: mannual test

Reviewed By: yf225

Differential Revision: D41631076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90087
Approved by: https://github.com/yf225
2022-12-06 22:17:16 +00:00
27ad2605c8 Hotfix to unblock TRT unit tests internally (#90313)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Export of [D41778303](https://www.internalfb.com/diff/D41778303)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90313
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-06 22:14:37 +00:00
62e450d55f [CUDA Graphs] Add option to dump a captured graph for debugging (#85519)
CC @xwang233 @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85519
Approved by: https://github.com/ngimel
2022-12-06 22:03:05 +00:00
1abe264ef0 [Upstream _NamedOptimzer] Reland PR (89480) (#90293)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Reland https://github.com/pytorch/pytorch/pull/89480/
* #90294
* __->__ #90293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90293
Approved by: https://github.com/awgu
2022-12-06 21:47:12 +00:00
7436b19eb2 [FSDP] Clarify loss dtype check in _test_fsdp_parity (#90251)
A recent PR deprecated `torch.testing.assert_allclose` in favor of `torch.testing.assert_close` and left a `TODO`. This PR follows up to confirm that we do intend to have `check_dtype=False`.
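For reference, what `check_dtype=False` permits (illustrative snippet):
```
import torch
from torch.testing import assert_close

assert_close(
    torch.tensor([1.0], dtype=torch.float16),
    torch.tensor([1.0], dtype=torch.float32),
    check_dtype=False,  # values must still match; dtypes may differ
)
```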
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90251
Approved by: https://github.com/rohan-varma
2022-12-06 21:28:40 +00:00
919e09f26a [FSDP][BE] Clean up dead code from clip_grad_norm_() testing (#90250)
`FSDP.clip_grad_norm_()` is tested separately in `test_fsdp_clip_grad_norm.py`. This PR removes the dead non-run code from `common_fsdp.py` and `test_fsdp_core.py` related to `clip_grad_norm_()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90250
Approved by: https://github.com/rohan-varma
2022-12-06 21:28:40 +00:00
3b578edd04 [FSDP] Test use_orig_params=True in test_fsdp_ignored_modules.py (#90290)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90290
Approved by: https://github.com/zhaojuanmao
2022-12-06 21:28:40 +00:00
25f39c1bce Fix uniform ref implementation (#90094)
Fixes https://github.com/pytorch/torchdynamo/issues/1954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90094
Approved by: https://github.com/ngimel
2022-12-06 21:28:17 +00:00
a1ab06ab65 ShapeEnv.create_symbolic_sizes_strides_storage_offset (#89962)
Instead of having storage offset hang out on its own, allocate
all of these symbols all in one go.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89962
Approved by: https://github.com/albanD, https://github.com/voznesenskym
2022-12-06 21:27:02 +00:00
e818c36647 reland #89222: [Composable API] replicate: change to per module call, remove mark_root_module() (#90254)
reland https://github.com/pytorch/pytorch/pull/89222
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90254
Approved by: https://github.com/zhaojuanmao
2022-12-06 21:17:53 +00:00
bd9ad89a6d [FSDP] Fix accidental change in _test_fsdp_parity (#90252)
I accidentally changed the semantics of this line when refactoring a while ago. The [previous version](https://github.com/pytorch/pytorch/pull/80873/files#diff-7b5c66f99161fa6a3d9042e80f8c8cc140a64e43445feede46f55e53154f6c3dL635) used to say:
```
if not mixed_precision:
```
which is actually the opposite of
```
if mixed_precision is not None:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90252
Approved by: https://github.com/zhaojuanmao
2022-12-06 20:13:21 +00:00
ce21262808 Log1p for complex in CPU (#89691)
Another PR for https://github.com/pytorch/pytorch/issues/89205: making torch.log1p accept complex numbers on CPU.
I haven't done the GPU version because I'm not sure which file(s) to change.
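A quick illustrative check of the new behavior (assuming a build that includes this change):
```
import torch

z = torch.tensor([0.5 + 0.5j])
torch.log1p(z)  # equivalent to torch.log(1 + z), now accepted for complex CPU tensors
```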

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89691
Approved by: https://github.com/jgong5, https://github.com/lezcano
2022-12-06 19:12:24 +00:00
9e314bd822 [dtensor] handle the case where output of op is Optional[Tensor] (#90241)
As observed by @aazzolini, some ops might have Optional[Tensor] returns
where they return None (e.g. native_layer_norm_backward). It's a mismatch
between the C++ aten op signature and Python's None, so we need to handle it
on the Python side.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90241
Approved by: https://github.com/aazzolini
2022-12-06 18:17:20 +00:00
eace084815 Use Sized not Iterable to test for len (#90182)
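A short illustration of why `Sized` is the right ABC to gate `len()` on:
```
from collections.abc import Iterable, Sized

gen = (i for i in range(3))
isinstance(gen, Iterable)  # True, yet len(gen) would raise TypeError
isinstance(gen, Sized)     # False, so Sized is the correct check before calling len()
```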
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90182
Approved by: https://github.com/albanD
2022-12-06 16:13:14 +00:00
c6942dbbfb add shape check for random_samples in fractional_max_pool{2d|3d} (#89992)
This PR adds shape checks for `random_samples` in fractional_max_pool2d and fractional_max_pool3d,
to provide more meaningful warnings instead of a SegFault when the input is illegal.

For more details, please check https://github.com/pytorch/pytorch/issues/89648
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89992
Approved by: https://github.com/jgong5, https://github.com/ezyang
2022-12-06 14:14:41 +00:00
be5108d5f9 replace memset with value-initialization (#90048)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/90048).
* #89865
* #89852
* #89851
* __->__ #90048

replace memset with value-initialization

Summary:
This is equivalent to zero initialization for any members that are
scalar or have implicit default constructors.

Note that aside from the reset at the beginning, blockmask and
philox_args are not touched by this function.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90048
Approved by: https://github.com/drisspg, https://github.com/malfet
2022-12-06 13:48:05 +00:00
97e47a52b8 [Quant] Add fused linear-leaky_relu op for onednn backend (#88478)
**Summary**
Post-op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused `linear-leaky_relu` op for the `onednn` backend, which will be used for int8 inference with the `onednn` backend. This op cannot be called with other quantization backends; otherwise an error is thrown.

**Test Plan**
python test_quantization.py TestQuantizedLinear

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88478
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2022-12-06 08:32:59 +00:00
41bfa49db9 [ONNX] Add src/index dynamic axes support for aten::scatter_add (#90090)
Extends #89787. Per the answer from https://github.com/onnx/onnx/issues/4672, dynamically catching the shape of index lets the converter further support this op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90090
Approved by: https://github.com/BowenBao
2022-12-06 07:56:20 +00:00
176b962f4b Revert "[PT-D][Composability][1/N] Upstream NamedOptimizer from TorchRec (KeyedOptimizer in TR) (#89480)"
This reverts commit 31ec1a1ef7032508fc36f0b70692832acbeed72d.

Reverted https://github.com/pytorch/pytorch/pull/89480 on behalf of https://github.com/kit1980 due to Broke test_correct_module_names
2022-12-06 07:22:37 +00:00
3c9431f505 Add factory functions to python frontend (#89230)
- Add `full` nvprim to support factory functions because the full reference uses `empty` and `fill` while we have a full factory function.
- Change `full_like` reference to call `full` to avoid defining another nvprim.
- Enable support for new_zeros to enable `cudnn_batch_norm` decomposition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89230
Approved by: https://github.com/kevinstephano, https://github.com/mruberry
2022-12-06 07:16:21 +00:00
e645771e95 Revert "as_strided: Fix default storage_offset for reference implementation (#89513)"
This reverts commit ba70a8be03f2fca222deee030bf7d9d15260b549.

Reverted https://github.com/pytorch/pytorch/pull/89513 on behalf of https://github.com/kit1980 due to Broke multiple workflows, 2 unexpected successes for autograd tests
2022-12-06 07:14:16 +00:00
44dac51c36 Improve Autograd Documentation Clarity (#89401)
This makes minor adjustments to the autograd docs, improving clarity and resolving grammatical errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89401
Approved by: https://github.com/kit1980
2022-12-06 06:45:04 +00:00
49ccc41d57 [Vulkan] Enable QInt8 and QInt32 quantization (#89788)
Summary: Enabled Vulkan quantization for dtypes QInt8 and QInt32

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Differential Revision: D41561661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89788
Approved by: https://github.com/digantdesai
2022-12-06 06:27:40 +00:00
45b40be078 [FSDP()] Fix fully_shard fwd hook registration (#90201)
I need to rebase later after Shen's PRs land.

The idea is to only register the pre/post-forward hook on the _root modules_ among the modules that consume a `FlatParameter`. (Yes, the term _root module_ is heavily overloaded. We may want to clarify that at some point. Here, _root_ is being used in the graph sense, meaning parent-less, and the scope is only among the modules consuming a `FlatParameter`.)

This avoids unnecessary pre/post-forward hooks running, which would lead to errors because the unshard is not truly idempotent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90201
Approved by: https://github.com/mrshenli, https://github.com/rohan-varma
2022-12-06 06:09:03 +00:00
2b7fcfa399 fix: Moving operators to FuncTorchBatchedDecomposition (#89762)
I've moved over some of the easy-to-move operators and removed an xfail.

I found this from the test that I implemented in https://github.com/pytorch/pytorch/pull/89465

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89762
Approved by: https://github.com/zou3519
2022-12-06 05:59:47 +00:00
bb673fb1d9 fix: update error when tensor escapes vmap (#89077)
Fixes https://github.com/pytorch/functorch/issues/1054

@zou3519, I played around with it, but I am unsure of how to repro the cases for gen_vmap_inplace_plumbing and below in gen_vmap_plumbing_no_returns

I've also seen that there are 24 other instances of the `TORCH_INTERNAL_ASSERT(maybe_layer.has_value());` assert, should I change all of these and add tests?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89077
Approved by: https://github.com/zou3519
2022-12-06 05:52:09 +00:00
2c2cce73d4 [dtensor] remove torchgen function schema and parse manually (#90106)
This PR gets rid of torchgen FunctionSchema parsing and parses
the schema manually; it should resolve the torchgen package issue and also
provide some perf wins when running DTensor eagerly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90106
Approved by: https://github.com/awgu
2022-12-06 05:45:00 +00:00
a0c7b88861 remove backward hook in memory_tracker (#90143)
Remove the backward hook in memory_tracker, as it does not work well with jagged tensors in some cases. It is OK to remove this hook for now as it does not really track any stats.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90143
Approved by: https://github.com/rohan-varma
2022-12-06 05:39:59 +00:00
6bbcd025bd Fix issue 38095 TODO in onnx/test_utility_funs.py (#90085)
Fix TODO related to https://github.com/pytorch/pytorch/issues/38095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90085
Approved by: https://github.com/BowenBao
2022-12-06 05:29:50 +00:00
508916128d [ReduceOp] ameliorate custom __eq__ (#90088)
Improve the completeness of `ReduceOp.__eq__`.

Should support the equal operator with the first argument of `RedOpType` and the second of `ReduceOp` in a follow-up.

Fixes #90072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90088
Approved by: https://github.com/kwen2501
2022-12-06 05:13:50 +00:00
2d9267ba30 [dynamo] Rewrite addcdiv in dynamo to its constituent ops (#90227)
This avoids a graph break when `value` is used. This fixes a graph break in the variants of Adam and Adagrad optimizers.
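The decomposition in question follows the standard addcdiv identity; a quick illustrative check:
```
import torch

def addcdiv_decomposed(input, tensor1, tensor2, value=1.0):
    # torch.addcdiv(input, t1, t2, value=v) == input + v * (t1 / t2)
    return input + value * (tensor1 / tensor2)

x, a, b = torch.randn(3), torch.randn(3), torch.rand(3) + 1.0
torch.allclose(torch.addcdiv(x, a, b, value=0.5), addcdiv_decomposed(x, a, b, 0.5))  # True
```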

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90227
Approved by: https://github.com/jansel
2022-12-06 05:08:44 +00:00
77f9b2e8bf Fix exception causes in fx, nn and onnx packages (#90134)
This is a continuation of #90118

@kit1980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90134
Approved by: https://github.com/kit1980
2022-12-06 04:34:58 +00:00
31ec1a1ef7 [PT-D][Composability][1/N] Upstream NamedOptimizer from TorchRec (KeyedOptimizer in TR) (#89480)
In PyTorch, the optim state_dict always uses numbers to index the optimizer state_dict for parameters.

Now the composability workstream needs an FQN-based way to index the optimizer state_dict for parameters.

For example, SGD optimizer might have something in its `state_dict` like:

```
{'state':
  {0:
    {'momentum_buffer': tensor(...)},
  {1:
    {'momentum_buffer': tensor(...)},
  ...
}
'param_groups':
    [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': [0, 1, 2, 3, 4, 5, 6, 7]}]
}
```

And in NamedOptimizer we want the `state_dict` can be:

```
{'state':
  {'net1.0.weight':
    {'momentum_buffer': tensor(...)},
  {'net1.0.bias':
    {'momentum_buffer': tensor(...)},
  ...
}
'param_groups':
    [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': ['net1.0.weight', 'net1.0.bias', 'net2.0.weight', 'net2.0.bias', 'net3.weight', 'net3.bias', 'net4.1.weight', 'net4.1.bias']}]
}
```

We also want to support load_state_dict to enable optim `state_dict` override for NamedOptimizer.
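A minimal sketch of the re-keying idea on a plain model with SGD (illustrative only, not the NamedOptimizer implementation):
```
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 4))
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model(torch.randn(2, 4)).sum().backward()
opt.step()  # populate momentum buffers

fqns = [name for name, _ in model.named_parameters()]  # ['0.weight', '0.bias']
sd = opt.state_dict()
sd["state"] = {fqns[i]: v for i, v in sd["state"].items()}
for group in sd["param_groups"]:
    group["params"] = [fqns[i] for i in group["params"]]
```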

For the next couple of PRs/diffs, we also need to:
1. Make `NamedOptimizer` work with FSDP (e.g. registering a hook for a model wrapped with FSDP) and other PTD/PT components.
2. Make `NamedOptimizer` work well with apply_optim_in_backward.
3. Also upstream `CombinedOptimizer`.

Differential Revision: [D41432088](https://our.internmc.facebook.com/intern/diff/D41432088/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41432088/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89480
Approved by: https://github.com/rohan-varma
2022-12-06 04:34:19 +00:00
cee396fa07 [ao][ns] PNP demo for exposing arbitrary model transforms (#90153)
Adds a way to use arbitrary prepare and convert functions with PNP.

note this is a recreation of https://github.com/pytorch/pytorch/pull/89892 which was reverted due to landing not syncing between github and fbcode

python test/test_quantization.py
TestFxNumericSuiteNShadows.test_custom_functions_and_tracer

Differential Revision: [D41723892](https://our.internmc.facebook.com/intern/diff/D41723892/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90153
Approved by: https://github.com/vkuzo
2022-12-06 04:24:54 +00:00
42705bd7b3 Disallow registering meta function for CompositeImplicitAutograd ops (#90222)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90222
Approved by: https://github.com/ezyang
2022-12-06 04:22:31 +00:00
a88400e0cc pad low precision matmuls when requested (#90235)
Matmul padding is beneficial not only for fp32; fp16/bf16 with amp can benefit as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90235
Approved by: https://github.com/jiawenliu64
2022-12-06 04:13:24 +00:00
ba70a8be03 as_strided: Fix default storage_offset for reference implementation (#89513)
This fixes the default storage_offset to take it from the input. This was
previously untested, so I've also added a new OpInfo which includes samples with
non-zero storage_offsets on the input tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89513
Approved by: https://github.com/ezyang, https://github.com/ngimel
2022-12-06 04:07:16 +00:00
05ccbd6d94 Functionalization: skip meta block computation if compute_reference_meta is false (#90219)
Skip computing meta block when `compute_reference_meta` is `False`.

Issue: #89914

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90219
Approved by: https://github.com/ezyang
2022-12-06 04:03:01 +00:00
962ebe88a2 Assert there are no outstanding side effects before calling cond (#90208)
The current cond implementation is silently incorrect when
there are outstanding side effects, since the locally tracked
side effects are lost when the recursive export call is made.
At least we raise an assert now.

I'm working on a refactor of cond which should be able to sidestep
this problem. Maybe.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D41746973](https://our.internmc.facebook.com/intern/diff/D41746973)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90208
Approved by: https://github.com/voznesenskym
2022-12-06 03:53:48 +00:00
0d8e53dfe7 Revert "[Composable API] replicate: change to per module call, remove mark_root_module() (#89222)"
This reverts commit 65a0dcffd8d387bb8c90216e63fdabb6e33e4e4d.

Reverted https://github.com/pytorch/pytorch/pull/89222 on behalf of https://github.com/malfet due to Included unintended submodule updates
2022-12-06 03:26:28 +00:00
73565ce320 [vision hash update] update the pinned vision hash (#90239)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90239
Approved by: https://github.com/pytorchbot
2022-12-06 03:25:17 +00:00
3749b9dc73 Revert "[Composable API] replicate: add support for DDP args (#89243)"
This reverts commit 0f274ed385d676cb28c792ca104114ca63210055.

Reverted https://github.com/pytorch/pytorch/pull/89243 on behalf of https://github.com/malfet due to Depends on https://github.com/pytorch/pytorch/pull/89222 that introduced spurious module updates
2022-12-06 03:22:18 +00:00
2597d5d722 TorchDynamo: always convert flexiblelayout to be FixedLayout when given a stride_order (#89904)
For convolution, we always call **require_stride_order** to convert the input to the target stride order. If the original input's layout is flexiblelayout, there is always a memory copy because **is_stride_order_storage_and_layout** only checks the initial stride order. For flexiblelayout, I think the layout can still be changed, so if the user gives a stride order, we should always convert the flexiblelayout to FixedLayout using the given stride order.

Given a CV use case where the max_pooling's output is used by two convolutions, there are two memory copies:

```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1,
                       float* __restrict__ out_ptr2)
{
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<3; i2+=1)
            {
                #pragma GCC ivdep
                for(long i3=0; i3<3; i3+=1)
                {
                    {
                        {
                            auto tmp0 = in_ptr0[i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp1 = in_ptr0[3 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp3 = in_ptr0[6 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp5 = in_ptr0[21 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp7 = in_ptr0[24 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp9 = in_ptr0[27 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp11 = in_ptr0[42 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp13 = in_ptr0[45 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp15 = in_ptr0[48 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp2 = (tmp0 != tmp0) ? tmp0 : std::max(tmp1, tmp0);
                            auto tmp4 = (tmp2 != tmp2) ? tmp2 : std::max(tmp3, tmp2);
                            auto tmp6 = (tmp4 != tmp4) ? tmp4 : std::max(tmp5, tmp4);
                            auto tmp8 = (tmp6 != tmp6) ? tmp6 : std::max(tmp7, tmp6);
                            auto tmp10 = (tmp8 != tmp8) ? tmp8 : std::max(tmp9, tmp8);
                            auto tmp12 = (tmp10 != tmp10) ? tmp10 : std::max(tmp11, tmp10);
                            auto tmp14 = (tmp12 != tmp12) ? tmp12 : std::max(tmp13, tmp12);
                            auto tmp16 = (tmp14 != tmp14) ? tmp14 : std::max(tmp15, tmp14);
                            out_ptr0[i3 + (3*i2) + (9*i1) + (27*i0)] = tmp16;
                        }
                    }
                }
            }
        }
    }
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<9; i2+=1)
            {
                {
                    {
                        auto tmp0 = out_ptr0[i1 + (3*i2) + (27*i0)];
                        out_ptr1[i1 + (3*i2) + (27*i0)] = tmp0;
                        out_ptr2[i1 + (3*i2) + (27*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    buf0 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    buf2 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    buf4 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg4_1.data_ptr()), c_void_p(buf0.data_ptr()), c_void_p(buf2.data_ptr()), c_void_p(buf4.data_ptr()))
    del arg4_1
    del buf0
    buf3 = torch.ops.mkldnn._convolution_pointwise(buf2, arg0_1, arg1_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf3, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg0_1
    del arg1_1
    del buf2
    buf5 = torch.ops.mkldnn._convolution_pointwise(buf4, arg2_1, arg3_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf5, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg2_1
    del arg3_1
    return (buf3, buf5, )
```

After this PR, the generated  code will remove the redundant memory copy:

```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<3; i2+=1)
            {
                #pragma GCC ivdep
                for(long i3=0; i3<3; i3+=1)
                {
                    {
                        {
                            auto tmp0 = in_ptr0[i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp1 = in_ptr0[3 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp3 = in_ptr0[6 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp5 = in_ptr0[21 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp7 = in_ptr0[24 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp9 = in_ptr0[27 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp11 = in_ptr0[42 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp13 = in_ptr0[45 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp15 = in_ptr0[48 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp2 = (tmp0 != tmp0) ? tmp0 : std::max(tmp1, tmp0);
                            auto tmp4 = (tmp2 != tmp2) ? tmp2 : std::max(tmp3, tmp2);
                            auto tmp6 = (tmp4 != tmp4) ? tmp4 : std::max(tmp5, tmp4);
                            auto tmp8 = (tmp6 != tmp6) ? tmp6 : std::max(tmp7, tmp6);
                            auto tmp10 = (tmp8 != tmp8) ? tmp8 : std::max(tmp9, tmp8);
                            auto tmp12 = (tmp10 != tmp10) ? tmp10 : std::max(tmp11, tmp10);
                            auto tmp14 = (tmp12 != tmp12) ? tmp12 : std::max(tmp13, tmp12);
                            auto tmp16 = (tmp14 != tmp14) ? tmp14 : std::max(tmp15, tmp14);
                            out_ptr0[i3 + (3*i2) + (9*i1) + (27*i0)] = tmp16;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    buf0 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg4_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg4_1
    buf2 = torch.ops.mkldnn._convolution_pointwise(buf0, arg0_1, arg1_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf2, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg0_1
    del arg1_1
    buf3 = torch.ops.mkldnn._convolution_pointwise(buf0, arg2_1, arg3_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf3, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg2_1
    del arg3_1
    return (buf2, buf3, )

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89904
Approved by: https://github.com/jansel
2022-12-06 03:07:53 +00:00
29233a18c7 [inductor] Add test_ops_gradients running with inductor (#89792)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89792
Approved by: https://github.com/janeyx99, https://github.com/clee2000, https://github.com/huydhn
2022-12-06 02:26:29 +00:00
ebeecbf833 Dynamo FX graph stack traceback fix (#87136)
Migration from https://github.com/pytorch/torchdynamo/pull/1655.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87136
Approved by: https://github.com/voznesenskym
2022-12-06 02:22:16 +00:00
a268b9e53c Fix yet another C++17 Windows build issue (#90228)
Not sure why, but a top-level `using namespace` directive causes VC++ to fail with the following error (if the C++17 standard is used, but everything is fine with C++14):
```
C:\actions-runner\_work\pytorch\pytorch\third_party\pybind11\include\pybind11\detail\../pytypes.h(1520): error C2872: 'attr': ambiguous symbol
C:\actions-runner\_work\pytorch\pytorch\aten\src\ATen/core/interned_strings.h(349): note: could be 'c10::attr'
C:\actions-runner\_work\pytorch\pytorch\torch/csrc/jit/ir/ir.h(75): note: or       'torch::jit::attr'
C:\actions-runner\_work\pytorch\pytorch\cmake\..\third_party\pybind11\include\pybind11/pybind11.h(1094): note: see reference to function template instantiation 'pybind11::str pybind11::str::format<_Ty1&>(_Ty1 &) const' being compiled
        with
        [
            _Ty1=pybind11::handle
        ]
```

Solve this by replacing the global `using namespace torch::jit;` with
specific usages of objects/methods from the relevant namespaces.

Another prep change for https://github.com/pytorch/pytorch/70188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90228
Approved by: https://github.com/kit1980, https://github.com/albanD
2022-12-06 01:35:19 +00:00
55b10e6b1d [Pytorch][Vulkan] Use specalized shader for 3x3 depthwise conv (#89953)
This diff uses specialized implementation for 3x3 and 5x5 dw conv.

Differential Revision: [D41006638](https://our.internmc.facebook.com/intern/diff/D41006638/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89953
Approved by: https://github.com/salilsdesai, https://github.com/kirklandsign
2022-12-06 00:56:57 +00:00
a17765a127 [Pytorch][Vulkan] Templatize depth wise convolution and specialize for 3x3 and 5x5 (#89952)

This diff does not yet integrate with the runtime.

Differential Revision: [D41006640](https://our.internmc.facebook.com/intern/diff/D41006640/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89952
Approved by: https://github.com/salilsdesai
2022-12-06 00:54:59 +00:00
bd456fb549 [Pytorch][Vulkan] shader codegen use ordered dictionary (#89951)
When not using an ordered dictionary, parameter values can end up in a
different order for each specialization. This can result in shader names that are
not consistent in their naming, and in the meaning of the template parameter values
that appear in those names.
For example, if you have:
For example if you have:
conv2d_pw:
  default_values:
   - X: 1
   - Y: 2
  parameter_values:
   - Y: 3

The default parameter values can generate a shader named 'my_shader_1x2', where 1x2 is
for the X, Y parameters respectively. Then,
for non-default values, of which there is only one, we have Y=3, and with the existing
implementation you can end up generating a shader named 'my_shader_3x1'. Here 3 is
for Y and 1 is for X. This leads to confusing shader names.

This diff fixes this by:
1. using an ordered dict.
2. updating non-default values by first copying the default values and then
updating them.

Differential Revision: [D41006639](https://our.internmc.facebook.com/intern/diff/D41006639/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89951
Approved by: https://github.com/salilsdesai
2022-12-06 00:49:35 +00:00
cb68dcbd6b [Pytorch][vulkan] Simplify depthwise conv to remove bounds compute (#89950)
Right now we are doing a bounds check and reducing compute according to that bounds
check. However, this can lead to thread divergence.
Furthermore, since textures provide handling of the border region, it should be safe
to use negative indexing.

Differential Revision: [D41006645](https://our.internmc.facebook.com/intern/diff/D41006645/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89950
Approved by: https://github.com/salilsdesai
2022-12-06 00:47:17 +00:00
876b70245a [Vulkan] output benchmark numbers for aibench parsing (#89949)
Add this util so as to easily benchmark shaders and summarize the output.
Eventually the shader benchmarking should obsolete the need for this.

Differential Revision: [D41244028](https://our.internmc.facebook.com/intern/diff/D41244028/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41244028/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89949
Approved by: https://github.com/digantdesai, https://github.com/salilsdesai
2022-12-06 00:01:49 +00:00
841eba6382 [pytorch][vulkan] realistic benchmark size for depthwise (#89948)
Update the benchmark size to use bigger tensors to get more realistic numbers.

Differential Revision: [D41006643](https://our.internmc.facebook.com/intern/diff/D41006643/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89948
Approved by: https://github.com/digantdesai, https://github.com/salilsdesai
2022-12-05 23:59:25 +00:00
564905c8e1 [Caffe2] Fix the assert message (#89816)
Summary:
As title.
dev1/2 is invalid. It should be dev_1/2 instead

Test Plan: Sandcastle

Differential Revision: D41569982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89816
Approved by: https://github.com/PaliC
2022-12-05 23:40:08 +00:00
2ea32f41f4 Fix XLA dynamo CI (#90229)
Fixes https://github.com/pytorch/xla/issues/4274

We should not access `subgraph` once it is deleted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90229
Approved by: https://github.com/voznesenskym
2022-12-05 22:38:11 +00:00
5d6aa99c45 Add sharding strategy to fully_shard (#90192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90192
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2022-12-05 22:20:25 +00:00
e4670885b9 Add a repro for fully_shard _unshard error (#90190)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90190
Approved by: https://github.com/awgu
2022-12-05 22:20:25 +00:00
0f274ed385 [Composable API] replicate: add support for DDP args (#89243)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89243
Approved by: https://github.com/zhaojuanmao
2022-12-05 21:38:23 +00:00
72fdfad4ad [FSDP][optim_state_dict][1/N] Restructure _optim_state_dict to prepare the support of use_orig_param (#89898)
**Motivation:**
Restructure some APIs in _optim_state_dict.py to allow better future extension, mostly for supporting use_orig_params. NO logic change in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89898
Approved by: https://github.com/awgu
2022-12-05 21:01:48 +00:00
2b20a3d3ef Simplify by using yield from (#90160)
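The kind of simplification applied, illustrated on a toy generator:
```
def flatten(list_of_lists):
    for sub in list_of_lists:
        yield from sub  # instead of: for item in sub: yield item
```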
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90160
Approved by: https://github.com/albanD, https://github.com/soulitzer
2022-12-05 20:48:05 +00:00
54858cce4e Fix issue 38095 TODOs in NCCL tests (#90033)
Fix TODOs related to https://github.com/pytorch/pytorch/issues/38095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90033
Approved by: https://github.com/awgu
2022-12-05 20:33:23 +00:00
7571134f69 [NNC] Use New PassManager for LLVM >= 15 (#89978)
This is needed because TargetMachine::adjustPassManager was removed in https://reviews.llvm.org/D137796. However, we need to keep around the old pass manager implementation for LLVM < 12.

Based on this: https://llvm.org/docs/NewPassManager.html

Tests: `./build/bin/test_tensorexpr` passes.

RUN_TORCHBENCH: nvfuser

Differential Revision: [D41636445](https://our.internmc.facebook.com/intern/diff/D41636445)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89978
Approved by: https://github.com/bertmaher
2022-12-05 19:19:36 +00:00
5de5c5e462 Assume that co_firstlineno is always defined (#90180)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90180
Approved by: https://github.com/albanD
2022-12-05 19:15:35 +00:00
1ea20cdb33 workaround for indexing formulas with negative terms (#89933)
Fixes https://github.com/pytorch/torchdynamo/issues/1928
For  `ModularIndexing` we generate indexing code with `//` and `%` operators. When `ModularIndexing` base is negative (that can happen after valid simplifications), `//` in triton produces wrong results https://github.com/openai/triton/issues/619/. For `//` op coming from pytorch, we have codegen workarounds, but I'm reluctant to apply these workarounds to very common indexing computation patterns, both for code readability and perf considerations.
Similarly, we replace `ModularIndexing` with `IndexingDiv` when we can prove that base is small, but those assumptions break when `ModularIndexing` base is negative (`ModularIndexing` is always positive, `IndexingDiv` isn't).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89933
Approved by: https://github.com/jansel
2022-12-05 19:12:29 +00:00
368a1cbd02 fix c10::detail::integer_iterator for C++17 (#90174)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/90174).
* __->__ #90174

fix c10::detail::integer_iterator for C++17

Summary: std::iterator is deprecated.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90174
Approved by: https://github.com/clee2000, https://github.com/malfet
2022-12-05 18:39:47 +00:00
5423c2f0e2 Light refactor to how we get shape_env for graph lowering (#90139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90139
Approved by: https://github.com/ezyang
2022-12-05 18:35:30 +00:00
32639a822c Fix missing line in XLA backend after mergebot + ghstack gap (#90197)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90197
Approved by: https://github.com/clee2000
2022-12-05 18:30:05 +00:00
7e034193bb [LTC] Restore default ctor for LazyTensor (#90086)
Summary:
This pull request introduces a temporary change that makes XLA's LTC migration easier. One step of that migration is to make XLATensor naively inherit from LazyTensor, which requires LazyTensor to have a default constructor.

Test Plan:
CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90086
Approved by: https://github.com/JackCaoG, https://github.com/kit1980
2022-12-05 18:26:37 +00:00
65a0dcffd8 [Composable API] replicate: change to per module call, remove mark_root_module() (#89222)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89222
Approved by: https://github.com/zhaojuanmao
2022-12-05 17:54:55 +00:00
8845a8f899 Revert "as_strided: Fix default storage_offset for reference implementation (#89513)"
This reverts commit eded97ac7224ad5f80334acf57a3b0c24f83d89f.

Reverted https://github.com/pytorch/pytorch/pull/89513 on behalf of https://github.com/peterbell10 due to broke master
2022-12-05 17:53:23 +00:00
6d794f6a4a [ONNX] Fix concat with empty tensors (#87620)
Fixes #54410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87620
Approved by: https://github.com/BowenBao
2022-12-05 17:36:31 +00:00
301d9c0556 Remove deprecated usage of is_pod/is_pod_v (#88918)
… as equivalent replacements for std::is_pod and std::is_pod_v because they are deprecated in C++20.

When consuming libtorch header files in a project that uses C++20, there are warnings about std::is_pod being deprecated.  This patch fixes that issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88918
Approved by: https://github.com/ezyang
2022-12-05 16:50:00 +00:00
b1eb42bcfd [4/4][DataPipe] Remove iterator depletion in Zipper (#89974)
Fixes: https://github.com/pytorch/data/issues/865

I will add another PR in torchdata to validate that this change solves the infinite datapipe problem (I have tested it locally). This is one of the more annoying stacks of PRs caused by the separation between TorchData and PyTorch.

There is a case where `file.close` is never called because the generator function never reaches the end. A simple example is `zip`-ping two datapipes of different lengths: the longer DataPipe never reaches the end of its generator and is eventually cleaned up by `gc`, so the `file.close` line is never executed. (This is the reason that Vitaly had to create this [hack](4451eb24e6/torch/utils/data/datapipes/iter/combining.py (L573-L583)) to retrieve all remaining data and make sure the generator function is fully executed.)

However, this hack introduces another problem where an infinite datapipe would make `zip` never end as it would try to deplete the infinite iterator. See: https://github.com/pytorch/data/issues/865

So, in this PR, I am adding a `try-finally` clause to make sure `file.close` is always executed during the destruction of the `generator` object. Then, we no longer need the hack within `zip`.

Differential Revision: [D41699469](https://our.internmc.facebook.com/intern/diff/D41699469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89974
Approved by: https://github.com/NivekT, https://github.com/wenleix
2022-12-05 16:45:34 +00:00
eded97ac72 as_strided: Fix default storage_offset for reference implementation (#89513)
This fixes the default storage_offset to take it from the input. This was
previously untested, so I've also added a new OpInfo which includes samples with
non-zero storage_offsets on the input tensor.
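A small sketch of the intended default, shown with the eager op (the PR makes the reference implementation match this behavior):

```
import torch

base = torch.arange(10.)
view = base[2:]                      # view.storage_offset() == 2
out = view.as_strided((4,), (1,))    # no storage_offset passed
print(out.storage_offset())          # 2 -> the offset is taken from the input, not reset to 0
```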
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89513
Approved by: https://github.com/ezyang, https://github.com/ngimel
2022-12-05 15:52:49 +00:00
199b8b6025 Remove deprecated flatten_params_wrapper.py from lintrunner config (#90154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90154
Approved by: https://github.com/awgu
2022-12-05 15:21:47 +00:00
7a08261a9c Fix fully_shard error when policy is not provided (#90151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90151
Approved by: https://github.com/awgu
2022-12-05 15:21:47 +00:00
777ac632fb Added vectorized flip for uint8 (#90013)
Following https://github.com/pytorch/pytorch/pull/89414#discussion_r1036224613, this just refactors and adds a `flip` method for `Vectorized<uint8>`. This should speed up the horizontal torch.flip implementation similarly to what is reported in https://github.com/pytorch/pytorch/pull/89414.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90013
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2022-12-05 12:23:28 +00:00
226e803ecb [Inductor] handle non-positive exponents in Pow (#90146)
Fixes #90125.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90146
Approved by: https://github.com/ezyang, https://github.com/jansel
2022-12-05 09:16:35 +00:00
41c3b41b92 Use dynamo fake tensor mode in aot_autograd, move aot_autograd compilation to lowering time [Merger of 89672 and 89773] (#90039)
After all of the preparatory commits, this is a subset of the
changes in https://github.com/pytorch/pytorch/pull/89392 that actually
change us to propagating fake tensors to backends.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

This is the merger of Ed's PR #89672, which is a rewrite of an older PR of mine (#89392), with CI Fixes on top of it (#89773)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90039
Approved by: https://github.com/ezyang
2022-12-05 01:56:50 +00:00
4648baa911 Revert "Use dynamo fake tensor mode in aot_autograd, move aot_autograd compilation to lowering time [Merger of 89672 and 89773] (#90039)"
This reverts commit ef0c7ec958439caf44a98fb7b70d920c6c2264b9.

Reverted https://github.com/pytorch/pytorch/pull/90039 on behalf of https://github.com/clee2000 due to broke xla tests ef0c7ec958 https://github.com/pytorch/pytorch/actions/runs/3606308473/jobs/6077646142
2022-12-04 21:57:30 +00:00
a580a63448 [codemod][llvm15] LLVM-15 fixes for caffe2/test/cpp/jit/test_module_api.cpp (#89938)
Summary: This fixes issues which block `caffe2/test/cpp/jit/test_module_api.cpp` from compiling with LLVM-15.

Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D41603454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89938
Approved by: https://github.com/soumith
2022-12-04 12:50:14 +00:00
d6c8603b98 Fix warning: use of bitwise '&' with boolean operands (#90131)
```
[130/1102] Building CXX object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cudnn/LossCTC.cpp.o
/home/gaoxiang/nvfuser5/aten/src/ATen/native/cudnn/LossCTC.cpp:97:11: warning: use of bitwise '&' with boolean operands [-Wbitwise-instead-of-logical]
          (target_lengths[b] < 256) & (target_lengths[b] <= input_lengths[b]);
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                    &&
/home/gaoxiang/nvfuser5/aten/src/ATen/native/cudnn/LossCTC.cpp:97:11: note: cast one or both operands to int to silence this warning
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90131
Approved by: https://github.com/kit1980
2022-12-04 08:47:20 +00:00
57bb4cd046 [Doc][Distributed] Add missing functions to distributed.rst (#89905)
Add missing documents for `torch.distributed.all_to_all_single` and other functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89905
Approved by: https://github.com/kit1980
2022-12-04 07:22:54 +00:00
f3aeed4960 Add generator argument to torch.rand docstring (#90071)
The documentation of `torch.rand` was missing the `generator` keyword argument in the function signature. However, the argument is explained in the documentation and `torch.rand` accepts that argument.
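A quick usage sketch of the documented argument:

```
import torch

g = torch.Generator().manual_seed(0)
a = torch.rand(3, generator=g)
b = torch.rand(3, generator=torch.Generator().manual_seed(0))
assert torch.equal(a, b)  # same seed and generator state -> identical samples
```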

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90071
Approved by: https://github.com/janeyx99
2022-12-04 07:19:24 +00:00
1a25e6f3c3 Fix indentation (#90110)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90110
Approved by: https://github.com/kit1980
2022-12-04 07:13:53 +00:00
7322f73c8f Fix exception cause in storage.py (#90118)
This change causes the correct message to be shown between the two tracebacks when an error is shown.

More context here: https://blog.ram.rachum.com/post/621791438475296768/improving-python-exception-chaining-with
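A generic Python illustration of the `raise ... from ...` pattern the fix applies (the module and message below are made up):

```
try:
    import some_missing_dependency  # hypothetical failing import
except ImportError as exc:
    # "from exc" makes Python print "The above exception was the direct cause
    # of the following exception" between the two tracebacks instead of the
    # misleading "During handling of the above exception, ..." message.
    raise RuntimeError("optional dependency is missing") from exc
```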
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90118
Approved by: https://github.com/kit1980
2022-12-04 06:51:25 +00:00
c00d395f05 Revert D41682843: Multisect successfully blamed D41682843 for test or build failures (#90132)
Summary:
This diff is reverting D41682843
D41682843 has been identified to be causing the following test or build failures:
Tests affected:
- https://www.internalfb.com/intern/test/281475048939643/

Here's the Multisect link:
https://www.internalfb.com/intern/testinfra/multisect/1444954
Here are the tasks that are relevant to this breakage:
T93770103: 5 tests started failing for oncall assistant_multimodal in the last 2 weeks
We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.

Test Plan: NA

Reviewed By: zyan0, atuljangra, YazhiGao

Differential Revision: D41710749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90132
Approved by: https://github.com/awgu
2022-12-04 05:35:17 +00:00
bda6ff0990 [1/4][DataPipe] Properly cleanup unclosed files within generator function (#89973)
There is a case where `file.close` is never called because the generator function never reaches the end. A simple example is `zip`-ping two datapipes of different lengths: the longer DataPipe never reaches the end of its generator and is eventually cleaned up by `gc`, so the `file.close` line is never executed. (This is the reason that Vitaly had to create this [hack](4451eb24e6/torch/utils/data/datapipes/iter/combining.py (L573-L583)) to retrieve all remaining data and make sure the generator function is fully executed.)

However, this hack introduces another problem where an infinite datapipe would make `zip` never end as it would try to deplete the infinite iterator. See: https://github.com/pytorch/data/issues/865

So, in this PR, I am adding a `try-finally` clause to make sure `file.close` is always executed during the destruction of the `generator` object. Then, we no longer need the hack within `zip`.
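A minimal plain-Python sketch of the pattern (not the actual DataPipe source):

```
import os, tempfile

path = os.path.join(tempfile.gettempdir(), "demo.txt")  # throwaway file for the demo
with open(path, "w") as f:
    f.write("a\nb\nc\n")

def read_lines(p):
    file = open(p)
    try:
        for line in file:
            yield line
    finally:
        # Runs on normal exhaustion *and* when a half-consumed generator is
        # destroyed (GeneratorExit), e.g. the longer side of a zip().
        file.close()

it = read_lines(path)
next(it)   # consume only part of the generator
del it     # destruction triggers the finally block, so the file is closed
```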

Differential Revision: [D41699470](https://our.internmc.facebook.com/intern/diff/D41699470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89973
Approved by: https://github.com/NivekT
2022-12-04 04:04:46 +00:00
2bca280a31 Revert D41683102: Multisect successfully blamed D41683102 for test or build failures (#90117)
Summary:
This diff is reverting D41683102
D41683102 has been identified to be causing the following test or build failures:
Tests affected:
- https://www.internalfb.com/intern/test/281475051072735/

Here's the Multisect link:
https://www.internalfb.com/intern/testinfra/multisect/1444960
Here are the tasks that are relevant to this breakage:
T124964606: 41 tests started failing for oncall ads_trainer_release in the last 2 weeks
We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.

Test Plan: NA

Reviewed By: jspark1105

Differential Revision: D41710842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90117
Approved by: https://github.com/soumith
2022-12-03 19:54:04 +00:00
e47af44eb8 [FSDP][Easy] Remove unused methods (#89229)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89229
Approved by: https://github.com/mrshenli
2022-12-03 17:55:27 +00:00
1ee189ce8e [FSDP] Issue warning when clamping to NO_SHARD (#90060)
Fixes https://github.com/pytorch/pytorch/issues/90050. I hope that this was not meant as an onboarding task :/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90060
Approved by: https://github.com/zhaojuanmao
2022-12-03 15:58:25 +00:00
4068c5467d [Reland] Move functorch/_src to torch/_functorch (#88756) (#90091)
This will be the last disruptive functorch internals change.

Why are we moving these files?
- As a part of rationalizing functorch we are moving the code in
functorch/_src to torch/_functorch
- This is so that we can offer the functorch APIs as native PyTorch APIs
(coming soon) and resolve some internal build issues.

Why are we moving all of these files at once?
- It's better to break developers all at once rather than many times

Test Plan:
- wait for tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90091
Approved by: https://github.com/anijain2305, https://github.com/ezyang
2022-12-03 14:17:15 +00:00
f7520cb51e Reduce memory usage requirement of test_pdist_norm_large in test_torch.py (#90075)
Basically the same fix as #85373, `/usr/bin/time` indicates that the memory requirement on the host-side was actually ~64GiB before the workaround and ~30GiB after.

CC @ptrblck @davidberard98

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90075
Approved by: https://github.com/davidberard98
2022-12-03 05:28:21 +00:00
61bd7fbacb [vision hash update] update the pinned vision hash (#90095)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90095
Approved by: https://github.com/pytorchbot
2022-12-03 03:10:09 +00:00
e53a0e391b [Easy] Remove unused parametrization (#90079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90079
Approved by: https://github.com/awgu
2022-12-03 03:03:13 +00:00
dd060f359e Test composable checkpoint wrapping FSDP submodules (#90078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90078
Approved by: https://github.com/awgu
2022-12-03 03:03:13 +00:00
a775204499 Fix issue 38095 TODO in test_dataloader.py (#90084)
Fix TODO related to https://github.com/pytorch/pytorch/issues/38095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90084
Approved by: https://github.com/clee2000, https://github.com/NivekT
2022-12-03 03:01:52 +00:00
ef0c7ec958 Use dynamo fake tensor mode in aot_autograd, move aot_autograd compilation to lowering time [Merger of 89672 and 89773] (#90039)
After all of the preparatory commits, this is a subset of the
changes in https://github.com/pytorch/pytorch/pull/89392 that actually
change us to propagating fake tensors to backends.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

This is the merger of Ed's PR #89672, which is a rewrite of an older PR of mine (#89392), with CI Fixes on top of it (#89773)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90039
Approved by: https://github.com/ezyang
2022-12-03 01:19:55 +00:00
9a1c6fd506 [pruning][core][feature] Align BaseStructuredPruner with existing pruning flow (#88436)
Summary:

This PR aligns the "eager" mode of the structured pruning flow with the existing unstructured pruning flow.

The base pruner has been moved and renamed from BasePruner to BaseStructuredPruner:
`torch/ao/pruning/_experimental/pruner/base_pruner.py -> torch/ao/pruning/_experimental/pruner/base_structured_pruner.py`

Support for pruning batchnorm modules in the config has been removed, so now the structured pruning code can use more of the BaseSparsifier logic and we don't need to override as many functions.

Since we aim to only support a single flow, we have only updated ZeroesParametrizations (FakeStructuredSparsity) and BiasHook.
The parameterizations have also been rewritten to use a bool mask tensor for keeping track of pruned rows, instead of using sets before.
This better aligns structured and unstructured sparsity.

The BaseStructuredSparsifier tests have also been updated to reflect the above changes. I also removed `squash_mask` tests because they were breaking CI and `squash_mask` is no longer used.

We will migrate the structured pruning code out of this folder in a later PR.

Test Plan:
```
python test/test_ao_sparsity -- TestBaseStructuredPruner
```

Reviewers:
z-a-f vkuzo

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88436
Approved by: https://github.com/vkuzo
2022-12-03 00:53:53 +00:00
d3f20a20b8 [reland][quant] Explictly set default quantized engine instead of relying on the order of supported_qengines (#89804) (#90036)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/86404

Test Plan:
ossci + sandcastle
Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90036
Approved by: https://github.com/andrewor14
2022-12-03 00:12:00 +00:00
65f38160f0 Fix issue 38095 TODOs in test_quantized_op.py (#89883)
Fix TODOs related to https://github.com/pytorch/pytorch/issues/38095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89883
Approved by: https://github.com/clee2000
2022-12-03 00:05:23 +00:00
29d1d8f3ef [Quant] Remove explicitly default QConfigMapping settings (#90066)
Summary: Previously we explicitly set a qconfig for ops
like conv and linear in the default QConfigMapping. However,
this makes it difficult for user to override the global and
have the new global take effect for basic ops. This commit
removes these explicit settings so the user can simply run
the following to quantize these ops.
```
qconfig_mapping = get_default_qconfig_mapping()
qconfig_mapping.set_global(my_qconfig)
```
There is no change in behavior for the default use case
of not setting anything on the default QConfigMapping.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_default_qconfig_mapping_override_global

Reviewers: vkuzo, jerryzh168

Subscribers: vkuzo, jerryzh168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90066
Approved by: https://github.com/vkuzo, https://github.com/jerryzh168
2022-12-02 23:33:47 +00:00
a306f85ea7 Update Persons of Interest (#90069)
Creates sections for contributors to MaskedTensor and NestedTensor and updates torchaudio.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90069
Approved by: https://github.com/drisspg, https://github.com/mikaylagawarecki, https://github.com/nateanl
2022-12-02 23:06:57 +00:00
9d54d3bec2 [NVFuser] undo v100 OOM skips (#90070)
Summary: I think these were just caused by parallel tests. After adjusting test settings to 1 thread, these stopped OOMing.

Test Plan:
```
$ buck2 test -j 1 mode/dev-nosan //caffe2/torch/csrc/jit/codegen/cuda:nvfuser
```
https://www.internalfb.com/intern/testinfra/testrun/6473924590389963

Differential Revision: D41643827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90070
Approved by: https://github.com/jjsjann123
2022-12-02 21:58:24 +00:00
74a090a744 Add integration test for composable fully_shard and checkpoint (#90041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90041
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2022-12-02 21:57:08 +00:00
cba96366a2 Revert "remove torch.equal usages (#89527)"
This reverts commit 4095ef8b809f922f2e0e09011afd00037d20a771.

Reverted https://github.com/pytorch/pytorch/pull/89527 on behalf of https://github.com/clee2000 due to broke periodic multigpu tests 4095ef8b80 https://github.com/pytorch/pytorch/actions/runs/3592806602/jobs/6049368502
2022-12-02 21:36:13 +00:00
e1532af0bb Fix meta registration for aten._cdist_forward (#90042)
Error from [7k github model](https://github.com/pytorch/torchdynamo/issues/1884).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90042
Approved by: https://github.com/ezyang, https://github.com/eellison
2022-12-02 21:13:52 +00:00
eb56b08f96 [FSDP] Fix clip_grad_norm_() for low prec grads (#90028)
For PyTorch FSDP, the only way that gradients are in low precision is if `keep_low_precision_grads=True` or if the user turns on AMP. This PR adds tests for the former and improves the documentation for `clip_grad_norm_()`, especially around these non-full-precision cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90028
Approved by: https://github.com/rohan-varma
2022-12-02 21:10:45 +00:00
688b767265 [FSDP] Fix keep_low_precision_grads=True for use_orig_params=True (#90027)
For any `flat_param.data = flat_param.to(...)` or `flat_param.grad.data = flat_param.grad.to(...)`, we must also refresh sharded parameter/gradient views, respectively, if the storage changes.

For `keep_low_precision_grads=True` and a sharded strategy, we cast the gradient back to the low precision using `.data` to bypass the PyTorch check that a parameter and its gradient have the same dtype. For `use_orig_params=True` before this PR, the gradient would incorrectly still be in full precision, not low precision, since we did not refresh views (this can actually be considered a memory leak since we have two copies of the gradient now, one in low precision and one in full precision). This PR refreshes the views.
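A tiny illustration of the `.data` cast described above (not the FSDP internals; the names are illustrative):

```
import torch

param = torch.nn.Parameter(torch.randn(4))   # full-precision parameter
param.grad = torch.randn(4)                  # gradient starts in full precision
# Assigning through .data bypasses the check that a parameter and its gradient
# share a dtype, which is how the low-precision gradient is kept around.
param.grad.data = param.grad.data.to(torch.float16)
print(param.dtype, param.grad.dtype)         # torch.float32 torch.float16
```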
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90027
Approved by: https://github.com/mrshenli
2022-12-02 21:10:45 +00:00
f5fbb5001f Revert "[follow-up] Python Attr Serialization (#88913)"
This reverts commit 086b251f9aeceaad95059de860ae81fd06526533.

Reverted https://github.com/pytorch/pytorch/pull/88913 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-12-02 20:14:11 +00:00
78bdb858f9 Call _sdp_attention in nn.functional.mha (#89470)
# Summary
Replaces the inline block of code in nn.functional.mha with `_scaled_dot_product_attention`. This function allows the fused kernels to be called if all the required input conditions are met.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89470
Approved by: https://github.com/cpuhrsch, https://github.com/mikekgfb
2022-12-02 19:46:22 +00:00
3916d729c8 [Dynamo] tensor.type() should return tensor types with CPU and GPU variants (#90021)
Fix errors from [7k github models](https://github.com/pytorch/torchdynamo/issues/1884)
```
Traceback (most recent call last):
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 1062, in get_fake_value
    return wrap_fake_exception(
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 739, in wrap_fake_exception
    return fn()
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 1063, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 1112, in run_node
    raise RuntimeError(
RuntimeError: Failed running call_function <function einsum at 0x7fd8f246a4c0>(*('i,j->ij', FakeTensor(FakeTensor(..., device='meta', size=(4,)), cpu), FakeTensor(FakeTensor(..., device='meta', size=(2,)), cuda:0)), **{}):
Unhandled FakeTensor Device Propagation for aten.mul.Tensor, found two different devices cpu, cuda:0
(scroll up for backtrace)
```

The root cause is: ```tensor.type()``` should return ```torch.cuda.FloatTensor``` rather than ```torch.FloatTensor``` if it's on GPU.
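A small sketch of the behavior the fix relies on (plain PyTorch, not Dynamo internals):

```
import torch

cpu_t = torch.randn(2)
print(cpu_t.type())        # 'torch.FloatTensor'
if torch.cuda.is_available():
    cuda_t = cpu_t.cuda()
    print(cuda_t.type())   # 'torch.cuda.FloatTensor' -- the CUDA variant
```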

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90021
Approved by: https://github.com/jansel
2022-12-02 18:57:43 +00:00
538f6279db Fix access to uninitialized memory in VSX vector functions (#89833)
This results in, e.g., failures in TestNNDeviceTypeCPU.test_groupnorm_nhwc_cpu_float32.

So simply initialize the stack array with zeroes, as expected and as done in other implementations.

Fixes #32502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89833
Approved by: https://github.com/ezyang
2022-12-02 18:19:42 +00:00
acd68f9097 [Reland] dont clone args (#89766)
Reland of https://github.com/pytorch/pytorch/pull/89519.

Improves first memory compression on pytorch struct from .55 -> .73. However, it doesn't totally eliminate the overhead from autotuning because of the 250mb cache clearing in triton benchmarking.

Relanding because previously we weren't accounting for inplace buffer reuse correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89766
Approved by: https://github.com/jansel
2022-12-02 17:20:40 +00:00
59101b6fe4 Fix binary iOS uploads (#90058)
curl on CircleCI MacOS runners does not support `--retry-all-errors`
Should fix https://app.circleci.com/pipelines/github/pytorch/pytorch/618606/workflows/6f104c19-3a3a-479d-a686-4961ddd87657/jobs/17233205
Yet another fallback of https://github.com/pytorch/pytorch/pull/89157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90058
Approved by: https://github.com/jeanschmidt
2022-12-02 14:28:19 +00:00
f62e54df8f Reland "Dynamo, FX, Inductor Progress Bars (#88384)" … (#90055)
This commit was landed internally and merged as a PR in an inconsistent state. This caused merge conflicts that required reverting in both places, normalizing the internal commit stack, and then re-landing properly.

Original commit: #88384 (011452a2a1c745d4b12f83f89eca039f482d134b)
Inconsistent revert: #90018 (8566aa7c0b4bdca50bf85ca14705b4304de030b3)
Revert of the inconsistent revert to restore healthy state (or re-land of the original commit): cf3c3f22804be6909e54fc09e07f891ab0886774
Landing the correct, internally congruent revert of the original commit: (This PR) #90055 (TBD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90055
Approved by: https://github.com/DanilBaibak, https://github.com/malfet
2022-12-02 13:28:00 +00:00
b87682f555 Fix gradcheck for CSR and CSC inputs. (#89786)
Partially fixes https://github.com/pytorch/pytorch/issues/87085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89786
Approved by: https://github.com/albanD
2022-12-02 12:35:20 +00:00
526e4aa5f8 Update to_sparse docs regarding the layout and blocksize kw arguments. (#89912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89912
Approved by: https://github.com/cpuhrsch
2022-12-02 12:23:15 +00:00
cf3c3f2280 Revert "Revert "Dynamo, FX, Inductor Progress Bars (#88384)" (#90018)"
This reverts commit bcf4292f04eda6c21cab18aa70cad6b2887c8b78.

Reverted https://github.com/pytorch/pytorch/pull/90018 on behalf of https://github.com/jeanschmidt due to landed internal commit does not match with this one, causing merge conflict and preventing import and land new commits
2022-12-02 09:57:31 +00:00
0bde810572 Add more debug information for Inductor (#90008)
- Add graph index to the profile information of the Inductor kernel for better debugability.

  The generated code for different graphs could produce kernels with the same name. The side effect is that it is hard to identify the portion of E2E performance for these kernels because the profiler will aggregate the performance with the same kernel name regardless of different graphs. Hence, this PR added the graph index to the profile information to address this limitation.

- Label arbitrary code ranges for `eager` and `opt` modes for better debugability

  The profile information of dynamo benchmarks mixes the eager mode and opt mode. It is hard to separate the range for different modes. This PR added eager and opt marks to the profile information to address this limitation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90008
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-02 09:34:48 +00:00
6f4dea562d Implement post and pre hooks for optimizer (#89176)
Fixes #88446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89176
Approved by: https://github.com/albanD
2022-12-02 07:03:45 +00:00
adc1a94ef4 Add tests for custom pybind type_casters (#89897)
This is a followup to #89115 which Fixes #88958

This adds tests to verify at runtime that the types returned by custom pybind type_casters are correctly specified in the second argument to `PYBIND11_TYPE_CASTER`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89897
Approved by: https://github.com/ezyang
2022-12-02 07:02:09 +00:00
b703e4b3c2 Add hierarchical module names to torchFX graph.node #87659 (#87742)
Fixes #87659

Pass down the module hierarchy from module.named_modules() to the name field of graph.node.
This makes it so the name of each node contains descriptive information about the network architecture.
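A generic FX tracing snippet, illustrative only: it shows where node names live but does not reproduce the exact before/after naming of this change:

```
import torch
import torch.fx as fx

class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, 1)
    def forward(self, x):
        return self.conv(x)

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.block = Block()
    def forward(self, x):
        return self.block(x)

gm = fx.symbolic_trace(Net())
for node in gm.graph.nodes:
    # node.name is the field this change enriches with the module hierarchy
    print(node.op, node.name, node.target)
```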

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87742
Approved by: https://github.com/jerryzh168
2022-12-02 05:58:06 +00:00
9dffc56008 Intel compiler support in c10/util/TypeIndex.h (#89610)
Build passed with icc (ICC) 2021.7.1 20221019.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89610
Approved by: https://github.com/kit1980
2022-12-02 05:32:21 +00:00
9013c92a9f [ao] making QConfigMapping print in a user friendly way (#89932)
Summary: added __repr__ to QConfigMapping and QConfigMultiMapping
loosely based on __repr__ for BaseSparsifier

example output:

```
>>> import torch
>>> print(torch.ao.quantization.qconfig_mapping.get_default_qconfig_mapping())
QConfigMapping (
 global_qconfig
  QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
 object_type_qconfigs
  reshape: QConfig(activation=<class 'torch.ao.quantization.observer.ReuseInputObserver'>, weight=<class 'torch.ao.quantization.observer.NoopObserver'>)
  <class 'torch.nn.modules.conv.Conv1d'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
  <class 'torch.nn.modules.conv.Conv2d'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
  <class 'torch.nn.modules.conv.Conv3d'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
  <class 'torch.nn.modules.conv.ConvTranspose1d'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  <class 'torch.nn.modules.conv.ConvTranspose2d'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  <class 'torch.nn.modules.conv.ConvTranspose3d'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  <class 'torch.nn.modules.linear.Linear'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
  <built-in method conv1d of type object at 0x7f08b99497e0>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
  <built-in method conv2d of type object at 0x7f08b99497e0>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
  <built-in method conv3d of type object at 0x7f08b99497e0>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
  <built-in method conv_transpose1d of type object at 0x7f08b99497e0>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  <built-in method conv_transpose2d of type object at 0x7f08b99497e0>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  <built-in method conv_transpose3d of type object at 0x7f08b99497e0>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  <built-in function linear>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
  <class 'torch.nn.modules.activation.ReLU'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
  <function relu at 0x7f08ad57bc10>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
  <built-in method relu of type object at 0x7f08b99497e0>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
  <class 'torch.nn.modules.batchnorm.BatchNorm1d'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
  <class 'torch.nn.modules.batchnorm.BatchNorm2d'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
  <class 'torch.nn.modules.batchnorm.BatchNorm3d'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=functools.partial(<class 'torch.ao.quantization.observer.PerChannelMinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_channel_symmetric){})
  <function layer_norm at 0x7f08ad57fca0>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=<class 'torch.ao.quantization.observer.PlaceholderObserver'>)
  <class 'torch.nn.modules.normalization.LayerNorm'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.HistogramObserver'>, reduce_range=True){}, weight=<class 'torch.ao.quantization.observer.PlaceholderObserver'>)
  <class 'torch.nn.modules.activation.Hardsigmoid'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.FixedQParamsObserver'>, scale=0.00390625, zero_point=0, dtype=torch.quint8, quant_min=0, quant_max=255){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  <function hardsigmoid at 0x7f08ad57f670>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.FixedQParamsObserver'>, scale=0.00390625, zero_point=0, dtype=torch.quint8, quant_min=0, quant_max=255){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  hardsigmoid: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.FixedQParamsObserver'>, scale=0.00390625, zero_point=0, dtype=torch.quint8, quant_min=0, quant_max=255){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  hardsigmoid_: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.FixedQParamsObserver'>, scale=0.00390625, zero_point=0, dtype=torch.quint8, quant_min=0, quant_max=255){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  <class 'torch.nn.modules.activation.Sigmoid'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.FixedQParamsObserver'>, scale=0.00390625, zero_point=0, dtype=torch.quint8, quant_min=0, quant_max=255){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  <built-in method sigmoid of type object at 0x7f08b99497e0>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.FixedQParamsObserver'>, scale=0.00390625, zero_point=0, dtype=torch.quint8, quant_min=0, quant_max=255){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  sigmoid: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.FixedQParamsObserver'>, scale=0.00390625, zero_point=0, dtype=torch.quint8, quant_min=0, quant_max=255){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  sigmoid_: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.FixedQParamsObserver'>, scale=0.00390625, zero_point=0, dtype=torch.quint8, quant_min=0, quant_max=255){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  <class 'torch.nn.modules.activation.Softmax'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.FixedQParamsObserver'>, scale=0.00390625, zero_point=0, dtype=torch.quint8, quant_min=0, quant_max=255){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  <class 'torch.nn.modules.activation.Tanh'>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.FixedQParamsObserver'>, scale=0.0078125, zero_point=128, dtype=torch.quint8, quant_min=0, quant_max=255){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  <built-in method tanh of type object at 0x7f08b99497e0>: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.FixedQParamsObserver'>, scale=0.0078125, zero_point=128, dtype=torch.quint8, quant_min=0, quant_max=255){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  tanh: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.FixedQParamsObserver'>, scale=0.0078125, zero_point=128, dtype=torch.quint8, quant_min=0, quant_max=255){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
  tanh_: QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.FixedQParamsObserver'>, scale=0.0078125, zero_point=128, dtype=torch.quint8, quant_min=0, quant_max=255){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MinMaxObserver'>, dtype=torch.qint8, qscheme=torch.per_tensor_symmetric){})
 module_name_regex_qconfigs
  OrderedDict()
 module_name_qconfigs
  OrderedDict()
 module_name_object_type_order_qconfigs
  OrderedDict()
)
```

Test Plan: python test/test_quantization.py
TestFXNumericSuiteNShadows.test_qconfig_multi_mapping_repr

python test/test_quantization.py
TestQuantizeFx.test_qconfig_mapping_repr
Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89932
Approved by: https://github.com/vkuzo
2022-12-02 05:24:47 +00:00
5f881ac2d1 Adding dispatch alias 'FuncTorchBatchedDecomposition' (#88771)
part of https://github.com/pytorch/functorch/issues/1009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88771
Approved by: https://github.com/zou3519
2022-12-02 04:38:28 +00:00
6addc8d923 [Inductor] add expm1 lowering (#89961)
Improves perf of inductor no-cudagraphs on nvidia-deeprecommender from 0.88 -> .96. I am looking into disabling implicit fallbacks for benchmark models in another pr.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89961
Approved by: https://github.com/ngimel
2022-12-02 04:29:54 +00:00
42f27c322b TorchDynamo: don't compute index for max_pooling when return_index is false (#89838)
For max pooling, if return_index is **False**, we don't need to compute the index.

Before:

```
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<3; i2+=1)
            {
                #pragma GCC ivdep
                for(long i3=0; i3<3; i3+=1)
                {
                    {
                        {
                            auto tmp0 = in_ptr0[i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp2 = in_ptr0[3 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp7 = in_ptr0[6 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp12 = in_ptr0[21 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp17 = in_ptr0[24 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp22 = in_ptr0[27 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp27 = in_ptr0[42 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp32 = in_ptr0[45 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp37 = in_ptr0[48 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp1 = static_cast<long>((2*i2) + (14*i1));
                            auto tmp3 = static_cast<long>(1 + (2*i2) + (14*i1));
                            auto tmp4 = tmp2 > tmp0;
                            auto tmp5 = tmp4 ? tmp3 : tmp1;
                            auto tmp6 = (tmp0 != tmp0) ? tmp0 : std::max(tmp2, tmp0);
                            auto tmp8 = static_cast<long>(2 + (2*i2) + (14*i1));
                            auto tmp9 = tmp7 > tmp6;
                            auto tmp10 = tmp9 ? tmp8 : tmp5;
                            auto tmp11 = (tmp6 != tmp6) ? tmp6 : std::max(tmp7, tmp6);
                            auto tmp13 = static_cast<long>(7 + (2*i2) + (14*i1));
                            auto tmp14 = tmp12 > tmp11;
                            auto tmp15 = tmp14 ? tmp13 : tmp10;
                            auto tmp16 = (tmp11 != tmp11) ? tmp11 : std::max(tmp12, tmp11);
                            auto tmp18 = static_cast<long>(8 + (2*i2) + (14*i1));
                            auto tmp19 = tmp17 > tmp16;
                            auto tmp20 = tmp19 ? tmp18 : tmp15;
                            auto tmp21 = (tmp16 != tmp16) ? tmp16 : std::max(tmp17, tmp16);
                            auto tmp23 = static_cast<long>(9 + (2*i2) + (14*i1));
                            auto tmp24 = tmp22 > tmp21;
                            auto tmp25 = tmp24 ? tmp23 : tmp20;
                            auto tmp26 = (tmp21 != tmp21) ? tmp21 : std::max(tmp22, tmp21);
                            auto tmp28 = static_cast<long>(14 + (2*i2) + (14*i1));
                            auto tmp29 = tmp27 > tmp26;
                            auto tmp30 = tmp29 ? tmp28 : tmp25;
                            auto tmp31 = (tmp26 != tmp26) ? tmp26 : std::max(tmp27, tmp26);
                            auto tmp33 = static_cast<long>(15 + (2*i2) + (14*i1));
                            auto tmp34 = tmp32 > tmp31;
                            auto tmp35 = tmp34 ? tmp33 : tmp30;
                            auto tmp36 = (tmp31 != tmp31) ? tmp31 : std::max(tmp32, tmp31);
                            auto tmp38 = static_cast<long>(16 + (2*i2) + (14*i1));
                            auto tmp39 = tmp37 > tmp36;
                            auto tmp40 = tmp39 ? tmp38 : tmp35;
                            auto tmp41 = (tmp36 != tmp36) ? tmp36 : std::max(tmp37, tmp36);
                            out_ptr0[i3 + (3*i2) + (9*i1) + (27*i0)] = tmp41;
                        }
                    }
                }
            }
        }
    }
}
''')
```
After:

```
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<3; i2+=1)
            {
                #pragma GCC ivdep
                for(long i3=0; i3<3; i3+=1)
                {
                    {
                        {
                            auto tmp0 = in_ptr0[i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp1 = in_ptr0[3 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp3 = in_ptr0[6 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp5 = in_ptr0[21 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp7 = in_ptr0[24 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp9 = in_ptr0[27 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp11 = in_ptr0[42 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp13 = in_ptr0[45 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp15 = in_ptr0[48 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp2 = (tmp0 != tmp0) ? tmp0 : std::max(tmp1, tmp0);
                            auto tmp4 = (tmp2 != tmp2) ? tmp2 : std::max(tmp3, tmp2);
                            auto tmp6 = (tmp4 != tmp4) ? tmp4 : std::max(tmp5, tmp4);
                            auto tmp8 = (tmp6 != tmp6) ? tmp6 : std::max(tmp7, tmp6);
                            auto tmp10 = (tmp8 != tmp8) ? tmp8 : std::max(tmp9, tmp8);
                            auto tmp12 = (tmp10 != tmp10) ? tmp10 : std::max(tmp11, tmp10);
                            auto tmp14 = (tmp12 != tmp12) ? tmp12 : std::max(tmp13, tmp12);
                            auto tmp16 = (tmp14 != tmp14) ? tmp14 : std::max(tmp15, tmp14);
                            out_ptr0[i3 + (3*i2) + (9*i1) + (27*i0)] = tmp16;
                        }
                    }
                }
            }
        }
    }
}
''')

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89838
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-02 04:15:45 +00:00
f623b123f0 [Inductor] Do not install g++12 by default (#90038)
Unless the `TORCH_INDUCTOR_INSTALL_GXX` environment variable is defined (which is the case for CI).

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90038
Approved by: https://github.com/albanD
2022-12-02 04:13:58 +00:00
b058a02786 TorchDynamo: enable convolution bn folding for functional bn (#89746)
Motivation: Timm models often use a custom-defined BN that calls F.batch_norm directly (https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/layers/norm_act.py#L26), and the fx graph will look like:
```
-------------  ----------------------  ---------------------------------------  ---------------------------------------------------------------------------------------------------------  --------
placeholder    x                       x                                        ()                                                                                                         {}
call_module    self_conv               self_conv                                (x,)                                                                                                       {}
get_attr       self_bn_running_mean_1  self_bn_running_mean                     ()                                                                                                         {}
get_attr       self_bn_running_var     self_bn_running_var                      ()                                                                                                         {}
get_attr       self_bn_weight          self_bn_weight                           ()                                                                                                         {}
get_attr       self_bn_bias            self_bn_bias                             ()                                                                                                         {}
call_function  batch_norm              <function batch_norm at 0x7f07196cdf70>  (self_conv, self_bn_running_mean_1, self_bn_running_var, self_bn_weight, self_bn_bias, False, 0.1, 1e-05)  {}
call_module    self_bn_drop            self_bn_drop                             (batch_norm,)
```

The original conv+bn folding path doesn't work for **F.batch_norm**, but in the **F.batch_norm** case, if its parameters are constant (attributes of the module that will not be updated), we can also apply the constant-folding optimization. This PR enables it and will improve Timm models' performance.
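A minimal sketch of the functional-BN pattern described above (modeled loosely on timm's norm layers; illustrative, not the benchmark code):

```
import torch
import torch.nn.functional as F

class MyBatchNorm(torch.nn.BatchNorm2d):
    def forward(self, x):
        # Functional call: shows up in the fx graph as F.batch_norm with the
        # module's buffers/parameters wired in through get_attr nodes.
        return F.batch_norm(
            x, self.running_mean, self.running_var,
            self.weight, self.bias, self.training, self.momentum, self.eps,
        )

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), MyBatchNorm(8)).eval()
y = model(torch.randn(1, 3, 16, 16))
# In eval mode the BN statistics and affine parameters are constant, so the
# conv weights can be folded with the BN as a constant-folding optimization.
```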

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89746
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-12-02 04:13:34 +00:00
3162a48a77 [dynamo][benchmarks] Call zero grad (#90026)
Hoping that it might reduce some flakiness

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90026
Approved by: https://github.com/williamwen42
2022-12-02 04:05:57 +00:00
63e57280fc [Profiler] Memory profiler part 13: Add sizes to timeline. (#89356)
If we see an allocation the size is unambiguous. Otherwise we have to use sizes and strides to bound the underlying storage.

Differential Revision: [D40868660](https://our.internmc.facebook.com/intern/diff/D40868660/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89356
Approved by: https://github.com/chaekit
2022-12-02 03:55:22 +00:00
6727e537a7 [Profiler] Memory profiler part 12: Emit timeline of memory events. (#89355)
Add a simple interface to get a flat representation of the memory profile.

Differential Revision: [D40868663](https://our.internmc.facebook.com/intern/diff/D40868663/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89355
Approved by: https://github.com/chaekit
2022-12-02 03:55:22 +00:00
342139589c [quant][fx] Add support for matching multiple arguments in patterns (#89986)
Summary:
This PR adds support for matching patterns that has multiple arguments, it's needed for quantization in PyTorch 2.0 early prototype

Before this PR, we only support patterns like:
```
x -> conv -> bn -> relu
(relu, (bn, conv))
```
where each operator has a single node, the code breaks when we want to match a pattern that has an op that has multiple arguments, such as:
```
                           shape \
        transpose -> reshape -> output ->
```
where `reshape` has two arguments

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_match_pattern_with_multiple_args

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89986
Approved by: https://github.com/vkuzo
2022-12-02 03:28:32 +00:00
4176102407 [vision hash update] update the pinned vision hash (#90035)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90035
Approved by: https://github.com/pytorchbot
2022-12-02 03:17:56 +00:00
39937b84cd Change periodic concurrency group (#89850)
It hasn't been running the mem leak check because it keeps getting cancelled in favor of a higher-priority job.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89850
Approved by: https://github.com/malfet, https://github.com/seemethere
2022-12-02 02:40:27 +00:00
d09c52e4fd [inductor] Deterministic kernel names (#89713)
`node.origins` is a set and does not have an order. Therefore, inductor experiments with and without cudagraphs generate different kernel names, which makes debugging harder.
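An illustrative snippet of the underlying issue (not the inductor code): iterating a set of strings is not stable across interpreter runs because of hash randomization, so names built from it can differ between otherwise identical runs; sorting restores determinism.

```
origins = {"add", "mul", "relu"}
nondeterministic = "_".join(origins)         # order may change from run to run
deterministic = "_".join(sorted(origins))    # always "add_mul_relu"
print(nondeterministic, deterministic)
```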

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89713
Approved by: https://github.com/soumith, https://github.com/mlazos, https://github.com/ngimel
2022-12-02 02:37:36 +00:00
8b2f9887bf update quantization doc: add x86 backend as default backend of server inference (#86794)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86794
Approved by: https://github.com/jgong5, https://github.com/kit1980
2022-12-02 02:10:25 +00:00
69d7afc799 [LTC] Remove noop_execution_mode_ (#89989)
Summary:
noop_execution_mode_ doesn't seem to be useful anymore. Let's remove it.

Test Plan:
CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89989
Approved by: https://github.com/desertfire, https://github.com/JackCaoG
2022-12-02 01:51:30 +00:00
342d78d1a2 Cache guards once per variable tracker, rather than re-propagating them repeatedly (#89827)
This improves optimizer tracing performance significantly (2x). In essence this just removes the recursion from propagate because it is not necessary: ListVariables and ConstDictVariables already contain the guards from the items they hold.

Adds two other optimizations for special cases of `recursively_contains`

helps with https://github.com/pytorch/torchdynamo/issues/1803

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89827
Approved by: https://github.com/anijain2305, https://github.com/jansel
2022-12-02 01:45:05 +00:00
6efedfd774 Revert D41609017: Multisect successfully blamed D41609017 for test or build failures (#90034)
Summary:
This diff is reverting D41609017
D41609017 has been identified to be causing the following test or build failures:
Tests affected:
- https://www.internalfb.com/intern/test/281475052567659/
- https://www.internalfb.com/intern/test/562950029295825/

Here's the Multisect link:
https://www.internalfb.com/intern/testinfra/multisect/1440332
Here are the tasks that are relevant to this breakage:
T93368156: 5 tests started failing for oncall admarket_predictor_pushmaster in the last 2 weeks
We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.

Test Plan: NA

Reviewed By: zyan0

Differential Revision: D41656946

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90034
Approved by: https://github.com/awgu
2022-12-02 01:31:50 +00:00
c63afb283c Disable dynamo on optimizer lazy initialization (#89902)
Helps with https://github.com/pytorch/torchdynamo/issues/1803

Separate out the group initialization and disable dynamo on it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89902
Approved by: https://github.com/soumith, https://github.com/albanD
2022-12-02 01:15:11 +00:00
d94f5c784c Fix binary testing if torchtriton is mandatory (#90017)
Prep change for the builder, where torchtriton is installed from a custom nightly downloads repo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90017
Approved by: https://github.com/seemethere
2022-12-02 01:05:01 +00:00
f628f2ed73 [QNNPACK] Fix Memory Leak in QNNPACK QSoftmax Op (#89544)
Summary:
The deleter of the operator's unique_ptr doesn't get called unless the unique_ptr is created after the op has been created

This fixes the problem reported in
https://fb.workplace.com/groups/pytorch.edge.users/posts/1210708329799458/

Test Plan:
# Testing memory leak fix

**With test code added in D41487340:**
```
cd ~/fbsource/xplat
buck run caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test:qsoftmax_test
```

Before this diff:

```
==2060866==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 608 byte(s) in 1 object(s) allocated from:
    #0 0x41bcd27 in calloc (/data/users/salilsdesai/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test/qsoftmax_test+0x41bcd27)
    #1 0x405b692 in pytorch_qnnp_create_softargmax_nc_q8 xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/softargmax.c:77

Indirect leak of 1024 byte(s) in 1 object(s) allocated from:
    #0 0x41bcb7f in malloc (/data/users/salilsdesai/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test/qsoftmax_test+0x41bcb7f)
    #1 0x405b6a8 in pytorch_qnnp_create_softargmax_nc_q8 xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/softargmax.c:85

SUMMARY- AddressSanitizer: 1632 byte(s) leaked in 2 allocation(s).
```

After this diff:
- No errors
___

# Testing op correctness

```
cd ~/fbsource/fbcode
buck test caffe2/test/quantization:quantization -- test_qsoftmax
```
Passes
- https://www.internalfb.com/intern/testinfra/testconsole/testrun/2814749908834332/

Differential Revision: D41487341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89544
Approved by: https://github.com/mcr229
2022-12-01 23:34:36 +00:00
7bd284495a Add non-reentrant checkpoint to composable APIs (#90015)
Differential Revision: [D41661027](https://our.internmc.facebook.com/intern/diff/D41661027)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90015
Approved by: https://github.com/zhaojuanmao
2022-12-01 23:05:55 +00:00
a5430e1067 [UCC] Properly finalize unsuccessful collective posts (#89306)
This PR adds a `ucc_collective_finalize` call if `ucc_collective_post` and `ucc_collective_triggered_post` were not successful.
According to the [UCC documentation](https://openucx.github.io/ucc/api/v1.1/html/group___u_c_c___c_o_l_l_e_c_t_i_v_e_s.html):
```
On error, request handle becomes invalid, user is responsible to call ucc_collective_finalize to free allocated resources.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89306
Approved by: https://github.com/kwen2501
2022-12-01 23:01:45 +00:00
063bbeb3ba Revert "[quant] Explictly set default quantized engine instead of relying on the order of supported_qengines (#89804)"
This reverts commit 607ff6f4c10914a2a46bab90577cd083a6b3d46d.

Reverted https://github.com/pytorch/pytorch/pull/89804 on behalf of https://github.com/clee2000 due to breaking tests 607ff6f4c1 https://github.com/pytorch/pytorch/actions/runs/3596841274/jobs/6058297637 (the trunk label didn't kick off workflows fast enough)
2022-12-01 22:39:46 +00:00
29ea1c9c8e [doc] update dtensor readme (#89991)
Fixed some import errors in the dtensor README.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89991
Approved by: https://github.com/wanchaol
2022-12-01 22:16:39 +00:00
6f5945e4bb triton supports devices < 7.0, not 6.0 (#90020)
triton is still buggy with Pascal devices, so make the error checker reflect that.

Also, the < 6.0 threshold never worked, as the `has_triton` definition in utils.py was checking for >= 7.0.
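
For reference, the kind of capability gate described here looks roughly like the following sketch (the function name is illustrative, not the actual `has_triton` implementation):
```python
import torch

# Rough sketch of the capability check discussed above (illustrative only):
# Triton is treated as usable only on devices with compute capability >= 7.0 (Volta and newer).
def triton_supported() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (7, 0)
```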

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90020
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2022-12-01 22:01:41 +00:00
607ff6f4c1 [quant] Explicitly set default quantized engine instead of relying on the order of supported_qengines (#89804)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/86404

Test Plan:
ossci + sandcastle
Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D41635738](https://our.internmc.facebook.com/intern/diff/D41635738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89804
Approved by: https://github.com/andrewor14
2022-12-01 21:52:59 +00:00
d04480a6b5 [Vulkan][TCC] Add tests for quantized add, sub, mul and div (#89578)
Summary: Added randomized test for quantized add, sub, mul and div

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Differential Revision: D41047094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89578
Approved by: https://github.com/digantdesai
2022-12-01 21:38:27 +00:00
8aee768025 [quant][be] Merge qconfig_mapping_utils.py in quantization and fx folders (#89979)
Summary:
att, no functionality changes

Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89979
Approved by: https://github.com/vkuzo
2022-12-01 21:25:53 +00:00
0ad6715b7b [aarch64] add sleef_arm dependency (#89988)
Reviewed By: kimishpatel, psaab

Differential Revision: D41601965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89988
Approved by: https://github.com/soumith
2022-12-01 21:10:53 +00:00
07be48de37 [chalf] relax tolerance : conv_transpose2d (#89993)
Fixes https://github.com/pytorch/pytorch/issues/87332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89993
Approved by: https://github.com/lezcano
2022-12-01 21:03:14 +00:00
ca5526cf1f [tp] ufmt test/distributed/tensor (#89970)
Formatting stack to make dtensor and tp align with the PyTorch format standard.

cmd: `ufmt format test/distributed/tensor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89970
Approved by: https://github.com/fduwjj
2022-12-01 20:58:16 +00:00
9b5e6b029f [tp] ufmt distributed.tensor.parallel (#89969)
cmd: `ufmt format torch/distributed/tensor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89969
Approved by: https://github.com/fduwjj
2022-12-01 20:58:16 +00:00
c37c5163da [dtensor] ufmt test/distributed/_tensor (#89968)
cmd: `ufmt format test/distributed/_tensor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89968
Approved by: https://github.com/fduwjj
2022-12-01 20:58:15 +00:00
bf23e0bdbd [dtensor] ufmt distributed._tensor (#89967)
cmd: `ufmt format torch/distributed/_tensor`

copy from Andrew:

Notes for VSCode users:

- Install ufmt: https://pypi.org/project/ufmt/
- Install the VSCode ufmt extension: https://marketplace.visualstudio.com/items?itemName=omnilib.ufmt
- Include in settings.json:
```
{
    "[python]": {
        "editor.defaultFormatter": "omnilib.ufmt",
        "editor.formatOnSave": true,
    },
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89967
Approved by: https://github.com/fduwjj
2022-12-01 20:58:13 +00:00
768bd3fb4a Add torch.compile implementation (#89607)
`torch.compile` can be used either as a decorator or to optimize a model directly, for example:
```
@torch.compile
def foo(x):
  return torch.sin(x) + x.max()
```
or
```
mod = torch.nn.ReLU()
optimized_mod = torch.compile(mod, mode="max-autotune")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89607
Approved by: https://github.com/soumith
2022-12-01 20:17:52 +00:00
bcf4292f04 Revert "Dynamo, FX, Inductor Progress Bars (#88384)" (#90018)
This breaks in environments that use the fake tqdm (015b05af18/torch/hub.py (L26)), which doesn't support the 'desc' kwarg and is not iterable

Original try using pytorchbot did not go through because of a merge
conflict: https://github.com/pytorch/pytorch/pull/88384#issuecomment-1334272489

This reverts commit 011452a2a1c745d4b12f83f89eca039f482d134b.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90018
Approved by: https://github.com/drisspg, https://github.com/dbort
2022-12-01 20:17:07 +00:00
015b05af18 Editorial pass on Dynamo docs (#89921)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89921
Approved by: https://github.com/msaroufim
2022-12-01 18:53:16 +00:00
b2f340557a [ONNX] Supports scatter_add with different static shape of src and index (#89787)
Prior to this change, the converter didn't support `scatter_add` with different shapes of `src` and `index`, while [it is supported by PyTorch](https://pytorch.org/docs/stable/generated/torch.Tensor.scatter_add_.html#torch.Tensor.scatter_add_), where the scatter is accommodated to the `index` shape. This PR adds `onnx::Slice` to adjust the shape of `src` when a static, mismatched shape is found. However, if both shapes (src and index) are dynamic, ONNX expects them to be the same shape per the spec. More ScatterElements details on https://github.com/onnx/onnx/issues/4672
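
For reference, the PyTorch eager behavior being matched here allows `index` to be smaller than `src`; a minimal sketch (the values are made up):
```python
import torch

# src and index have different (static) shapes; only the index-shaped region of src is scattered.
dst = torch.zeros(3, 5)
src = torch.ones(2, 5)
index = torch.tensor([[0, 1, 2, 0]])  # smaller than src in both dimensions
out = dst.scatter_add(0, index, src)
```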
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89787
Approved by: https://github.com/BowenBao
2022-12-01 18:25:22 +00:00
d80056312a [Quant][fx][bc-breaking] Rename fx/*patterns.py (#89872)
Summary: This commit renames fx/quantization_patterns.py
to fx/quantize_handler.py, and fx/fusion_patterns.py to
fx/fuse_handler.py. This is because these files contain
only QuantizeHandler and FuseHandler respectively, so the
new names are more descriptive. A future commit will
further break BC by removing all the empty *QuantizeHandler
classes.

BC-breaking notes:

The following classes under the
`torch.ao.quantization.fx.quantization_patterns` namespace
are migrated to the `torch.ao.quantization.fx.quantize_handler`
namespace:
```
QuantizeHandler
BinaryOpQuantizeHandler
CatQuantizeHandler
ConvReluQuantizeHandler
LinearReLUQuantizeHandler
BatchNormQuantizeHandler
EmbeddingQuantizeHandler
RNNDynamicQuantizeHandler
DefaultNodeQuantizeHandler
FixedQParamsOpQuantizeHandler
CopyNodeQuantizeHandler
GeneralTensorShapeOpQuantizeHandler
CustomModuleQuantizeHandler
StandaloneModuleQuantizeHandler
```

The following classes under the
`torch.ao.quantization.fx.fusion_patterns` namespace are
migrated to the `torch.ao.quantization.fx.fuse_handler`
namespace:
```
DefaultFuseHandler
FuseHandler
```

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Reviewers: jerryzh168, vkuzo

Subscribers: jerryzh168, vkuzo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89872
Approved by: https://github.com/jerryzh168
2022-12-01 17:37:07 +00:00
314e7c37c3 fix citation file in MANIFEST (#89994)
#86200 changed the `CITATION` file to `CITATION.cff`, but this change was not reflected in `MANIFEST.in`. This means `CITATION.cff` will not be included in wheels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89994
Approved by: https://github.com/malfet
2022-12-01 15:21:54 +00:00
a5532929da Remove DDP import (#89982)
This import is only used for typing; removing it avoids a circular reference
in the next diffs.

Differential Revision: [D41636897](https://our.internmc.facebook.com/intern/diff/D41636897/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89982
Approved by: https://github.com/zhaojuanmao
2022-12-01 14:56:48 +00:00
5a36d99845 Add error repro test for FSDP ignored modules with mixed precision (#89971)
The ignored modules are still using the original precision, which
leads to the following error.

```
RuntimeError: mat1 and mat2 must have the same dtype
```

This is not blocking me at the moment, but the fix seems not too
hard. We can add a pre-forward hook to each ignored module to
convert activations to original precision, and a post-forward hook
to convert it back to the specified precision.
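
A rough sketch of the hook-based workaround suggested above (illustrative only, not the committed fix; the dtypes and module are placeholders):
```python
import torch
import torch.nn as nn

def keep_module_in_full_precision(module: nn.Module, low_dtype=torch.float16):
    # Cast activations entering the ignored module back to float32...
    module.register_forward_pre_hook(
        lambda m, args: tuple(a.float() if torch.is_tensor(a) else a for a in args)
    )
    # ...and cast its outputs back to the low precision used by the rest of the model.
    module.register_forward_hook(
        lambda m, args, out: out.to(low_dtype) if torch.is_tensor(out) else out
    )

ignored = nn.Linear(8, 8)  # stand-in for a module listed in FSDP's ignored_modules
keep_module_in_full_precision(ignored)
y = ignored(torch.randn(2, 8, dtype=torch.float16))  # activations arrive in low precision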
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89971
Approved by: https://github.com/awgu
2022-12-01 14:56:40 +00:00
dfb533ca5b add vjp test with non-contig inputs (#89375)
Ref: https://github.com/pytorch/functorch/issues/1029

We update `test_vjp` to do contiguous and non-contiguous sample testing.

Prev Time: ~32s
New Time: ~50s
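
For context, one simple way to produce the kind of non-contiguous input now exercised (illustrative, not the exact sampling logic in the test):
```python
import torch

x = torch.randn(3, 4)
x_noncontig = x.t()  # transposed view: same data, non-contiguous layout
assert not x_noncontig.is_contiguous()
```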
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89375
Approved by: https://github.com/zou3519
2022-12-01 14:43:30 +00:00
99dac4dd48 Type torch._dynamo.guards (#89919)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89919
Approved by: https://github.com/albanD
2022-12-01 13:43:10 +00:00
e03cde07e4 Guarantee symbol allocation for all sizes/strides/storage offset (#89879)
We may need to express guards on the size/stride/storage offset of
a tensor, but we cannot do this if it's already been duck sized.
This PR guarantees that we allocate a symbol (or negation of the
symbol) whenever we ask to create a SymInt, and propagates this
symbol to SymNode so that Dynamo can look at it (not in this PR).

This PR doesn't actually add guards, nor does Dynamo do anything
with these symbols.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89879
Approved by: https://github.com/albanD
2022-12-01 13:43:10 +00:00
74bcf2b604 Add definitely_not_01 set to ShapeEnv. (#89871)
This set tracks symbols which we know are definitely not 0/1, and thus
can be further simplified when we try to work out their static value
without guards.  Right now, all allocated symbols are in this set,
but we will later add symbols which don't uphold this.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89871
Approved by: https://github.com/albanD
2022-12-01 13:43:10 +00:00
8d333761a9 When dealing with dupe arguments, prefer leafifying if possible (#89896)
See code comment for details. I also had to do some extra fixes:

* `run_functionalized_fw_and_collect_metadata` now is able to handle duplicated arguments
* `aot_wrapper_dedupe` now always returns boxed compiled functions
* `aot_wrapper_dedupe` is now applied to inference compiler along with autograd compiler (preexisting)

Fixes https://github.com/pytorch/torchdynamo/issues/1939
Fixes DebertaV2ForQuestionAnswering DebertaForMaskedLM DebertaForQuestionAnswering DebertaV2ForMaskedLM

Repro command:

```
python benchmarks/dynamo/huggingface.py --performance --float32 -dcuda --training --inductor --no-skip --dashboard --only DebertaForQuestionAnswering --cold_start_latency
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89896
Approved by: https://github.com/bdhirsh
2022-12-01 13:42:29 +00:00
808cb2e86d [FSDP][Dynamo] Define annotation attributes as globals (#89913)
This was separated out from the previous PR to decouple. Since not all builds include `torch.distributed`, we should define the globals in the dynamo file and import to distributed instead of vice versa. Unlike the version from the previous PR, this PR prefixes the globals with `_` to future proof against `_dynamo/` eventually becoming public.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89913
Approved by: https://github.com/wconstab
2022-12-01 13:25:54 +00:00
4095ef8b80 remove torch.equal usages (#89527)
Preparation for the next PR in this stack: #89559.

I replaced

- `self.assertTrue(torch.equal(...))` with `self.assertEqual(..., rtol=0, atol=0, exact_device=True)`,
- the same for `self.assertFalse(...)` with `self.assertNotEqual(...)`, and
- `assert torch.equal(...)` with `torch.testing.assert_close(..., rtol=0, atol=0)` (note that we don't need to set `check_device=True` here since that is the default).

There were a few instances where the result of `torch.equal` is used directly. In those cases I replaced it with `(... == ...).all().item()`, sometimes also dropping the `.item()` depending on the context.
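
A small illustration of the replacement patterns described above (the tensor values are made up):
```python
import torch
from torch.testing import assert_close

a = torch.tensor([1.0, 2.0])
b = a.clone()

# old pattern being removed:
assert torch.equal(a, b)

# new patterns:
assert_close(a, b, rtol=0, atol=0)   # replaces bare `assert torch.equal(...)`
same = (a == b).all().item()         # when the boolean result is used directly
```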

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89527
Approved by: https://github.com/mruberry
2022-12-01 11:22:52 +00:00
0acbcef4ab fix assert_close docstring (#89620)
Two improvements here:

1. To render a bullet list correctly, a blank line before and after is needed. Compare

    ![Screenshot from 2022-11-24 09-34-10](https://user-images.githubusercontent.com/6849766/203732792-18071831-c7d9-4138-9002-e67e29f342fa.png)

    vs.

    ![Screenshot from 2022-11-24 09-34-52](https://user-images.githubusercontent.com/6849766/203732806-1ded7a4b-ca30-46c8-89a2-5c83ea33dbe7.png)

2. #72508 added proper support for meta tensors. Thus, we no longer throw an error if we encounter them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89620
Approved by: https://github.com/kit1980
2022-12-01 11:22:52 +00:00
d72cd4c4e5 document torch.testing.assert_allclose (#89526)
After our failed attempt to remove `assert_allclose` in #87974, we decided to add it to the documentation after all. Although we drop the expected removal date, the function continues to be deprecated in favor of `assert_close`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89526
Approved by: https://github.com/mruberry
2022-12-01 11:22:50 +00:00
4baa78bb1f enable ufmt for torch/testing/*.py (#89525)
I've tried to soft-enforce this manually already, albeit with a line length of 120. This just adds it to the CI. Note that this only applies to `torch/testing/*.py` and thus everything under `torch/testing/_internal/**/*` is *not* affected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89525
Approved by: https://github.com/kit1980
2022-12-01 11:22:48 +00:00
850b53bbee Add more error info for cublasLtMatmul (#89983)
Hit an error at 'cublasLtMatmul' when running bfloat16 for a complicated model. This error info will help debugging and is also useful for future error reporting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89983
Approved by: https://github.com/ngimel
2022-12-01 06:34:13 +00:00
a747326423 Add manual meta implementations to quantize_per_tensor.tensor and co (#89958)
When you are writing a meta function, you cannot call item() on the tensor because there is no real data on the tensor and it will fail. The error message was not very good in this case, see also https://github.com/pytorch/pytorch/issues/89959

This PR takes a brute force approach to resolving the problem: just manually define meta implementations for the naughty functions that are calling item(). However, this results in a lot of code duplication. The easiest way to avoid this situation is to rewrite the decomps so they don't call item. It should not be that difficult to use direct tensors on your operations, as scalar tensors can broadcast too.
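
For context, a tiny illustration of the constraint: a meta tensor carries no data, so `item()` cannot be evaluated on it.
```python
import torch

t = torch.empty((), device="meta")
try:
    t.item()
except Exception as e:
    print(type(e).__name__, e)  # calling item() on a meta tensor raises
```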

I could only test this with `buck test @mode/opt -c python.package_style=inplace //executorch/backends/test:test_backends` in internal with D41555454. Test coverage needs to be improved, otherwise don't blame us when we break you.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89958
Approved by: https://github.com/jerryzh168
2022-12-01 06:04:37 +00:00
f1978b18f9 add mixed data type support for LayerNorm (#81851)
1. If the user uses AMP to run bfloat16 models, `torch.autocast` will
keep module parameters in the accumulation dtype, which leaves `gamma` and `beta`
in float while the input/output are in bfloat16.

2. If the user explicitly casts the model to bfloat16, such as:
```
  x = torch.randn(n, t, c).bfloat16()
  ln = nn.LayerNorm(c).bfloat16()
  y = ln(x)
```
The input/output and gamma/beta will all be in bfloat16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81851
Approved by: https://github.com/ezyang
2022-12-01 04:48:34 +00:00
b6d6c6933e [vision hash update] update the pinned vision hash (#89749)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89749
Approved by: https://github.com/pytorchbot
2022-12-01 04:01:32 +00:00
b399acd2dd [codemod][llvm15] LLVM-15 fixes for caffe2/caffe2/video/video_decoder.cc (#89937)
Summary: This fixes issues which block `caffe2/caffe2/video/video_decoder.cc` from compiling with LLVM-15.

Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D41603386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89937
Approved by: https://github.com/soumith
2022-12-01 03:46:22 +00:00
2f5532a90e [codemod][llvm15] LLVM-15 fixes for caffe2/caffe2/video/video_decoder.h (#89940)
Summary: This fixes issues which block `caffe2/caffe2/video/video_decoder.h` from compiling with LLVM-15.

Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D41603451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89940
Approved by: https://github.com/soumith
2022-12-01 03:39:31 +00:00
cc01614186 [codemod][llvm15] LLVM-15 fixes for caffe2/test/cpp/jit/test_graph_executor.cpp (#89936)
Summary: This fixes issues which block `caffe2/test/cpp/jit/test_graph_executor.cpp` from compiling with LLVM-15.

Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D41603459

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89936
Approved by: https://github.com/soumith
2022-12-01 03:30:31 +00:00
6317311e61 [inductor] Disable parallel compilation inside fbcode (#89926)
Forking python processes using `multiprocessing` doesn't play nicely
with certain aspects of FB infra, so let's disable it until we find a better
solution.

Differential Revision: [D41618774](https://our.internmc.facebook.com/intern/diff/D41618774/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89926
Approved by: https://github.com/desertfire
2022-12-01 02:33:45 +00:00
8d8a215d4c [Vulkan][TCC] Helper functions for vulkan quantized tests (#89922)
Summary: Helper functions for producing random inputs/scale/zero points and also computing suitable scale and zero points of a tensor, used in the testing of quantized ops.

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Reviewed By: kimishpatel

Differential Revision: D41595034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89922
Approved by: https://github.com/digantdesai
2022-12-01 02:10:51 +00:00
a61450726f Minor fix for dynamo xla integration test (#89891)
Fix the test before adding it to the XLA CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89891
Approved by: https://github.com/kit1980, https://github.com/shunting314
2022-12-01 02:10:36 +00:00
4bae860813 quantization: make x86 the default backend (#88799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88799
Approved by: https://github.com/kit1980
2022-12-01 02:09:54 +00:00
0e7918b931 fix mkldnn quantization issue for weight reorder error (#86876)
Differential Revision: [D40351062](https://our.internmc.facebook.com/intern/diff/D40351062)

For the mkldnn quantization path, we do weight prepacking using dummy data to query the expected weight format. The packed weight's format may differ from the real input case (the weight format depends on the input's shape), and a block-weight-to-block-weight reorder happens if the packed weight format differs from the expected weight format. mkldnn may hit the following issue when doing such a reorder (tested on an ICX machine):

```
test_conv_reorder_issue_onednn
    torch.ops.quantized.conv2d(qx, w_packed, output_scale=1.0, output_zero_point=0)
  File "/home/weiwen/.conda/envs/int8-dev/lib/python3.9/site-packages/torch/_ops.py", line 472, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: could not create a primitive descriptor for a reorder primitive
```

This PR fixes it: if the block-weight-to-block-weight reorder fails, we reorder the block weight to a plain weight first, and then reorder the plain weight to the target block weight.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86876
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2022-12-01 02:00:29 +00:00
6372f11d8d RowwiseMoments: use float as acc type for bfloat16 inputs (#84405)
To fix https://github.com/pytorch/pytorch/issues/77507

Originally `utils::RowwiseMoments<BFloat16>` would still accumulate in BFloat16,
which is not only slow but also introduces additional rounding errors.

This patch does the accumulation in float for bfloat16 inputs:
each bfloat16 vec (size 16) is converted to two float vecs (size 8)
and accumulated into the m1 (mean) and m2 (rstd) vecs, which are all float vecs.

No effect on float performance; bfloat16 performance improves:
* avx512 single socket:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.210 ms; bf16: 0.770 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.215 ms; bf16: 0.178 ms
```
* avx512 single core:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.661 ms; bf16: 12.267 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.618 ms; bf16: 2.309 ms
```
* avx2 single socket:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.540 ms; bf16: 2.030 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.527 ms; bf16: 0.458 ms
```
* avx2 single core:
```
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.349 ms; bf16: 19.252 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.416 ms; bf16: 3.524 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84405
Approved by: https://github.com/jgong5
2022-12-01 01:58:59 +00:00
ad1585b4a4 [Checkpoint] Minor update to checkpoint utils (#89964)
Change to only print temp directory once on rank0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89964
Approved by: https://github.com/XilunWu
2022-12-01 01:55:53 +00:00
a43e09c064 Implement gamma cdf (#89955)
Authored by tillahoffmann originally at https://github.com/pytorch/pytorch/pull/72518

Implements the cumulative distribution function for the gamma distribution. The tests needed a small adjustment to pass because gradients cannot be evaluated with respect to the first argument of the incomplete gamma function (and they're not needed for the test).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89955
Approved by: https://github.com/wconstab, https://github.com/malfet
2022-12-01 00:12:53 +00:00
5167108c1a Add device note to the docs of sparse tensor factory functions (#89910)
Fixes #89402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89910
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2022-12-01 00:06:38 +00:00
11db12bd94 Issue 68576 prefetch factor docstring changes (#89874)
Fixes #68576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89874
Approved by: https://github.com/kit1980
2022-11-30 23:42:56 +00:00
7cf0913909 Correct the label for quantization PRs (#89888)
Summary:
att

Test Plan:
NA

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89888
Approved by: https://github.com/andrewor14
2022-11-30 23:06:49 +00:00
1ccaa2a5f7 [EASY] Replace direct use of Guard ctor with make_guard (#89945)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89945
Approved by: https://github.com/albanD
2022-11-30 22:45:09 +00:00
4451eb24e6 Move tensor_parallel out to distributed.tensor folder (#89878)
This PR moves tensor parallel from torch.distributed._tensor.parallel
to torch.distributed.tensor.parallel, to prepare for beta release
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89878
Approved by: https://github.com/fduwjj
2022-11-30 22:13:10 +00:00
8a760ea922 Subscribing janeyx99 to optimizer PRs (#89943)
Adding myself to keep updated with what's up in the world of optimizers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89943
Approved by: https://github.com/albanD
2022-11-30 22:07:32 +00:00
5a82c79024 Small fix for torch._C.Graph type hint (#89821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89821
Approved by: https://github.com/kit1980
2022-11-30 21:48:09 +00:00
dfbc4e5473 [Easy][FSDP] Fix pyre error (#89930)
This PR attempts to fix the following pyre error:

```
Incompatible parameter type [6]: In call
`dist.fsdp.fully_sharded_data_parallel.FullyShardedDataParallel.__init__`,
for 7th parameter `auto_wrap_policy` expected
`Optional[typing.Callable[..., typing.Any]]` but got
`Optional[_FSDPPolicy]`.
```

Besides, this also removes the type inconsistency between the code and the docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89930
Approved by: https://github.com/awgu
2022-11-30 21:33:00 +00:00
0c3537a3c3 Add dynamo smoke tests to CI (#89302)
Add dynamo smoke tests to CI, which checks for python/torch/cuda versions and runs simple dynamo examples on a few backends, including inductor. Smoke tests will run on dynamo and inductor shards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89302
Approved by: https://github.com/malfet
2022-11-30 21:24:45 +00:00
9e4a25c731 [quant][decomposed] Add support for int32 for decomposed q/dq ops (#89881)
Summary:
att

Test Plan:
python test/test_quantization.py -k test_decomposed_quantize_per_tensor
python test/test_quantization.py -k test_decomposed_dequantize_per_tensor

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89881
Approved by: https://github.com/cccclai
2022-11-30 21:24:00 +00:00
62f01e2b26 [FIX][QAT] Switch to use kwargs when args is empty (#89778)
Summary:
When `ref_node.args` is empty, QAT will throw an index-out-of-range error. Here is an example: line 574 uses `tensors = ....` in the torch.cat call, which will be treated as `kwargs`
{F800357376}

f388506954

To fix the issue, we will use the value of the first kwarg if args is empty
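
A small sketch of the situation and the fallback described above (illustrative only; the real fix lives in the QAT code):
```python
import torch
import torch.fx

def f(a, b):
    # calling cat with a keyword argument leaves node.args empty in the traced graph
    return torch.cat(tensors=[a, b])

gm = torch.fx.symbolic_trace(f)
for node in gm.graph.nodes:
    if node.op == "call_function" and node.target is torch.cat:
        # fallback described above: use the first kwarg value when args is empty
        first_input = node.args[0] if node.args else next(iter(node.kwargs.values()))
        print(node.args, node.kwargs, first_input)
```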

Test Plan: f388545532

Reviewed By: bigning, lyoka

Differential Revision: D41396771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89778
Approved by: https://github.com/lyoka, https://github.com/houseroad
2022-11-30 21:15:21 +00:00
0bc19e77d2 [quant][be] Simplify insert_observers_for_model in fx/prepare.py (#89887)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89887
Approved by: https://github.com/andrewor14
2022-11-30 21:09:14 +00:00
76e869c911 [BE] Beef up test_functionalization to test functionalizing multi-parameter functions (#89798)
Previously, `assert_functionalization` only took in uni-Tensor-parameter functions. This PR beefs up the check to allow for functions that take multiple parameters.

This PR also changes the test_instance_norm test to check that the multiparam change works.

## Test plan
Locally tested, CI should also pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89798
Approved by: https://github.com/samdow
2022-11-30 20:46:16 +00:00
4144ad16af add XPU backend to support torch.save and torch.load (#89679)
# Motivation
We need to add an XPU backend to support torch.save and torch.load when the parameter _use_new_zipfile_serialization=False.

# Solution
We give a design that wraps the data as a tensor:
>1. use an in-place copy for H2D, and
>2. directly call tensor.to() for D2H.

This helps us:
>1. unify the generic code for all backends, and
>2. support all the non-CPU device backends.

# Additional Context
No additional UT is needed;
test/test_serialization.py will cover this code change.
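
A usage sketch of what this enables (assumes a PyTorch build with a working "xpu" device; the file name is arbitrary):
```python
import torch

# legacy (non-zipfile) serialization path, now supported for non-CPU backends such as XPU
x = torch.randn(4, 4, device="xpu")
torch.save(x, "x.pt", _use_new_zipfile_serialization=False)
y = torch.load("x.pt", map_location="xpu")
```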

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89679
Approved by: https://github.com/ezyang
2022-11-30 20:38:02 +00:00
6fb8423904 [FSDP] Slightly refactor fx symbolic tracer (#89917)
I made a pass over Linjian's `_symbolic_trace.py` and tidied it up a bit. Aside from simple stylistic changes, this PR makes the following changes:
- Save `visited_params: Set[nn.Parameter]` to avoid linear overhead to check a parameter already being visited when appending to the parameter execution order list (`param_forward_order`)
- Move the tracer patching logic to a class `_ExecOrderTracer` to have a reference to `self.exec_info` without having a fragmented 2-step initialization (like the old `_init_execution_info(root_module)` plus `_patch_tracer(tracer, root_module, execution_info)`)
- Define `_ParamUsageInfo` to formalize the `Tuple[nn.Module, List[str, nn.Parameter]]` elements being mapped to in the execution info `dict`, and clarify the documentation regarding what this represents
- Change the unit test to use `TestCase`, not `FSDPTest`, to avoid initializing a process group

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89917
Approved by: https://github.com/zhaojuanmao, https://github.com/fegin
2022-11-30 20:31:55 +00:00
89769d84eb [FSDP][BE] Move dynamo annotation to separate file (#89890)
This PR makes two minor changes: It (1) moves the recently-added module annotation logic for dynamo support to a separate file `torch/distributed/fsdp/_dynamo_utils.py` and ~~(2) saves the annotated attribute names to global variables `FSDP_MANAGED_MODULE` and `FSDP_USE_ORIG_PARAMS`~~.
Update: Since the distributed package may not be included in some builds, it is not safe to import from `torch.distributed...` to a file in `_dynamo/`. I will not include change (2) in this PR. The alternative is to define those globals (privately) in the dynamo file and import from there in the FSDP file.
- The first change is mainly a personal choice, where I wanted to avoid the dynamo explanation from dominating the FSDP constructor space-wise. I added the `(see function for details)` to the inline comment to forward interested readers.
- The second change follows the custom we have taken in the past for such attributes (e.g. `FSDP_FLATTENED`). My understanding (in the past as well as currently) is that this is a good practice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89890
Approved by: https://github.com/wconstab
2022-11-30 20:29:41 +00:00
76c6dfeaa6 Add layout and blocksize arguments to Tensor.to_sparse method (#89502)
This PR extends the `Tensor.to_sparse()` method to `Tensor.to_sparse(layout=None, blocksize=None)` in a BC manner (`layout=None` means `layout=torch.sparse_coo`).

In addition, the PR adds support for the following conversions:
- non-hybrid/hybrid COO tensor to CSR or CSC or a COO tensor
- short, bool, byte, char, bfloat16, int, long, half CSR tensor to a BSR tensor

and fixes the following conversions:
- hybrid COO to COO tensor
- non-batch/batch hybrid BSR to BSR or BSC tensor
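
A short usage sketch of the extended API described above (whether a given conversion path is available depends on the layouts involved, per the lists above):
```python
import torch

x = torch.eye(4)
coo = x.to_sparse()                                           # unchanged default: sparse COO
csr = x.to_sparse(layout=torch.sparse_csr)                    # new layout argument
bsr = x.to_sparse(layout=torch.sparse_bsr, blocksize=(2, 2))  # blocked layouts take a blocksize
```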

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89502
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2022-11-30 20:21:10 +00:00
f2308b1da6 [MPS] Enable fp16 for linear backward (#89774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89774
Approved by: https://github.com/albanD, https://github.com/malfet
2022-11-30 20:00:32 +00:00
b5ad90932a [jiterator, complex32] lerp : cuda (#75584)
Follows #74748 and #74537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75584
Approved by: https://github.com/anjali411
2022-11-30 19:07:30 +00:00
26054c1607 beef up inplace/view note on copy slices (#89856)
Follow up doc update from https://github.com/pytorch/pytorch/pull/89812
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89856
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2022-11-30 18:35:52 +00:00
b7c42b4066 [FSDP][Easy] ufmt test_fsdp_checkpoint.py (#89916)
I am now in the habit of running `ufmt format test/distributed/fsdp` before committing, and this changed `test_fsdp_checkpoint.py`. I separated this into its own PR. This change should be safe to force-merge to save CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89916
Approved by: https://github.com/mrshenli
2022-11-30 18:31:43 +00:00
6e8e7b9407 Fix binary ios builds (#89929)
curl on CircleCI MacOS runners does not support `--retry-all-errors`

Should fix https://app.circleci.com/pipelines/github/pytorch/pytorch/616842/workflows/5d1162c8-eeae-4627-a1b2-17b493b15b59/jobs/17230369?invite=true#step-105-62

Cleanup after https://github.com/pytorch/pytorch/pull/89157 that were missed by https://github.com/pytorch/pytorch/pull/89298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89929
Approved by: https://github.com/seemethere, https://github.com/atalman
2022-11-30 18:25:47 +00:00
1207b0e474 Update Reviewers for PyTorch Distributed team (#89889)
- Reflect PyTorch Distributed team member change on the merge rule
- Added new team members since 2021
- Removed one member no longer on PyTorch Distributed team
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89889
Approved by: https://github.com/soumith
2022-11-30 17:56:19 +00:00
09f2373ec0 Fix TODOs related to #38095 in test_mps.py (#89815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89815
Approved by: https://github.com/weiwangmeta, https://github.com/kulinseth
2022-11-30 17:00:36 +00:00
f1415b8cb6 Revert "Call _sdp_attention in nn.functional.mha (#89470)"
This reverts commit 4d7ec302202caaf35bb8c997d035c54f0c24e192.

Reverted https://github.com/pytorch/pytorch/pull/89470 on behalf of https://github.com/jeanschmidt due to breaking internal builds
2022-11-30 16:16:24 +00:00
618a585f6c Revert "replace double transpose with single permute in nn.f.mha (#89847)"
This reverts commit b9afa928271dfd6b80ddb2367fa1c4f4aba25fe4.

Reverted https://github.com/pytorch/pytorch/pull/89847 on behalf of https://github.com/jeanschmidt due to Need to revert this commit as it is causing conflict when reverting #89470
2022-11-30 16:03:48 +00:00
a6caa9c54b Add a cpp wrapper for Inductor (#88167)
## Description
Implements https://github.com/pytorch/torchdynamo/issues/1556.
This PR adds a cpp wrapper to invoke the generated kernels. The cpp wrapper is turned off by default and can be turned on by setting:
```python
from torch._inductor import config
config.cpp_wrapper = True
```

### Example
The main part of the generated code:
```python
from torch.utils.cpp_extension import load_inline
wrapper = (
'''
#include <dlfcn.h>
#include <assert.h>
    std::tuple<at::Tensor, at::Tensor> call_0(std::tuple<at::Tensor, at::Tensor> args) {
    at::Tensor arg0_1, arg1_1;
    std::tie(arg0_1, arg1_1) = args;
    auto buf0 = at::empty_strided({8, 8}, {8, 1}, at::ScalarType::Float);
    auto buf1 = at::empty_strided({8, 8}, {1, 8}, at::ScalarType::Float);
    auto kernel0_lib = dlopen("/tmp/torchinductor_user/kn/ckn7ubcn2qbkme2vx5r6antnh5sv6d3o3t6qwdfgfoupnxty6pnm.so", RTLD_NOW);
    assert(kernel0_lib != nullptr);
    void (*kernel0)(const float*,const float*,float*,float*);
    *(void **) (&kernel0) = dlsym(kernel0_lib, "kernel");
    kernel0((float*)(arg0_1.data_ptr()), (float*)(arg1_1.data_ptr()), (float*)(buf0.data_ptr()), (float*)(buf1.data_ptr()));
    arg0_1.reset();
    arg1_1.reset();
    return std::make_tuple(buf0, buf1); }''' )

module = load_inline(
    name='inline_extension_c64wpbccpbre3th2k6oxwrjy5bhvxnmkdxkhcfxlsw7xpsg4eabu',
    cpp_sources=[wrapper],
    functions=['call_0'],
    extra_cflags=['-fPIC -Wall -std=c++14 -Wno-unused-variable -march=native -O3 -ffast-math -fno-finite-math-only -fopenmp'],
    extra_ldflags=['-shared  -lgomp'],
    extra_include_paths=['-I/home/user/pytorch/torch/include -I/home/user/pytorch/torch/include/torch/csrc/api/include -I/home/user/pytorch/torch/include/TH -I/home/user/pytorch/torch/include/THC -I/home/user/miniconda3/envs/pytorch/include/python3.7m'])

def _wrap_func(f):
    def g(args):
        return f(args)
    return g
call = _wrap_func(module.call_0)
```

### Next steps
The below items will be addressed in upcoming PRs.
- [x] Support Reduction: #88561
- [x] Support None: #88560
- [ ] Support ExternKernel
   - [x] ATen GEMM-related OPs: #88667
   - [ ] ATen Conv
   - [ ] Conv/GEMM fusion OPs
- [x] Cache the kernel loading part: #89742
- [ ] De-allocate input buffers when possible by leveraging CPython APIs
- [ ] Support Constant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88167
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2022-11-30 13:40:47 +00:00
5949d5fed5 [FSDP][Easy] Remove internal default arg (#89227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89227
Approved by: https://github.com/mrshenli
2022-11-30 13:34:05 +00:00
7cd6e6acad add bf16 in fp32 out fast path for embeddingbag in caffe2 perfkernel (#89198)
Add a BF16-in FP32-out kernel to the Caffe2 embedding perfkernels, and also update the Python code-gen files to generate the kernel.
The UT will be covered in the next PR (#89199) in this stack (tested via nn.EmbeddingBag with the BF16 data type).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89198
Approved by: https://github.com/jgong5, https://github.com/kit1980
2022-11-30 13:06:13 +00:00
68805b08d1 [benchmarks][dynamo] Trying CI - Set train() for TIMM models accuracy tests (#89780)
Moving to train mode for TIMM models and also raising batch size for accuracy testing.

Raising batch size seems to remove a lot of noise/instability coming from batch_norm decomposition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89780
Approved by: https://github.com/ngimel
2022-11-30 12:57:35 +00:00
969a7d09f6 Revert "[aarch64] add SLEEF dependency for aten_cpu (#89475)"
This reverts commit 3cef87f9fd59adb681d910b8edbc1f33e0be5ad2.

Reverted https://github.com/pytorch/pytorch/pull/89475 on behalf of https://github.com/jeanschmidt due to breaking internal builds
2022-11-30 12:06:18 +00:00
4cc5be3a06 Revert "Add bits tensor types (#88594)"
This reverts commit f3b1315eee92ac108f9ceacafaf4ad560c78769d.

Reverted https://github.com/pytorch/pytorch/pull/88594 on behalf of https://github.com/jeanschmidt due to breaking internal builds
2022-11-30 11:37:56 +00:00
296e1ba4d0 Row and column select support for block compressed sparse tensors (#88733)
As in the title:

- Support `select` and `select_copy` on block sparse compressed tensors
- Fixes incorrect results when selecting dense dimensions

The PR also improves the performance of indexing sparse compressed tensors considerably:

<details>

Before:

```python
In [3]: a=torch.rand((1000, 1000)).to_sparse_csr()

In [4]: %timeit a.select(0, 0)
606 µs ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit a.select(1, 0)
527 µs ± 57.7 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]: %timeit a[0, 0]
617 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: a = a.cuda()

In [8]: %timeit a.select(0, 0); torch.cuda.synchronize();
1.19 ms ± 137 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: %timeit a.select(1, 0); torch.cuda.synchronize();
1.2 ms ± 119 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [10]: %timeit a[0, 0]; torch.cuda.synchronize();
1.23 ms ± 482 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

This PR:

```python
In [3]: a=torch.rand((1000, 1000)).to_sparse_csr()

In [4]: %timeit a.select(0, 0)
4.75 µs ± 8.94 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [5]: %timeit a.select(1, 0)
565 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]: %timeit a[0, 0]
13.1 µs ± 435 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: a = a.cuda()

In [8]: %timeit a.select(0, 0); torch.cuda.synchronize();
21.6 µs ± 23.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [9]: %timeit a.select(1, 0); torch.cuda.synchronize();
1.15 ms ± 3.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [10]: %timeit a[0, 0]; torch.cuda.synchronize();
63.7 µs ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88733
Approved by: https://github.com/nikitaved, https://github.com/amjames, https://github.com/cpuhrsch
2022-11-30 11:15:56 +00:00
0cc0e5ef65 [PT-D][Checkpoint]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests (#87987)
This PR includes:

- Changes from @kumpera (https://github.com/pytorch/pytorch/pull/86327): adds a multi-threaded FileSystemWriter for distributed checkpointing, with two new knobs on FileSystemWriter: thread_count and per_thread_copy_ahead (see the sketch below). This yields up to a 50% performance improvement on 32-GPU workloads on AWS.
- Add parametrized tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py.
- Modify @with_comms in ShardedTensorTestBase to take in *args and **kwargs.
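
A construction sketch for the new knobs (the import path and the companion save call are assumptions; only the two keyword arguments come from this PR):
```python
import torch
import torch.distributed.checkpoint as dist_cp  # import path assumed

# assumes any required process-group setup has already been done
state_dict = {"weight": torch.randn(4, 4)}
writer = dist_cp.FileSystemWriter(
    "/tmp/ckpt",
    thread_count=4,                    # number of writer threads (new knob)
    per_thread_copy_ahead=10_000_000,  # bytes staged ahead per thread (new knob)
)
dist_cp.save_state_dict(state_dict=state_dict, storage_writer=writer)  # assumed companion API
```
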
Tests:

```
python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py
```

test/distributed/checkpoint/test_file_system_checkpoint.py (GPU tests) runs fine locally but would time out on CI. We will use a thread-based PG and update this test in a following PR.

[T134844615]

## Add docstring and update comments in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87987
Approved by: https://github.com/fduwjj
2022-11-30 08:19:41 +00:00
87d18cf0e7 fix RowwiseMoments vectorization issue on CPU (#84404)
Originally `cpu/moments_utils.h` used the at::native::utils namespace.
This file contains `Vectorized<>`; in order to make it properly vectorized
on different archs, it needs to use an anonymous namespace or an inline namespace.
Otherwise it would be linked to the scalar version of the code.

This PR fixes the vectorization issue in `RowwiseMoments`, which is used to calculate `mean` and `rstd` in norm layers.
Benchmark data attached below; generally fp32 gets a 2-3x speedup and bf16 gets a larger one.

This patch will improves layer_norm (input size 32x128x1024) float32 inference:
* avx512 single socket: 2.1x
```bash
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.439 ms; bf16: 2.479 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.210 ms; bf16: 0.770 ms
```
* avx512 single core: 3.2x
```bash
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 6.308 ms; bf16: 39.765 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 2.661 ms; bf16: 12.267 ms
```
* avx2 single socket: 2.3x
```bash
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 1.248 ms; bf16: 8.487 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 0.540 ms; bf16: 2.030 ms
```
* avx2 single core: 2.5x
```bash
before: LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 10.792 ms; bf16: 66.366 ms
after:  LayerNorm((1024,), eps=1e-05, elementwise_affine=True) : 32x128x1024: fp32: 4.349 ms; bf16: 19.252 ms
```

Attached some original VTune profiling results here to further indicate the issue:

1. original bottlenecks
![master_bottleneck](https://user-images.githubusercontent.com/20233731/180125611-deed41b7-dd2e-4437-a7d9-6ad0096e5850.png)

we can see `RowwiseMomentsImpl<>` takes the majority of the runtime here.

2. Instruction level breakdown of `RowwiseMomentsImpl<>`
![rowwise_momentum_impl](https://user-images.githubusercontent.com/20233731/180125759-a3b48bc4-8e54-4219-92b4-defde5e86046.png)

we can see it's all **scalar** instructions here.

3. after the fix, the bottlenecks
![fixed_bottleneck](https://user-images.githubusercontent.com/20233731/180125880-8d08eb1b-af09-4f80-ae58-80215365d407.png)

getting better.

4. after the fix, Instruction level breakdown of `RowwiseMomentsImpl<>`
![fixed_rowwsie_momentum_impl](https://user-images.githubusercontent.com/20233731/180125989-b45db4ad-e6ed-460a-8d51-74fbeecf8b02.png)

now it is all **vectorized** instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84404
Approved by: https://github.com/jgong5
2022-11-30 07:55:47 +00:00
92f08f09d8 Vectorize erf (#89837)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89837
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2022-11-30 06:42:36 +00:00
009dd3c4af [PT-D][Tensor Parallel] Add more test cases when we use use_orig_params for FSDP wrapping (#89779)
Differential Revision: [D41600656](https://our.internmc.facebook.com/intern/diff/D41600656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89779
Approved by: https://github.com/wanchaol
2022-11-30 06:34:58 +00:00
011452a2a1 Dynamo, FX, Inductor Progress Bars (#88384)
There are 3 progress bars, each gated behind its own config, all off by default for now:
1. Dynamo: Macro level config for dynamo, AOT, inductor
2. FX: Progress bar for each pass, with their names
3. Inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88384
Approved by: https://github.com/wconstab, https://github.com/mlazos
2022-11-30 06:07:14 +00:00
d88b555577 [Dynamo] Fix source/reconstruction bugs in NNModule named_* calls (#89729)
Fixes https://github.com/pytorch/torchdynamo/issues/1931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89729
Approved by: https://github.com/ezyang
2022-11-30 06:05:47 +00:00
447283752c Update DDP docs for Dynamo/DDPOptimizer (#89096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89096
Approved by: https://github.com/msaroufim
2022-11-30 05:50:12 +00:00
12f98f85bc [dtensor] update README (#89800)
This PR updates README to include the RFC details
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89800
Approved by: https://github.com/mrshenli
2022-11-30 04:35:32 +00:00
b09efae3bc update subscriber list (#89799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89799
Approved by: https://github.com/mrshenli
2022-11-30 04:35:32 +00:00
f4707ae004 Add arguments to collect_results (#89611)
Fixes https://github.com/pytorch/torchdynamo/issues/1901. Test script:
```python
import copy

import torch
import torch._dynamo as dynamo
import torch._dynamo.config

dynamo.config.repro_after = "dynamo"
dynamo.config.repro_level = 4

def custom_backend(gm: torch.fx.GraphModule, example_inputs):
    gm = copy.deepcopy(gm)
    for node in gm.graph.nodes:
        if len(node.args) > 1:
            node.target = torch.add
            node.args = (node.args[0], 0)
    gm.recompile()
    return gm

inp = torch.ones(5)
inp.requires_grad_(True)

@dynamo.optimize(custom_backend)
def foo(x):
    x = x * x
    return x.sum()

y = foo(inp)
print(y)
y.backward()
print(inp.grad)
```
Before this change, the script finishes but outputs an incorrect gradient. After the change, the accuracy minifier is triggered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89611
Approved by: https://github.com/ezyang
2022-11-30 04:25:33 +00:00
ce17bb95fc [FSDP] Include module classes in ModuleWrapPolicy.__repr__ (#89058)
Before:
```
<torch.distributed.fsdp.wrap.ModuleWrapPolicy object at 0x7fd4280f0fd0>
```
After:
```
<torch.distributed.fsdp.wrap.ModuleWrapPolicy object at 0x7fd4280f0fd0>({<class 'transformers.models.t5.modeling_t5.T5Block'>})
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89058
Approved by: https://github.com/mrshenli
2022-11-30 02:27:02 +00:00
c8aaad040e [FSDP] Limit all gather after pre-unshard (#89057)
To reuse memory when allocating the unsharded `FlatParameter` in the unshard stream, we only need to block the CPU thread on the preceding free event (i.e. `event.synchronize()`) before allocating the unsharded memory, which happens in `handle.unshard()`. Notably, this can be done after the pre-unshard logic, which at most performs _sharded_ allocations (low precision shard or H2D sharded `FlatParameter` copy) in its own pre-unshard stream. This enables the pre-unshard to overlap with any pending ops.

With this change, I believe that we should use `limit_all_gathers=True` all the time to stay true to FSDP's proposed memory semantics.

If a user wants to set `limit_all_gathers=False`, that would mean that he/she wants to overlap ops that are issued after the unshard logic's all-gather with ops that are pending at the time when FSDP _would_ block the CPU thread via `event.synchronize()`.
- If the user is willing to not reuse memory for that all-gather, then the user may as well have applied `NO_SHARD` and optionally ZeRO-1 (if this niche is important, then maybe we should consider hardening ZeRO-1). This is because now the unsharded memory for the all-gather additionally contributes to peak memory since it cannot reuse memory.
- If the user wanted to reuse memory for that all-gather, then we needed to block the CPU thread. There is no way around that given the caching allocator semantics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89057
Approved by: https://github.com/mrshenli
2022-11-30 02:27:02 +00:00
56b3ad78e1 [Checkpoint][2D][5/N] Add checkpoint_utils for distributed checkpoint to testing/_internal/distributed/ (#89873)
Moving checkpoint_utils from Tau: 6acf4054cf/spmd/testing/checkpoint_utils.py

Checkpoint_utils: add a wrapper to initialize a temp directory for checkpoint testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89873
Approved by: https://github.com/XilunWu, https://github.com/awgu, https://github.com/fduwjj
2022-11-30 02:23:30 +00:00
be80b72add [FSDP] Remove unneeded stream sync from clip_grad_norm_() (#89308)
We do not need to have the pre-unshard and unshard streams wait for the computation stream because we are not using the pre-unshard or unshard streams in `clip_grad_norm_()`.

The other change is simply avoiding a loop to get `grads`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89308
Approved by: https://github.com/mrshenli
2022-11-30 02:14:09 +00:00
90bed8874f Generator of tensor inputs with variable layout and structure (batch/non-batch, hybrid/non-hybrid, block/non-block) (#88914)
This PR introduces `TestCase.generate_simple_inputs` method that is an improved and generalized version of the `TestSparseCompressed._generate_small_inputs` method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88914
Approved by: https://github.com/cpuhrsch
2022-11-30 02:13:33 +00:00
275ade6371 Enable rsqrt (#89771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89771
Approved by: https://github.com/anijain2305
2022-11-30 02:08:13 +00:00
2d32e5dd09 add env/config flag to disable dynamo (#89828)
as title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89828
Approved by: https://github.com/anijain2305
2022-11-30 01:59:44 +00:00
a70082a863 [functorch] Move cond.py to _cond.py and expose cond() under functorch.experimental.control_flow. (#89819)
Summary:
Similar to https://github.com/pytorch/pytorch/pull/88767 we want to reduce the chance that users
accidentally import private functions from `functorch.experimental.cond` as if they were public
interfaces. We also move `cond()` under `control_flow.py` to stay consistent with `map()` op.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89819
Approved by: https://github.com/zou3519
2022-11-30 01:50:44 +00:00
d1760d7a42 [FSDP][Easy] Remove outdated TODO (#89217)
**Overview**
This PR removes an outdated TODO:
```
# TODO (awgu): When exposing the original parameters, we need to also
# use this attribute to prevent re-synchronizing parameters.
```

**Justification**
We only pass `managed_params` to `_sync_module_params_and_buffers()`, where `managed_params` is defined as
```
managed_params = list(_get_orig_params(root_module, state._ignored_params))
```
This `_get_orig_params()` call excludes parameters already flattened by FSDP. Thus, `_sync_module_params_and_buffers()` will not re-sync already-synchronized parameters. Each parameter appears in `managed_params` for some FSDP instance exactly once and hence is only synchronized once.
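For illustration, a hedged sketch of what an `_get_orig_params()`-style filter could look like (not the actual FSDP source; the `FlatParameter` import path is an assumption):

```python
from torch.distributed.fsdp.flat_param import FlatParameter

def get_orig_params_sketch(root_module, ignored_params):
    """Yield parameters that are neither ignored nor already flattened by FSDP."""
    ignored_ids = {id(p) for p in ignored_params}
    for param in root_module.parameters():
        if id(param) in ignored_ids or isinstance(param, FlatParameter):
            continue
        yield param
```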
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89217
Approved by: https://github.com/mrshenli
2022-11-30 01:42:16 +00:00
1a33b7cbfa Make fake tensors preserve dense strides in type conversion (#89803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89803
Approved by: https://github.com/ngimel
2022-11-30 01:28:51 +00:00
9c8a94bf90 [checkpoint] Improve test (test_nested_dict.py) (#89854)
Improve the test_nested_dict.py test:
1. Add comments to show the flatten_dict and mapping results.
2. Update the test_mapping unit test to ensure the key-value pairs match in the mapping.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89854
Approved by: https://github.com/H-Huang
2022-11-30 01:13:32 +00:00
cefece3726 Fix typo in filesystem.py (#89849)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89849
Approved by: https://github.com/H-Huang
2022-11-30 01:06:58 +00:00
5a79144a79 [dashboard] Fix flag compilers (#89853)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89853
Approved by: https://github.com/williamwen42
2022-11-30 01:02:36 +00:00
59a2fe74d4 [CI] Add TorchTriton conda packages (#89841)
As we need them to make triton available on both platforms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89841
Approved by: https://github.com/msaroufim
2022-11-30 01:01:59 +00:00
24b3b73c98 [Caffe2] Fix merge logic bug (#89551)
Summary: `ExprGroup::getMergeCandidates()` had a logic bug. The vector being initialized had its arguments mis-ordered. This didn't trigger a build warning because the warning about implicit cast from an integral type to `bool` wasn't enabled.

Test Plan: `buck test fbsource//arvr/mode/win/vs2019/cuda11/opt fbsource//arvr/mode/hybrid_execution //arvr/libraries/neural_net_inference/TorchScript/...`

Differential Revision: D41488939

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89551
Approved by: https://github.com/davidberard98, https://github.com/jjsjann123
2022-11-30 01:01:49 +00:00
55789b40ef Remove beauby and dzdang from CODEOWNERS (#89811)
GitHub linter complained because the users are no longer on the project.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89811
Approved by: https://github.com/weiwangmeta
2022-11-30 01:01:24 +00:00
693135a9b8 [inductor] Add aten._native_batch_norm_legit to decomposition (#89843)
Summary: Seeing a lot of fallback warnings when running dm_nfnet_f0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89843
Approved by: https://github.com/eellison
2022-11-30 00:58:36 +00:00
3d47c74cfe Update code style for optimizer code (#89862)
Separating out whitespace-only changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89862
Approved by: https://github.com/albanD, https://github.com/soumith
2022-11-30 00:53:05 +00:00
8ca09dda42 [quant][docs] Move some of the descriptions out of codeblock (#89795)
Summary:
This is to make sure the description texts wrap as normal prose around the code, instead of being displayed as a single line.

Test Plan:
visual inspections

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89795
Approved by: https://github.com/andrewor14
2022-11-30 00:32:27 +00:00
fcb5d6e771 Enable instance norm running mean test (#89793)
Followup action to https://github.com/pytorch/pytorch/pull/88697
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89793
Approved by: https://github.com/bdhirsh
2022-11-29 23:45:56 +00:00
c599cf24ad [FSDP] Another fix for DTensor, use_orig_params=True (#89845)
The issue for `test_2d_parallel.py` is that `DTensor` does not support the idiom `param.data = view` where `view` is a `DTensor`. To work around this, we do not preserve the parameter variable `param` and instead create a new parameter variable altogether via `nn.Parameter(view)`. Preserving the parameter variable when unsharded was not a strict requirement -- it just made sense to do that if we are already doing that when _sharded_, where it _is_ a strict requirement to support the optimizer step. The sharded case is not an issue for 2D because sharded implies local tensor, not `DTensor`.
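A hedged sketch of the workaround (illustrative helper, not the actual FSDP code): rather than mutating `param.data` with a `DTensor` view, rebuild the parameter variable.

```python
import torch.nn as nn

def use_unsharded_view(module: nn.Module, name: str, view) -> None:
    # `view` may be a DTensor; `param.data = view` is unsupported in that case,
    # so we do not preserve the old parameter variable and create a new one instead.
    setattr(module, name, nn.Parameter(view, requires_grad=view.requires_grad))
```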
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89845
Approved by: https://github.com/zhaojuanmao
2022-11-29 22:29:41 +00:00
b9afa92827 replace double transpose with single permute in nn.f.mha (#89847)
# Summary

I forgot about permute which was exactly what I wanted. Quick perf bump
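For illustration (arbitrary shapes, not the mha code): two chained transposes and a single permute produce the same result, so the permute saves one view op.

```python
import torch

x = torch.randn(2, 3, 4)
assert torch.equal(x.transpose(0, 1).transpose(1, 2), x.permute(1, 2, 0))
```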
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89847
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2022-11-29 22:18:42 +00:00
8713119c89 Stream actually overrides __new__ so we need to patch it as well (#89592)
Avoids
```
$ python foo.py
Traceback (most recent call last):
  File "foo.py", line 3, in <module>
    a = torch.cuda.Stream()
  File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/streams.py", line 34, in __new__
    return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
TypeError: object.__new__() takes exactly one argument (the type to instantiate)
```
And now gets
```
$ python foo.py
Traceback (most recent call last):
  File "foo.py", line 3, in <module>
    a = torch.cuda.Stream()
  File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/streams.py", line 34, in __new__
    return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
  File "/home/albandes/local/pytorch/3.8_debug_source/torch/cuda/_utils.py", line 44, in err_fn
    raise RuntimeError(
RuntimeError: Tried to instantiate dummy base class Stream

```
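A hedged sketch of the resulting pattern (illustrative class, not the exact `torch.cuda._utils` code): a dummy placeholder must override both `__init__` and `__new__`, since `torch.cuda.Stream` overrides `__new__` and would otherwise fall through to `object.__new__` with extra arguments.

```python
class _DummyStream:
    def __new__(cls, *args, **kwargs):
        raise RuntimeError(f"Tried to instantiate dummy base class {cls.__name__}")

    def __init__(self, *args, **kwargs):
        raise RuntimeError(f"Tried to instantiate dummy base class {type(self).__name__}")
```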
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89592
Approved by: https://github.com/soumith
2022-11-29 21:43:23 +00:00
a029ec2c88 Move gpu slow tests to sm86 (#87880)
NVFuser tests (which are slow tests) would be better to run on more
modern GPU hardware.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87880
Approved by: https://github.com/malfet
2022-11-29 19:29:59 +00:00
991028cd9f Deprecating DataPipes (#89794)
Summary: per title

Test Plan:
`buck2 test //caffe2/test:datapipe` https://www.internalfb.com/intern/testinfra/testconsole/testrun/6473924589747074/
`buck2 test mode/opt //pytorch/data/test:tests`

Differential Revision: D41563765

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89794
Approved by: https://github.com/wenleix, https://github.com/NivekT
2022-11-29 19:21:53 +00:00
6c1fb3f21d Don't unsafely clone autograd meta (#89720)
Addresses this CR comment https://github.com/pytorch/pytorch/pull/88817/files#r1024618045

This appears to fix Dynamo+DDP+hf_BERT test but I don't
know how to make a minimum reproducer.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89720
Approved by: https://github.com/soumith, https://github.com/bdhirsh, https://github.com/malfet
2022-11-29 18:59:34 +00:00
02e2eaa9c6 Fix CopySlices logic to ensure wrapped node runs properly. (#89812)
This should remove the failures seen by https://github.com/pytorch/pytorch/pull/89720 in functionalization
Locally verified that running the following on top of this PR does pass: `python benchmarks/dynamo/huggingface.py --accuracy --backend aot_eager --training --only MobileBertForMaskedLM`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89812
Approved by: https://github.com/soumith, https://github.com/voznesenskym, https://github.com/ezyang
2022-11-29 18:44:28 +00:00
8314d403a6 [test_nn] split multihead_attention from test_nn (#89748)
Ref: https://github.com/pytorch/pytorch/issues/63085
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89748
Approved by: https://github.com/albanD
2022-11-29 18:15:18 +00:00
fb47a66989 [Quant][docs] Use get_default_qconfig_mapping (#87299)
Summary: The recommended way to use QConfigMapping is through
`get_default_qconfig_mapping`. However, the docs still references
usages that use `QConfigMapping().set_global(...)`. This doesn't
actually work well in practice when the model has fixed qparams
ops for example. This commit updates these usages.

Reviewers: vkuzo

Subscribers: vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87299
Approved by: https://github.com/jerryzh168
2022-11-29 18:08:16 +00:00
2bce6d09ee [Quant][fx][bc-breaking] Remove backend_config_utils.py (#89810)
Summary: Previously under torch/ao/quantization we have
backend_config/utils.py and fx/backend_config_utils.py, which
was confusing. This commit deletes the latter and moves
everything there to more suitable util files.

BC-breaking note: The following public APIs under the
`torch.ao.quantization.fx.backend_config_utils` namespace
are removed in this commit.

```
get_quantize_handler_cls
get_fusion_pattern_to_fuse_handler_cls
get_native_quant_patterns
get_pattern_to_quantize_handlers
```

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Reviewers: jerryzh168, vkuzo

Subscribers: jerryzh168, vkuzo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89810
Approved by: https://github.com/jerryzh168
2022-11-29 18:01:40 +00:00
e1dbd9a288 Revert "[GHA] Decrease Windows test timeout to 120 minutes (#89694)"
This reverts commit faa032c5e58502de6ea461e531109d2acc22e56a.

Reverted https://github.com/pytorch/pytorch/pull/89694 on behalf of https://github.com/clee2000 due to broke periodic b/c they take ~2.5 hrs, also broke mem leak check b/c its slow, should probably look into having this be a parameter
2022-11-29 17:55:43 +00:00
6e2da426f0 [FSDP] Relax post-backward assert (#89791)
This assert was accidentally made stricter when transitioning from per-FSDP-instance training state to per-handle training state. This PR relaxes it again, which should restore compatibility for some reentrant AC plus FSDP cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89791
Approved by: https://github.com/zhaojuanmao
2022-11-29 17:25:56 +00:00
218d9c6e09 Revert "Move functorch/_src to torch/_functorch (#88756)"
This reverts commit 52bc5c1cfe098fd4b4b13902b4fea83b455b9773.

Reverted https://github.com/pytorch/pytorch/pull/88756 on behalf of https://github.com/clee2000 due to broke imports in tests 52bc5c1cfe https://github.com/pytorch/pytorch/actions/runs/3574742513/jobs/6010814968 probably a landrace
2022-11-29 17:17:11 +00:00
086b251f9a [follow-up] Python Attr Serialization (#88913)
Ref: https://github.com/pytorch/pytorch/pull/81616#issuecomment-1307595402
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88913
Approved by: https://github.com/albanD
2022-11-29 16:46:20 +00:00
2f9ec226e4 don't run input mutation analysis in dynamo (#89760)
Right now we're running the analysis pass and then discarding the result. Instead, we should just stop running the analysis pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89760
Approved by: https://github.com/soumith, https://github.com/ezyang
2022-11-29 16:40:06 +00:00
3cef87f9fd [aarch64] add SLEEF dependency for aten_cpu (#89475)
Reviewed By: kimishpatel, dmm-fb

Differential Revision: D41350031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89475
Approved by: https://github.com/kimishpatel, https://github.com/ezyang
2022-11-29 15:17:58 +00:00
c6ede0bdfc [Quant][docs] Fix BackendConfig example in docstring/README (#89319)
Summary: The example in the BackendConfig docstring and the README
was not runnable. This fixes a typo (`bias_type` -> `bias_dtype`),
removes the call to an internal helper function, and adds an
additional BackendPatternConfig to make the example BackendConfig
more realistic and useful.

Reviewers: jerryzh168, vkuzo

Subscribers: jerryzh168, vkuzo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89319
Approved by: https://github.com/jerryzh168
2022-11-29 15:11:40 +00:00
52bc5c1cfe Move functorch/_src to torch/_functorch (#88756)
This will be the last disruptive functorch internals change.

Why are we moving these files?
- As a part of rationalizing functorch we are moving the code in
functorch/_src to torch/_functorch
- This is so that we can offer the functorch APIs as native PyTorch APIs
(coming soon) and resolve some internal build issues.

Why are we moving all of these files at once?
- It's better to break developers all at once rather than many times

Test Plan:
- wait for tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88756
Approved by: https://github.com/ezyang
2022-11-29 13:55:42 +00:00
620994cd7a Guard the boundary of index computed in compute_source_index_and_lambda (#89252)
Improve the fix in https://github.com/pytorch/pytorch/pull/89210
See discussion in https://github.com/pytorch/pytorch/issues/89212#issuecomment-1318911969
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89252
Approved by: https://github.com/mingfeima, https://github.com/weiwangmeta
2022-11-29 13:55:22 +00:00
93772305d9 [PyTorch Edge] Set training for module only (#89488)
Update the previous recursive logic.

Continue setting the training attribute only if the slot is an object and a module.

The corresponding JIT module gets the module list first and sets modules one by one; there is a method to get all modules iteratively instead of recursively.

This change patches one fix to set the training attribute for `model_f269583363.ptl`. Another patch is needed, because the current lite interpreter doesn't have the correct type when loading an object with setstate.

Differential Revision: [D41466417](https://our.internmc.facebook.com/intern/diff/D41466417/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89488
Approved by: https://github.com/iseeyuan
2022-11-29 13:49:44 +00:00
a78467f3df Refactoring to share vectorization code for int8/uint8. (#89650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89650
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/peterbell10
2022-11-29 12:47:48 +00:00
8226a5d383 [minifier] Continue on assertion for accuracy minification (#89739)
During accuracy minification, the minifier can create graphs which cause assertion failures. This PR catches such assertions and lets the minifier move on, instead of getting stuck minifying this issue.

It is possible that such graphs point to some real-although-unrelated issue, so the assertion is printed to flag it and allow debugging if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89739
Approved by: https://github.com/mlazos
2022-11-29 07:49:07 +00:00
40dd03eeaa [dynamo] Don't copy the graph during checkpointing (copy_graphstate) (#89232)
copy_graphstate is called a ton; this makes copy_graphstate a lot faster and helps with https://github.com/pytorch/torchdynamo/issues/1803

Tag each graph node with a timestamp; when checkpointing, store the current timestamp; when restoring, remove nodes created after the timestamp stored in the state. This essentially has the same behavior as the original impl, just without copying the whole graph.
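A hedged sketch of the timestamp scheme (illustrative wrapper over an FX-style graph, not dynamo's actual code):

```python
class TimestampedGraph:
    """Checkpoint/restore by timestamps instead of copying the whole graph."""

    def __init__(self, graph):
        self.graph = graph          # e.g. a torch.fx.Graph
        self.timestamp = 0

    def create_node(self, *args, **kwargs):
        node = self.graph.create_node(*args, **kwargs)
        node.meta["creation_timestamp"] = self.timestamp
        return node

    def checkpoint(self) -> int:
        # Cheap: just remember an integer, no graph copy.
        saved = self.timestamp
        self.timestamp += 1
        return saved

    def restore(self, saved: int) -> None:
        # Erase nodes created after the checkpoint (newest first, so users go first).
        for node in reversed(list(self.graph.nodes)):
            if node.meta.get("creation_timestamp", 0) > saved:
                self.graph.erase_node(node)
```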

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89232
Approved by: https://github.com/jansel
2022-11-29 07:19:02 +00:00
91899a9ebd add memory_tracker tool to help profiling memory usages (#88825)
Adds a memory_tracker API to show operator-level memory traces for the allocated_memory, active_memory, and reserved_memory stats; it also gives a summary of the top 20 operators that generate memory.

The implementation mainly uses TorchDispatchMode and module hooks to collect traces and add markers.
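A minimal sketch of the dispatch-mode part of that approach (assuming `TorchDispatchMode` from `torch.utils._python_dispatch`; not the memory_tracker implementation itself):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class MemoryTraceMode(TorchDispatchMode):
    """Record CUDA allocated/reserved memory after every dispatched op."""

    def __init__(self):
        super().__init__()
        self.traces = []  # (op name, allocated MB, reserved MB)

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        out = func(*args, **(kwargs or {}))
        if torch.cuda.is_available():
            self.traces.append(
                (str(func),
                 torch.cuda.memory_allocated() / 2**20,
                 torch.cuda.memory_reserved() / 2**20)
            )
        return out

# with MemoryTraceMode() as mode:
#     model(inputs)   # module hooks could add forward/backward markers here
```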

Will add follow-up PRs:
1. allow tracing more than 1 iteration
2. dump json data for visualization
3. add unit test for DDP training
4. add unit test for FSDP training
5. add unit test for activation checkpointing + DDP/FSDP training
6. add traces for activation memories and top operators that generate activation memories
7. print summaries for more breakdowns like model size, optimizer states, etc
8. add traces for temporary memories or memories consumed by cuda streams or nccl library if possible
9. connect the tool with OOM memory debugging
10. add dynamic programming (dp) algorithm to find best activation checkpointing locations based on the operator level activation memory traces
11. add same traces & dp algorithm for module level memory stats, as FSDP wrapping depends on module level memories, for some model users/not model authors, if they have to apply activation checkpointing on module level, they need module level memory traces as well

======================================================

Current test result for the memory_tracker_example.py on notebook:

Top 20 ops that generate memory are:
bn1.forward.cudnn_batch_norm.default_0: 98.0009765625MB
maxpool.forward.max_pool2d_with_indices.default_0: 74.5MB
layer1.0.conv1.backward.max_pool2d_with_indices_backward.default_0: 49.0MB
layer1.0.bn1.forward.cudnn_batch_norm.default_1: 24.5009765625MB
layer1.0.bn2.forward.cudnn_batch_norm.default_2: 24.5009765625MB
layer1.1.bn1.forward.cudnn_batch_norm.default_3: 24.5009765625MB
layer1.1.bn2.forward.cudnn_batch_norm.default_4: 24.5009765625MB
layer1.2.bn1.forward.cudnn_batch_norm.default_5: 24.5009765625MB
layer1.2.bn2.forward.cudnn_batch_norm.default_6: 24.5009765625MB
layer1.0.conv1.forward.convolution.default_1: 24.5MB
layer1.0.conv2.forward.convolution.default_2: 24.5MB
layer1.1.conv1.forward.convolution.default_3: 24.5MB
layer1.1.conv2.forward.convolution.default_4: 24.5MB
layer1.2.conv1.forward.convolution.default_5: 24.5MB
layer1.2.conv2.forward.convolution.default_6: 24.5MB
maxpool.backward.threshold_backward.default_32: 23.5MB
layer2.0.downsample.backward.convolution_backward.default_26: 12.2802734375MB
layer2.0.bn1.forward.cudnn_batch_norm.default_7: 12.2509765625MB
layer2.0.bn2.forward.cudnn_batch_norm.default_8: 12.2509765625MB
layer2.0.downsample.1.forward.cudnn_batch_norm.default_9: 12.2509765625MB

<img width="1079" alt="Screen Shot 2022-11-10 at 10 03 06 AM" src="https://user-images.githubusercontent.com/48731194/201172577-ddfb769c-fb0f-4962-80df-92456b77903e.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88825
Approved by: https://github.com/awgu
2022-11-29 06:42:57 +00:00
7ec7a82082 Test FSDP with submodule non-reentrant checkpointing (#89781)
When combining FSDP with reentrant checkpointing, the post-backward
hook might run twice and then hit [this
error](e20ec44544/torch/distributed/fsdp/_runtime_utils.py (L487)).
This is because reentrant backward uses nested autograd GraphTasks.
The inner GraphTask is not aware of the outer one and therefore
will flush pending `AccumulateGrad` invocations on exit, which in
turn triggers the post backward hooks registered by FSDP. Later,
the outer GraphTask will trigger that again, leading to the above
error.

PR #89791 relaxes the FSDP training state check, but we still run
into grad value check failures occasionally. Therefore, this PR only
lands the non-reentrant test; we can enable the reentrant test when
the accuracy issues are addressed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89781
Approved by: https://github.com/rohan-varma
2022-11-29 05:34:34 +00:00
705ad36cc5 Dynamo asserts FSDP wrapped modules use_orig_param (#89523)
- This is a strict requirement given the way dynamo+FSDP is implemented,
  but isn't convenient to assert.
- By plumbing use_orig_param field on all wrapped modules, we can
  do this assertion inside dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89523
Approved by: https://github.com/awgu
2022-11-29 05:27:23 +00:00
7860fcc245 Enable DDPOptimizer by default in dynamo (#88523)
Performance benchmarks on 6 popular models (hf_Bert, hf_T5_large, hf_T5, hf_GPT2_large, timm_vision_transformer, resnet50) on 1-64 GPUs compiled with torchinductor show performance gains or parity with eager, and showed regressions without DDPOptimizer. *Note: resnet50 with a small batch size shows a regression with the optimizer, in part due to failing to compile one subgraph because of input mutation; this will be fixed.

Correctness checks are implemented in CI (test_dynamo_distributed.py), via single-GPU benchmark scripts iterating over many models (benchmarks/dynamo/torchbench.py, timm_models.py, huggingface.py), and via [multi-gpu benchmark scripts in torchbench](https://github.com/pytorch/benchmark/tree/main/userbenchmark/ddp_experiments).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88523
Approved by: https://github.com/davidberard98
2022-11-29 05:27:06 +00:00
9048cf16fe Move Dynamo docs back to core (#89769)
With contributions from @svekars and @malfet

Waiting for doc build job to complete
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89769
Approved by: https://github.com/soumith, https://github.com/malfet
2022-11-29 04:38:53 +00:00
2b522670d2 [dynamo] Minifier fixes for reproducing segfault (#89712)
Helped with minifying the segfault in https://github.com/pytorch/torchdynamo/issues/1928

Tests not really needed. It improves quality of life as segfault can fail anywhere (when CUDA_LAUNCH_BLOCKING is off)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89712
Approved by: https://github.com/mlazos, https://github.com/ngimel
2022-11-29 04:29:42 +00:00
c1950620c5 [decomp] Fix native_batch_norm_backward dtype of dweight and dbias (#89740)
Discovered while debugging an accuracy issue for Inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89740
Approved by: https://github.com/soumith, https://github.com/ngimel
2022-11-29 03:15:20 +00:00
4d7ec30220 Call _sdp_attention in nn.functional.mha (#89470)
# Summary
Replaces the inline block of code in nn.functional.mha with `_scaled_dot_product_attention`. This function allows the fused kernels to be called if all the required input conditions are met.
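As a point of reference, a hedged sketch of the math that the fused kernels implement (not the dispatch logic in this PR):

```python
import math
import torch

def sdpa_reference(q, k, v, attn_mask=None):
    # softmax(q @ k^T / sqrt(d)) @ v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if attn_mask is not None:
        scores = scores + attn_mask
    return torch.softmax(scores, dim=-1) @ v
```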

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89470
Approved by: https://github.com/cpuhrsch, https://github.com/mikekgfb
2022-11-29 03:02:10 +00:00
e20ec44544 fixes for inductor <> batch norm (#89603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89603
Approved by: https://github.com/albanD
2022-11-29 02:16:52 +00:00
740860d414 Add type hint to torch.norm and Tensor.norm (#89728)
Fixes #89727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89728
Approved by: https://github.com/kit1980
2022-11-29 02:09:51 +00:00
908daa8ae5 [nvfuser] avoid out of bounds error (#89584)
Summary: update OOB check (https://github.com/csarofeen/pytorch/pull/2218) and skip tests that OOM on internal machines.

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/torch/csrc/jit/codegen/cuda/test:nvfuser
```

Differential Revision: D41502369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89584
Approved by: https://github.com/jjsjann123
2022-11-29 02:03:59 +00:00
77df2ca9b6 Special-case fsdp wrapped modules to be Unspecialized (#89330)
### Summary
Making dynamo treat the nn.Modules inside FSDP wrappers as 'Unspecialized'
results in dynamo-produced graphs where nn.module parameters are inputs
to the graph rather than attributes of the outer graphmodule.

This helps in FSDP since it forces dynamo to pick the latest copy
of the parameters off the user's nn.Module (which FSDP mutates every pre_forward),
solving the ordering issue in backward.

### Details
Imagine this toy model
```
class MyModule(torch.nn.Module):
    def __init__(self, a, b):
        super(MyModule, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(a, b),
            nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net = nn.Sequential(
            *[MyModule(10, 10000)]
            + [MyModule(10000, 1000)]
            + [MyModule(1000, 5)]
        )

    def forward(self, x):
        return self.net(x)
```
Where FSDP is recursively wrapped around each `MyModule`, then dynamo-compiled, with dynamo already configured to skip/break in FSDP code.  You'd expect to get 3 compiled AOT functions, corresponding to the contents of `MyModule`, and then see FSDP's communication ops happen in between them (eagerly).  This almost happens (everything works out fine in forward), but in backward there is an ordering issue.

FSDP creates a flat buffer for all the parameters that are bucketed together, and then creates views into this buffer to replace the original parameters.  On each iteration of forward, it creates a new view after 'filling' the flatbuffer with data from an all-gather operation, to 'unshard' the parameters from remote devices.  Dynamo traces the first such view and stores it in a compiled graphmodule.

During  tracing, we see (1) view created for first MyModule, (2) compile first MyModule, (3) ... for the rest of layers

Then during runtime,  we see (A)  view created for first MyModule (and orphaned), (B) execute first compiled MyModule, using old view, ...

This is a problem, because we want backward hooks to run right after each compiled-backward, but autograd executes those hooks in an order mirroring their execution order during forward.  Since we are forever using the views created during steps (1, 3, ..  N), which all happen before the steps (A, B, ...),  this means that all the hooks will happen after all the compiled backwards.  An illustration of the problem - a torchviz graph showing the 2 possible orderings of autograd, and a profile showing the view-backwards ops happening after all the compiled backwards, and before all the backward hooks.

<img width="2069" alt="image" src="https://user-images.githubusercontent.com/4984825/202828002-32dbbd15-8fc3-4281-93e9-227ab5e32683.png">
<img width="2069" alt="image" src="https://user-images.githubusercontent.com/4984825/202828632-33e40729-9a7f-4e68-9ce1-571e3a8dd2dd.png">

A solution is to make dynamo not specialize on these nn modules.  It is worth pointing out that this nn.module specialization is de-facto failing, as we are modifying .parameters and this bypasses dynamo's __setattr__ monkeypatch, which should have automatically kicked us out to Unspecialized and forced a recompile.

After unspecializing, the new views (created during steps A,  C, ...) are actually _used_ at runtime by the module, making their creation order interleaved, making autograd execute their backwards interleaved.

The new torchviz graph (this time with names added for the view tensors):
<img width="2043" alt="image" src="https://user-images.githubusercontent.com/4984825/202828480-d30005ba-0d20-45d8-b647-30b7ff5e91d3.png">

And a new profile showing the interleaving of compiled backwards and hooks, allowing overlapping of reduce-scatter.
<img width="2293" alt="image" src="https://user-images.githubusercontent.com/4984825/202828533-bb20a041-19b8-499c-b3cf-02808933df47.png">

@jansel @davidberard98 @aazzolini @mrshenli @awgu @ezyang @soumith @voznesenskym @anijain2305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89330
Approved by: https://github.com/davidberard98
2022-11-29 01:24:03 +00:00
c75434ed4f [Inductor] Add an option to mark wrapper call in PyTorch profiler (#89674)
This PR adds an option `config.profiler_mark_wrapper_call` (disabled by default) to mark the duration of wrapper call in the PyTorch profiler. This makes it easy to identify the duration and start/end of each wrapper call in the profiler output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89674
Approved by: https://github.com/jansel
2022-11-29 00:58:46 +00:00
4b11119cc3 [functorch] fix possible overflow (#83389)
Fix some errors detected by static analysis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83389
Approved by: https://github.com/zou3519
2022-11-29 00:55:34 +00:00
63843401f5 Fix archive issue impacting summary stat diff (#89789)
The summary stat diff was reporting the diff between the previous day and the day before that, instead of between today and the previous day. The issue was that summary stats were not uploaded to the archive before the summary stat differ was run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89789
Approved by: https://github.com/anijain2305
2022-11-29 00:55:06 +00:00
943acd4d27 [FSDP] Fix nn.Parameter usage for 2D and use_orig_params=True (#89782)
This ensures that all elements of `FlatParameter._params` and `FlatParameter._shared_params` are `nn.Parameter`s (as expected). This was violated by the local tensor of a `DTensor` when using 2D parallelism. To fix the breakage, we simply wrap with `nn.Parameter` if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89782
Approved by: https://github.com/fduwjj
2022-11-28 23:56:38 +00:00
23ee6757fc [Checkpoint][2D][4/N] Add nested_dict for distributed checkpoint to core distributed (#89537)
This PR moves nested_dict and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.

This provides the functionality to flatten a nested dict and unflatten a flattened dict.

Docstring will be added in the following PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89537
Approved by: https://github.com/fduwjj, https://github.com/wanchaol
2022-11-28 23:49:17 +00:00
a378ba2123 Re-enabled 3 reductions tests on Windows (#89567)
With PR #88089, the test_ref_small_input_masked_prod tests with int8, int16, and int32 no longer overflow on Windows, so they can be re-enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89567
Approved by: https://github.com/cpuhrsch
2022-11-28 23:41:54 +00:00
f3b1315eee Add bits tensor types (#88594)
TODO (in later PRs)
- [ ] the other bits8, 4x2, 2x4, 1x8
- [ ] bits printer function
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88594
Approved by: https://github.com/ezyang
2022-11-28 23:39:57 +00:00
22e7514a15 [Checkpoint][2D][3/N] Add nested_tensors for distributed checkpoint to core distributed (#89501)
This PR moves nested_tensors to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.

This flattens sharded tensors in state_dict. It is used when saving and loading FSDP SHARDED_STATE_DICT.

Docstring, individual and integration test will be added in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89501
Approved by: https://github.com/wanchaol
2022-11-28 23:21:38 +00:00
0057be3361 [CUDA graphs] Add warning if captured graph is empty (#88754)
Fixes #87894

This PR adds a warning if the captured graph is empty (consists of zero nodes).
An example snippet where it would be useful:

```python
import torch

x = torch.randn(10)
z = torch.zeros(10)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    z = x * x
# Warn user
```

and in #87894

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88754
Approved by: https://github.com/ezyang
2022-11-28 23:20:19 +00:00
c18da597e0 [skip ci] documentation update for the kwargs defaults section of fun… (#89719)
In this doc, it is better to multiply by the scale instead of by the constant 4.0, to better illustrate the kwargs defaults.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89719
Approved by: https://github.com/kit1980, https://github.com/malfet
2022-11-28 21:49:26 +00:00
13d2af2a9b [LTC] Metrics can be reset too (#89606)
Summary:
This change allows MetricsArena to ResetMetrics too, and renames Reset to ResetCounters given that's what it actually does.

This matches pytorch/xla#4109, and is paired with pytorch/xla#4245.

Test Plan:
CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89606
Approved by: https://github.com/JackCaoG
2022-11-28 21:44:12 +00:00
5abe454d6c [Vulkan][TCC] Fix conv2d pack biases (#89568)
Summary: Fixed a bug in pack_biases where the weight scale and zero point were being assigned to the bias.

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Reviewed By: SS-JIA

Differential Revision: D41350358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89568
Approved by: https://github.com/salilsdesai
2022-11-28 21:36:01 +00:00
2e0cd7c8bd Add meta implementation for _efficientzerotensor (#88936)
`_efficientzerotensor` is used in several backwards formulas, so its
lack of a meta implementation makes those functions untraceable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88936
Approved by: https://github.com/anjali411
2022-11-28 21:24:12 +00:00
69a8c92d53 Fix comparison of batched_prop vs unbatched_prob in test_distributions (#87977)
When using SciPy >= 1.7, wishart_log_prob runs into singular samples, which means there are `inf`s in `batched_prop` and `unbatched_prop`.
The difference of two `inf`s is `nan`, which will fail the `equal(0` check.
However, passing the tensors directly to `assertEqual` is not only supported but the correct way, as it will handle `inf` values etc.

Change the same code in 2 more tests:
  - test_multivariate_normal_log_prob
  - test_lowrank_multivariate_normal_log_prob
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87977
Approved by: https://github.com/soulitzer
2022-11-28 21:15:21 +00:00
47cca5e444 Revert "Move Dynamo docs back to core (#89769)"
This reverts commit be2816db181cc4d9a1822feb1202dbd2e8c87918.

Reverted https://github.com/pytorch/pytorch/pull/89769 on behalf of https://github.com/clee2000 due to broke lint
2022-11-28 21:04:33 +00:00
8321066031 Tweak formatting of note on macros (#89598)
For readability when viewing the rendered file e.g., from the browser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89598
Approved by: https://github.com/kit1980
2022-11-28 20:42:30 +00:00
be2816db18 Move Dynamo docs back to core (#89769)
With contributions from @svekars and @malfet

Waiting for doc build job to complete
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89769
Approved by: https://github.com/soumith
2022-11-28 20:32:05 +00:00
465ee7bc09 [inductor] skip dm_nfnet_f0 in TIMM model test (#89768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89768
Approved by: https://github.com/clee2000
2022-11-28 20:08:41 +00:00
cdf4087597 [benchmarks] Disabling gradscaler (#89741)
Disabling GradScaler because
 1) The benchmark setup runs only 2 iterations of fwd-bwd, so it is not useful.
 2) The current setup shares the grad_scaler between the eager and dynamo models, which is bad because GradScaler has state and can adjust the scaling factor between the eager and dynamo runs, making the accuracy check harder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89741
Approved by: https://github.com/ngimel
2022-11-28 20:08:37 +00:00
e8643ded6d Revert "Don't allow recomputing a node that *must* be materialized in the backwards pass (#89171)" (#89770)
This reverts commit e36d68af8885f27d8c0b4727ab078bf53e55e7a0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89770
Approved by: https://github.com/anijain2305
2022-11-28 20:02:07 +00:00
2a2c07ae37 [multipy] Address GetPythonFramesFunction() and multipy incompatibility. (#267) (#89315)
Summary:
https://github.com/pytorch/pytorch/pull/89122 introduces internal compatibility issues with torchdeploy. However, GetPythonFramesFunction() never worked with torchdeploy, so as a forward fix this PR simply reverts to the original behavior of skipping the function when torchdeploy is used.

Test Plan:
Running failed tests in T128123281
```
buck2 test @//mode/opt //multipy/runtime:test_deploy -- --exact 'multipy/runtime:test_deploy - TorchpyTest.TaggingRace' --run-disabled

buck2 test mode/dev //multipy/runtime/testdev:test_deploy_from_python -- --exact 'multipy/runtime/testdev:test_deploy_from_python - multipy.runtime.testdev.test_deploy_from_python.TestDeployFromPython: test_deploy_from_python'
```

Differential Revision: D41414263

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89315
Approved by: https://github.com/kurman
2022-11-28 19:36:45 +00:00
95563b3eda Reland "Add single process version of dynamo distributed hf_Bert tests (#89721)" (#89756)
This reverts commit 0d9a615af4007014586c946cb8ffcc911d4100f6.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89756
Approved by: https://github.com/anjali411, https://github.com/malfet
2022-11-28 19:15:03 +00:00
6ef702490d Revert "Support set_rng_state with fake tensor (#89642)"
This reverts commit 2f8769d680f068cb97a829d7582fac1cdea21753.

Reverted https://github.com/pytorch/pytorch/pull/89642 on behalf of https://github.com/ezyang due to elias is right this is probably wrong
2022-11-28 19:13:33 +00:00
ed41a7fb68 Update minor release acceptance criteria (#89767)
Update minor release acceptance criteria

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89767
Approved by: https://github.com/albanD, https://github.com/weiwangmeta
2022-11-28 18:49:32 +00:00
ed9cd47e31 Add AOTAutograd and partitioner to ciflow/inductor (#89772)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89772
Approved by: https://github.com/albanD
2022-11-28 18:39:42 +00:00
cf91e3641a Use isinstance test rather than exact type test for wrap to fake (#89671)
I'm not sure why we did an exact test originally.  Let's find out!

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89671
Approved by: https://github.com/voznesenskym
2022-11-28 18:39:18 +00:00
b87c45d5a7 Make aot_module_simplified accept fake tensors (#89670)
Strategy taken from voz's #89392 but my implementation strategy
is a bit different.

If a fake tensor is provided, we use its FakeTensorMode
(and more importantly, its ShapeEnv--this is what is tested
in the new unit test).  Only one tensor needs to be fake;
if nothing is fake we just make a fresh mode as before.
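A hedged sketch of that strategy (illustrative helper; the import path and exact behavior of the real code are assumptions):

```python
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode

def pick_fake_mode(example_inputs):
    # Reuse the mode (and thus the ShapeEnv) of any fake input; otherwise
    # fall back to a fresh FakeTensorMode, as before.
    for inp in example_inputs:
        if isinstance(inp, FakeTensor):
            return inp.fake_mode
    return FakeTensorMode()
```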

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89670
Approved by: https://github.com/voznesenskym
2022-11-28 18:39:18 +00:00
abf91562bd Change aot_module_simplified to take take arguments directly (#89669)
This is extracted from voz's #89392

Previously, the implementation did some half-assed caching where it
returned a callable that, when invoked for the first time, actually
performed the compilation.  Delaying the compilation like this seems
totally unnecessary.  To make matters worse, it has a cost
(we have to check if we hit the cache) and is unsound (the
compiled function may not be valid for other arguments).

So instead, we ask user to provide arguments, and compile everything
immediately.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89669
Approved by: https://github.com/voznesenskym, https://github.com/Chillee
2022-11-28 18:39:15 +00:00
b589e726d9 Refactor how AOTAutograd backends are defined (#89736)
There was a lot of strangeness in how AOTAutograd backends were previously defined. This refactor replaces the strangeness with something simple and straightforward. The improvements:

- There is no longer a footgun aot_autograd "backend" which doesn't actually work. No more mistyping `torch._dynamo.optimize("aot_autograd")` when you meant "aot_eager"
- Deleted aot_print because it's annoying and anyway there's no uses of it
- Instead of having BOTH the backend Subgraph and AotAutogradStrategy, there is now only an aot_autograd function which takes the kwargs to configure AOTAutograd, and then gives you a compiler function that does AOTAutograd given those kwargs. Easy.
- The primary downside is that we are now eagerly populating all of the kwargs, and that can get us into import cycle shenanigans. Some cycles I resolved directly (e.g., we now no longer manually disable the forward function before passing it to aot_autograd; aot_autograd it does it for us), but for getting inductor decompositions I had to make it take a lambda so I could lazily populate the decomps later.

New code is 130 lines shorter!
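A hedged sketch of the resulting shape of such a backend factory (names, import path, and kwargs are assumptions, not a definitive reading of the PR):

```python
def aot_autograd_backend(**aot_kwargs):
    """Capture the AOTAutograd configuration eagerly; return a dynamo backend."""
    def compiler_fn(gm, example_inputs):
        from functorch.compile import aot_module_simplified  # assumed entry point
        return aot_module_simplified(gm, example_inputs, **aot_kwargs)
    return compiler_fn

# e.g. torch._dynamo.optimize(aot_autograd_backend(fw_compiler=my_fw_compiler))
```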

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89736
Approved by: https://github.com/anjali411, https://github.com/albanD
2022-11-28 18:39:12 +00:00
cf4969d9d6 [ROCm] Replace layer_norm_grad_input_kernel with cuComputeGradInput for ROCm (#87726)
We observed that the native PyTorch LayerNormBackwardKernelImplInternal has suboptimal performance for certain input sizes on AMD GPUs, especially when fs (=config_m in our benchmark script) is large and bs (=config_n in our benchmark script) is small (commonly seen in [the CvT model](https://arxiv.org/abs/2103.15808)), as measured with the benchmark script of https://github.com/pytorch/pytorch/pull/68238#issue-1051621716.

This PR replaces layer_norm_grad_input_kernel with the Apex cuComputeGradInput kernel, with some ROCm-specific parameter tuning, when fs (=config_m) is larger than or equal to `32768` on AMD GPUs. Some of the code changes in LayerNormBackwardKernelImplInternal are from another PR: https://github.com/pytorch/pytorch/pull/87635

We used the same benchmark script in the previous PR and tested the optimized kernel with various input shapes on AMD MI100 GPU.
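For context, a minimal timing sketch in the spirit of such a benchmark (this is not the script from #68238; shapes and iteration counts are illustrative):

```python
import time
import torch
import torch.nn.functional as F

def bench_layer_norm(M, N, dtype=torch.half, iters=50):
    """Rough fwd+bwd time for layer_norm on an (M, N) input, in ms per iteration."""
    x = torch.randn(M, N, device="cuda", dtype=dtype, requires_grad=True)
    w = torch.randn(N, device="cuda", dtype=dtype, requires_grad=True)
    b = torch.randn(N, device="cuda", dtype=dtype, requires_grad=True)
    for _ in range(5):  # warmup
        F.layer_norm(x, (N,), w, b).sum().backward()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        F.layer_norm(x, (N,), w, b).sum().backward()
    torch.cuda.synchronize()
    return (time.time() - start) / iters * 1e3

# e.g. bench_layer_norm(50432, 384) roughly corresponds to the first row below.
```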

**At [the previous PR](https://github.com/pytorch/pytorch/pull/87635):**

M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float)
-- | -- | -- | -- | -- | --
50432 | 384 | 0.38589 | 0.92603 | 0.38367 | 1.15148
50176 | 384 | 0.38719 | 0.91579 | 0.37815 | 1.13761
200704 | 192 | 0.99787 | 2.39954 | 0.98996 | 2.54284
802816 | 64 | 3.66525 | 7.96952 | 3.61293 | 7.69946
200 | 256 | 0.06578 | 0.34613 | 0.06966 | 0.35449
1000 | 256 | 0.07837 | 0.37631 | 0.07725 | 0.37758
6000 | 256 | 0.09318 | 0.3788 | 0.09202 | 0.37989
6272 | 256 | 0.08694 | 0.36267 | 0.08703 | 0.3615
200 | 512 | 0.06975 | 0.34506 | 0.06973 | 0.34208
1000 | 512 | 0.07012 | 0.36363 | 0.07307 | 0.36741
6000 | 512 | 0.09725 | 0.36251 | 0.09908 | 0.37078
6272 | 512 | 0.09899 | 0.36519 | 0.10068 | 0.37514
200 | 1024 | 0.07188 | 0.33896 | 0.0712 | 0.34683
1000 | 1024 | 0.07357 | 0.3625 | 0.0734 | 0.3598
6000 | 1024 | 0.12642 | 0.38949 | 0.12973 | 0.5035
6272 | 1024 | 0.12901 | 0.40759 | 0.13609 | 0.51871
200 | 1536 | 0.06998 | 0.34782 | 0.07419 | 0.3514
1000 | 1536 | 0.07987 | 0.37915 | 0.07888 | 0.37264
6000 | 1536 | 0.15401 | 0.47524 | 0.15416 | 0.68609
6272 | 1536 | 0.15286 | 0.48843 | 0.17681 | 0.72997
200 | 2048 | 0.07054 | 0.34791 | 0.07289 | 0.35138
1000 | 2048 | 0.07767 | 0.37954 | 0.08554 | 0.37464
6000 | 2048 | 0.18744 | 0.5811 | 0.25004 | 0.93338
6272 | 2048 | 0.20037 | 0.63398 | 0.26918 | 0.97018
200 | 3072 | 0.07687 | 0.36739 | 0.08917 | 0.37845
1000 | 3072 | 0.09323 | 0.38901 | 0.09739 | 0.39823
6000 | 3072 | 0.24314 | 0.89029 | 0.38093 | 1.30719
6272 | 3072 | 0.26079 | 0.92023 | 0.38352 | 1.51012
128 | 2097152 | 6.17775 | 23.876 | 10.27952 | 30.10848
256 | 1048576 | 4.51855 | 19.47637 | 10.07609 | 29.42678
512 | 524288 | 4.13615 | 18.80888 | 10.07853 | 32.29804
1024 | 262144 | 4.47397 | 17.88388 | 9.50367 | 31.15699
2048 | 131072 | 4.2458 | 16.70852 | 9.17979 | 30.51708
4096 | 65536 | 4.24412 | 16.43098 | 8.97651 | 30.1617
8192 | 32768 | 4.24556 | 16.09038 | 8.77001 | 30.3643
16384 | 16384 | 4.14642 | 15.80355 | 8.82402 | 30.35291
32768 | 8192 | 4.12599 | 15.68897 | 8.82605 | 30.43423

</body>

</html>

----

**At this PR:**

M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float)
-- | -- | -- | -- | -- | --
50432 | 384 | 0.38667 | 0.84133 | 0.37916 | 1.01222
50176 | 384 | 0.3814 | 0.87266 | 0.37858 | 1.04399
200704 | 192 | 0.99902 | 2.14386 | 0.98973 | 2.33265
802816 | 64 | 3.66578 | 6.85376 | 3.6092 | 7.00331
200 | 256 | 0.06607 | 0.34176 | 0.07009 | 0.34548
1000 | 256 | 0.06947 | 0.36461 | 0.07902 | 0.37851
6000 | 256 | 0.09319 | 0.37432 | 0.09342 | 0.36927
6272 | 256 | 0.09544 | 0.37565 | 0.09476 | 0.37377
200 | 512 | 0.07935 | 0.364 | 0.07891 | 0.36894
1000 | 512 | 0.07676 | 0.37552 | 0.07957 | 0.37564
6000 | 512 | 0.10472 | 0.37504 | 0.1051 | 0.38782
6272 | 512 | 0.1069 | 0.36662 | 0.10062 | 0.38506
200 | 1024 | 0.07793 | 0.36561 | 0.08023 | 0.35019
1000 | 1024 | 0.07426 | 0.36729 | 0.07345 | 0.35851
6000 | 1024 | 0.12729 | 0.39219 | 0.12974 | 0.51526
6272 | 1024 | 0.13622 | 0.41627 | 0.14252 | 0.52926
200 | 1536 | 0.07615 | 0.36621 | 0.0797 | 0.3695
1000 | 1536 | 0.08327 | 0.38174 | 0.07938 | 0.37573
6000 | 1536 | 0.14894 | 0.46197 | 0.15268 | 0.63814
6272 | 1536 | 0.15368 | 0.48818 | 0.16309 | 0.71441
200 | 2048 | 0.06935 | 0.36691 | 0.07258 | 0.35548
1000 | 2048 | 0.07738 | 0.36388 | 0.08036 | 0.36452
6000 | 2048 | 0.18757 | 0.58573 | 0.23701 | 0.92915
6272 | 2048 | 0.1938 | 0.61628 | 0.26475 | 0.96896
200 | 3072 | 0.07884 | 0.3673 | 0.07724 | 0.37869
1000 | 3072 | 0.09342 | 0.38193 | 0.09822 | 0.38646
6000 | 3072 | 0.24452 | 0.86776 | 0.38251 | 1.3036
6272 | 3072 | 0.25971 | 0.91053 | 0.38744 | 1.39039
128 | 2097152 | 6.06752 | 23.26379 | 9.87466 | 29.81851
256 | 1048576 | 4.50336 | 19.4614 | 10.11239 | 29.25554
512 | 524288 | 4.12649 | 18.72831 | 10.054 | 32.26784
1024 | 262144 | 4.40855 | 17.77993 | 9.38856 | 31.18679
2048 | 131072 | 4.18716 | 16.74615 | 9.14487 | 30.24603
4096 | 65536 | 4.17374 | 16.34444 | 8.94894 | 30.0326
8192 | 32768 | 4.19095 | 16.05751 | 8.70358 | 30.14669
16384 | 16384 | 4.15404 | 15.83771 | 8.80042 | 30.5022
32768 | 8192 | 4.12515 | 15.5657 | 8.66138 | 28.87386

---

**Performance Improvement (%)**

M | N | fwdbwd, torch.float16 | fwdbwd, torch.float32
-- | -- | -- | --
50432 | 384 | 9.147 | 12.094
50176 | 384 | 4.710 | 8.230
200704 | 192 | 10.655 | 8.266
802816 | 64 | 14.000 | 9.042
200 | 256 | 1.263 | 2.542
1000 | 256 | 3.109 | -0.246
6000 | 256 | 1.183 | 2.796
6272 | 256 | -3.579 | -3.394
200 | 512 | -5.489 | -7.852
1000 | 512 | -3.270 | -2.240
6000 | 512 | -3.456 | -4.596
6272 | 512 | -0.392 | -2.644
200 | 1024 | -7.862 | -0.969
1000 | 1024 | -1.321 | 0.359
6000 | 1024 | -0.693 | -2.336
6272 | 1024 | -2.130 | -2.034
200 | 1536 | -5.287 | -5.151
1000 | 1536 | -0.683 | -0.829
6000 | 1536 | 2.792 | 6.989
6272 | 1536 | 0.051 | 2.132
200 | 2048 | -5.461 | -1.167
1000 | 2048 | 4.126 | 2.701
6000 | 2048 | -0.797 | 0.453
6272 | 2048 | 2.792 | 0.126
200 | 3072 | 0.024 | -0.063
1000 | 3072 | 1.820 | 2.956
6000 | 3072 | 2.531 | 0.275
6272 | 3072 | 1.054 | 7.929
128 | 2097152 | 2.564 | 0.963
256 | 1048576 | 0.077 | 0.582
512 | 524288 | 0.428 | 0.094
1024 | 262144 | 0.581 | -0.096
2048 | 131072 | -0.225 | 0.888
4096 | 65536 | 0.527 | 0.428
8192 | 32768 | 0.204 | 0.717
16384 | 16384 | -0.216 | -0.492
32768 | 8192 | 0.786 | 5.127

CC: @jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87726
Approved by: https://github.com/ngimel
2022-11-28 18:35:27 +00:00
098cbe23c3 Update masked.rst (#89758)
Fix https://github.com/pytorch/pytorch/issues/89734

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89758
Approved by: https://github.com/anjali411, https://github.com/malfet, https://github.com/cpuhrsch
2022-11-28 17:55:43 +00:00
faa032c5e5 [GHA] Decrease Windows test timeout to 120 minutes (#89694)
This PR decreases the Windows test pipelines' timeout to 120 minutes per discussion, as requested at https://github.com/pytorch/pytorch/issues/73489#issuecomment-1322539593

Closes #73489.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89694
Approved by: https://github.com/kit1980
2022-11-28 17:24:53 +00:00
a37072170d [FSDP()] Require args as kwargs for fully_shard() (#89573)
I am not aware of any users of `FullyShardedDataParallel` that pass arguments after `process_group` positionally. I.e., I believe users pass arguments as keyword arguments. This PR formalizes this for `fully_shard()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89573
Approved by: https://github.com/mrshenli
2022-11-28 16:56:32 +00:00
090fc62b24 [FSDP()] Register root pre-forward hook (#89572)
- This PR registers the FSDP root pre-forward hook as a module forward pre-hook following the recently added support for kwargs for those hooks.
- This PR also passes `prepend=True` for the normal (not root) pre-forward hook. This is not strictly required for this PR, but I believe it is needed for composability with activation checkpointing. (We want to run FSDP logic on the outside and AC logic on the inside, just like how we recommend `FSDP(AC(module))` for the wrapper versions.)
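A hedged sketch of such a registration (the hook body is illustrative; the real FSDP root logic differs):

```python
import torch.nn as nn

def root_pre_forward(module: nn.Module, args, kwargs):
    # FSDP-style root logic (lazy init, prefetching setup, ...) would go here.
    return args, kwargs  # or return None to leave the inputs unchanged

def register_root_pre_forward(module: nn.Module):
    return module.register_forward_pre_hook(
        root_pre_forward, prepend=True, with_kwargs=True
    )
```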

Fun fact: I originally chose the `[FSDP()]` prefix in the PR titles when we still referred to composable FSDP as functional-like FSDP, in which case `FSDP()` approximated "functional FSDP". I am preserving this usage to make searching for PRs relating to composable FSDP easier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89572
Approved by: https://github.com/mrshenli
2022-11-28 16:56:32 +00:00
8721448544 Add statement about minor releases, in the release.md document (#89698)
* Add statement about minor releases

* Update RELEASE.md
2022-11-28 10:36:40 -05:00
6ba6b64a79 CI android cache conda (#89554)
Fixes - T137631662

Caching conda dependencies for android build workflows.
Conda dependencies have been gathered from the following workflow
1. https://github.com/pytorch/pytorch/blob/master/.github/workflows/_run_android_tests.yml

The pull request updates the action from conda-incubator/setup-miniconda@v2 to pytorch/test-infra/.github/actions/setup-miniconda@main as it supports caching.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89554
Approved by: https://github.com/huydhn
2022-11-28 15:02:30 +00:00
2661ff10a9 Include test/distributed/test_dynamo_distributed.py for ciflow/inductor (#89755)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89755
Approved by: https://github.com/anjali411
2022-11-28 15:02:09 +00:00
0d9a615af4 Revert "Add single process version of dynamo distributed hf_Bert tests (#89721)"
This reverts commit 1a2dd6b15e0089a9e45ba4feb90c2d0dfac19238.

Reverted https://github.com/pytorch/pytorch/pull/89721 on behalf of https://github.com/ezyang due to this broke inductor_distributed job
2022-11-28 14:56:54 +00:00
2f8769d680 Support set_rng_state with fake tensor (#89642)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89642
Approved by: https://github.com/anjali411
2022-11-28 14:49:30 +00:00
856e2fa59c Guard traceable_tensor_subclasses patching with finally (#89689)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89689
Approved by: https://github.com/albanD, https://github.com/anjali411
2022-11-28 14:48:12 +00:00
49eb43fc45 Don't modify log level in dynamo distributed test (#89655)
Let the developer decide!

Taken from voz's https://github.com/pytorch/pytorch/pull/89392

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89655
Approved by: https://github.com/albanD
2022-11-28 14:47:52 +00:00
d089fbdc33 suppress Werror introduced by lack of override by #86786 on bool initialized() (#89687) 2022-11-28 15:16:15 +01:00
f45fe7de33 Add mypy checking for a few files in torch/_dynamo (#89731)
It's kind of intractable to enable mypy everywhere at the moment,
because there are a lot of errors, and also mypy is really slow
for some reason.  I just want enough types to explain the public
types for user compiler calls, going through typing the _C.dynamo
bindings along the way.  This is a first step for this.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89731
Approved by: https://github.com/suo
2022-11-28 13:14:06 +00:00
55e8b5c126 [xla hash update] update the pinned xla hash (#89405)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89405
Approved by: https://github.com/pytorchbot
2022-11-28 10:27:24 +00:00
b5616cd5f4 Add simple assert to detect fake tensors on modules (#89723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89723
Approved by: https://github.com/ezyang
2022-11-28 08:57:33 +00:00
db1f1144f1 Beef up AOTAutograd logging with aot_id and input descriptions (#89710)
A few things in this PR, that I found useful while debugging some
recent issues:

- We now allocate an aot_id to each aot_function/aot_module invocation,
  and print it whenever we report error messages and graph output
  logging.  Check the comment for why this sort of thing is useful,
  and also why it's different from nth_graph.  This number is now
  incorporated into aot_graph_name

- I noticed that nth_graph only gets incremented when backwards is
  compiled.  Because backwards is compiled lazily, this means that
  multiple forward graphs would have gotten the same ID!  I change
  nth_graph to always increment to avoid confusion here.

- I added a simple describe_input function, which makes use of
  num_params_buffers to tell the user if the input index they're
  looking at is a param/buffer or an input.  With the help of
  https://github.com/pytorch/pytorch/pull/89709 we could give
  even more detailed information about inputs  (we could also
  easily give detailed information about parameters if we stored
  a mapping of index to parameter name, but I didn't need this
  when debugging so I'll let someone else add it if they need
  it.)
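A minimal sketch of what such a describe_input helper could look like (names and wording are illustrative, not the actual implementation):
```python
def describe_input(i: int, num_params_buffers: int) -> str:
    # the first num_params_buffers flat inputs correspond to params/buffers,
    # the rest are genuine user inputs
    if i < num_params_buffers:
        return f"parameter/buffer at index {i}"
    return f"input {i - num_params_buffers}"
```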

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89710
Approved by: https://github.com/bdhirsh
2022-11-28 04:52:05 +00:00
5f8848f329 Don't suppress log messages for dynamo CI config (#89653)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89653
Approved by: https://github.com/albanD, https://github.com/kit1980
2022-11-28 03:39:40 +00:00
1a2dd6b15e Add single process version of dynamo distributed hf_Bert tests (#89721)
It's a lot easier to debug problems in the Dynamo optimization pass if
you aren't actually triggering a multiprocessing run.  Keep these tests
around.

I think the other tests can probably get this treatment too, leaving
this to future work.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89721
Approved by: https://github.com/voznesenskym
2022-11-28 03:16:47 +00:00
0e7c100c9b Add debug asserts to AOTAutograd for input consistency with compilation (#89702)
Fixes https://github.com/pytorch/torchdynamo/issues/1927

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89702
Approved by: https://github.com/bdhirsh
2022-11-28 00:36:58 +00:00
1f95f24d30 Factor input deduplication into a separate function (#89701)
It turns out that instead of having a giant blobby aot_dispatch_autograd
function, we can factor it into a series of wrapper functions, each
of which successively guarantees more invariants on the inner
compilation function until the final inner function is quite trivial.
How exactly you have to wrap the input user functions and the output
compiled functions can be expressed concisely in Haskell, so I've
included the Haskell formulation in code comments.

This PR shows how to do this for input deduplication.  Dealing with the
rest of the view handling is left to future work.

This PR should also be a slight performance improvement as deduplicating
is skipped entirely when there are no duplicate inputs.
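A rough sketch of the wrapper idea (with a hypothetical `inner_compile` callable; this is not the actual AOTAutograd code):
```python
def with_dedup(inner_compile, flat_args):
    # record the position of the first occurrence of each distinct tensor
    first_pos, keep = {}, []
    for i, a in enumerate(flat_args):
        if id(a) not in first_pos:
            first_pos[id(a)] = i
            keep.append(i)
    if len(keep) == len(flat_args):
        # fast path: no duplicates, compile the user function as-is
        return inner_compile(flat_args)
    compiled = inner_compile([flat_args[i] for i in keep])

    def runtime_wrapper(*args):
        # drop the duplicate positions again at every call
        return compiled(*[args[i] for i in keep])

    return runtime_wrapper
```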

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89701
Approved by: https://github.com/bdhirsh
2022-11-28 00:36:58 +00:00
dcefc8f90f Implement guard_source on RandomValueSource (#89711)
I audited the pattern matches on the enum and it didn't
look like this one should apply there.

Sorry, no test, I know this matters on symbolic-shapes branch
but I haven't had time to extract out a minimal reproducer.
Take my word for it.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89711
Approved by: https://github.com/jansel
2022-11-28 00:32:48 +00:00
1da633f98a Access named parameters/buffers/etc via getattr rather than index (#89625)
I'm not sure why this never caused problems before.  The error
manifests as `TypeError: 'MyModule' object is not subscriptable`
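A small illustration of the difference (the module is hypothetical; `get_parameter`/`get_submodule` are the standard `nn.Module` accessors):
```python
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

m = MyModule()
# m["linear.weight"] raises TypeError: 'MyModule' object is not subscriptable
obj = m
for part in "linear.weight".split("."):
    obj = getattr(obj, part)               # attribute access works for any module
weight = m.get_parameter("linear.weight")  # equivalent built-in accessor
```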

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89625
Approved by: https://github.com/albanD
2022-11-28 00:19:48 +00:00
e36d68af88 Don't allow recomputing a node that *must* be materialized in the backwards pass (#89171)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89171
Approved by: https://github.com/ngimel
2022-11-27 19:09:24 +00:00
b709078dc6 [Profiler] Memory profiler part 11: Mark tensors created in the backward pass which don't correspond to parameters. (#88926)
There are various Tensors created in the backward pass which do not correspond to parameters. We don't want to mark these as gradients, but we do still want to convey as much information as possible. Thus, this PR introduces an AUTOGRAD_DETAIL category. (Which can be grouped with GRADIENT in visualization if one wishes to take a coarse grained view of the world.)

Differential Revision: [D40868661](https://our.internmc.facebook.com/intern/diff/D40868661/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88926
Approved by: https://github.com/chaekit
2022-11-27 12:20:30 +00:00
143d2881a8 [Profiler] Memory profiler part 10: Mark optimizer state (#88925)
This is also a fairly simple pass, since we're simply collecting values from the python tracer.

Differential Revision: [D40868664](https://our.internmc.facebook.com/intern/diff/D40868664/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88925
Approved by: https://github.com/chaekit
2022-11-27 12:20:30 +00:00
ae725d501e [Profiler] Memory profiler part 9: Mark activations (#88924)
This is a fairly straightforward pass: start at inputs and flood fill until we reach the backward pass.

Differential Revision: [D40868662](https://our.internmc.facebook.com/intern/diff/D40868662/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88924
Approved by: https://github.com/chaekit
2022-11-27 12:20:28 +00:00
56e40fe054 Let SyncBatchNorm fallback to BN if not using distributed training (#89706)
Fixes #63662
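A minimal sketch of the fallback condition (function and variable names are illustrative, not the actual SyncBatchNorm code):
```python
import torch.distributed as dist
import torch.nn.functional as F

def sync_bn_forward(bn, x):
    need_sync = (
        dist.is_available()
        and dist.is_initialized()
        and dist.get_world_size() > 1
    )
    if not need_sync:
        # behave like a plain BatchNorm layer
        return F.batch_norm(
            x, bn.running_mean, bn.running_var, bn.weight, bn.bias,
            bn.training, bn.momentum, bn.eps,
        )
    ...  # otherwise run the cross-process synchronized statistics path
```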
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89706
Approved by: https://github.com/soumith
2022-11-27 05:55:24 +00:00
39449ea61d [vision hash update] update the pinned vision hash (#89692)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89692
Approved by: https://github.com/pytorchbot
2022-11-27 02:59:06 +00:00
483d3a3d07 [Profiler] E2E expecttests for category assignment (#88653)
Up until now the unit tests for category assignment have been narrowly scoped to specific checks on specific Tensors. However as we start to reach reasonable levels of category assignment it's useful to supplement those tests with higher level summary tests to inspect the larger graph and confirm that it makes sense. (It will also be necessary for some categories like activations where it is tedious to record all relevant Tensors.)

The general structure of these tests is to capture a model invocation with `__torch_dispatch__` and then cross reference those inputs and outputs with the categories assigned by the memory profiler.

Differential Revision: [D40868659](https://our.internmc.facebook.com/intern/diff/D40868659/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88653
Approved by: https://github.com/chaekit
2022-11-27 02:10:29 +00:00
0435894bb3 [Profiler] Memory profiler part 8: Mark parameters. (#87568)
Following the pattern of earlier PRs, we use two methods to extract parameters. The primary one is the Python tracer; both nn.Module and optim.Optimizer collect parameters and in most cases that is sufficient. As a fallback we can analyze the data flow graph and deduce likely parameters based on gradient computation and updates.

Parameter identification has a circular interaction with input identification. Inputs are defined as "not part of the core forward-backward-update loop", but we need inputs for the parameter identification fallback to give us a proxy for the forward pass. Thus, we mark parameters from the python tracer which limits which Tensors get marked as inputs. While not necessary, it adds a bit of robustness. (As shown by the strengthening of the input unit tests.)

Differential Revision: [D40238619](https://our.internmc.facebook.com/intern/diff/D40238619/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87568
Approved by: https://github.com/chaekit
2022-11-27 02:10:29 +00:00
17fa6bf1f5 [Profiler] Memory profiler part 7: Mark inputs (#87567)
It is surprisingly difficult to identify the leaves of the data flow graph. The issue is that inputs and pre-existing parameters look identical until parameter identification takes place. It's not too bad for training since Autograd lets us differentiate between them; however, I still want the tool to do something reasonable in inference.

Some of this will be ameliorated when a later PR pulls in parameters from python tracing. The current approach is passable, but I will continue to mull over refinements.

Differential Revision: [D40220388](https://our.internmc.facebook.com/intern/diff/D40220388/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87567
Approved by: https://github.com/chaekit
2022-11-27 02:10:27 +00:00
64c5c77cd4 [Profiler] Memory profiler part 6: Mark gradients and temporary intermediates. (#87566)
Semantic assignment will be built up as a series of passes which gradually pin down the regions of a trace. For this reason it is important to be very meticulous in the assignment of categories.

We begin with gradients as they are both straightforward to identify and foundational to subsequent analysis. There are two mechanisms that the profiler can use to tag gradients, each with their own advantages and limitations. The first is direct inspection of the op graph, which is generic but predicated on certain features of the Autograd engine. (And therefore not necessarily exhaustive.) The second approach is direct instrumentation via the python tracer. This method requires that gradients be attached to an nn.Module parameter and can miss corner cases such as `set_to_none=True` due to the cache structure of the python tracer. Combined, these two approaches provide very high coverage.
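A toy illustration of the second, tracer-style signal: gradients attached to module parameters can be found by plain inspection (a sketch of the idea only, not the profiler's implementation, and it misses the `set_to_none=True` case noted above):
```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
model(torch.randn(8, 4)).sum().backward()

gradient_ptrs = {
    name: p.grad.data_ptr()
    for name, p in model.named_parameters()
    if p.grad is not None            # .grad is None if set_to_none was used
}
```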

Temporaries are more straightforward; we can easily add them by trivial local inspection of a data flow node.

Because this is the first PR in the end-to-end section most of the code is building the scaffolding for category bookkeeping and unit testing. (The actual gradient extraction was covered in an earlier PR.)

Differential Revision: [D40220389](https://our.internmc.facebook.com/intern/diff/D40220389/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87566
Approved by: https://github.com/chaekit
2022-11-27 02:10:26 +00:00
5f09a6d573 [Profiler] Memory profiler part 5: Data flow graph (#87006)
The semantic meaning of a Tensor is tightly coupled to its lineage. The data flow graph allows us to identify temporary Tensors, masks, inputs, activations, and more. However one important nuance is that Tensors must be versioned; operations which mutate their inputs can also change the semantic meaning of said inputs.
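The versioning point can be seen directly on eager tensors: every in-place mutation bumps the tensor's version counter, which is the kind of signal that lets a data flow graph treat "the same" tensor before and after a mutation as different semantic entities (a toy illustration only):
```python
import torch

t = torch.zeros(3)
v0 = t._version      # internal version counter
t.add_(1)            # in-place op: same storage, new "version" of the tensor
assert t._version == v0 + 1
```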

It is challenging to assemble a complete picture of the data flow in a PyTorch model because ops can, and often do, recursively call into other ops. For the purpose of memory profiling this is an implementation detail, so instead we traverse the op tree to identify top level ops and allocations and then coalesce their children, folding inputs and outputs into the top level Node.

Differential Revision: [D40220391](https://our.internmc.facebook.com/intern/diff/D40220391/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87006
Approved by: https://github.com/chaekit
2022-11-27 00:28:57 +00:00
c3116dd78b [Profiler] Memory profiler part 4: Select top level torch ops (#86880)
In a later PR we will walk the children of these nodes and formulate a node from the entire bundle to build a data flow graph. This PR simply defines what a "top level" op is.

Differential Revision: [D40220387](https://our.internmc.facebook.com/intern/diff/D40220387/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86880
Approved by: https://github.com/chaekit
2022-11-27 00:28:57 +00:00
bb77accb4c [Inductor] Record cpp kernel in PyTorch Profiler (#89367)
Add an option `config.cpp.enable_kernel_profile` to record individual cpp kernel time in PyTorch Profiler.
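A hedged usage sketch (assuming the option sits at `torch._inductor.config.cpp.enable_kernel_profile` as the description suggests; the model and profiling calls are illustrative):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp.enable_kernel_profile = True

@torch.compile
def f(x):
    return torch.relu(x) * 2 + 1

x = torch.randn(1024)    # CPU tensor, so inductor emits C++ kernels
f(x)                     # warm up / compile
with torch.profiler.profile() as prof:
    f(x)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```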

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89367
Approved by: https://github.com/jansel
2022-11-26 14:06:44 +00:00
36018a6ee6 Don't suppress exceptions from backends (#89656)
Taken from voz's https://github.com/pytorch/pytorch/pull/89392

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89656
Approved by: https://github.com/voznesenskym
2022-11-26 03:18:05 +00:00
3e20d023b1 put descriptive kernel names behind config (#89697)
Per title, generated kernel names are often long and confusing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89697
Approved by: https://github.com/Chillee
2022-11-26 03:08:23 +00:00
591dfffa38 update docstring for torch.linalg.lstsq (#89383)
Previous documentation lacked details about the handling of over- and underdetermined systems, and made incorrect mention of MAGMA.

Fixes #85021
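For reference, the over-determined case the docs now describe (standard `torch.linalg.lstsq` usage):
```python
import torch

A = torch.randn(5, 3)                    # more equations than unknowns
B = torch.randn(5, 2)
X = torch.linalg.lstsq(A, B).solution    # least-squares solution, shape (3, 2)
residual = torch.linalg.norm(A @ X - B)
```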

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89383
Approved by: https://github.com/lezcano
2022-11-25 21:31:53 +00:00
c9a0cc8640 Simplify aot_module_simplified by removing top_args/top_kwargs (#89666)
This makes good on Chillee's CR comment at
af30d351cc (r843315222)
which was never done in the original PR.

There is no logic change, just unpack the args/kwargs at the top
level and remove the inner function indirection.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89666
Approved by: https://github.com/voznesenskym
2022-11-25 20:43:13 +00:00
6168f22fae Don't support kwargs at runtime in aot_module_simplified (#89664)
The preexisting logic here added in
https://github.com/pytorch/functorch/pull/970 was very peculiar: if top_kwargs
was non-empty, then the inner compiled function supports kwargs.  Naively, this
would lead you to expect that there is some sort of correlation between
top_kwargs and kwargs.  But in fact, they're completely unrelated!  top_kwargs
is the AOTAutograd configuration knobs (e.g., fw_compiler/bw_compiler), but
kwargs is the RUNTIME kwargs that are to be passed to the compiled function.
But (1) we don't support this (the function to be compiled only takes a list
of tensors) and (2) even if we did support it, conditioning on whether or not
you had passed AOTAutograd configuration kwargs to support kwargs at runtime
is bonkers.

So delete it.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89664
Approved by: https://github.com/voznesenskym
2022-11-25 20:43:13 +00:00
b04dda4291 Delay verify correctness wrapping to call site. (#89662)
There is only one call site for compiler_fn, so we can safely delay
wrapping verify correctness to here.  This will help later when we
change the backend compiler calling convention to pass fake tensors
(but I need to pass real tensors here.)

This is adapted from voz's changes at https://github.com/pytorch/pytorch/pull/89392
but with fewer changes to the substantive logic.  I only moved the relevant
inner implementation; there are no changes otherwise.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89662
Approved by: https://github.com/voznesenskym
2022-11-25 20:43:11 +00:00
61a3fe4b64 make inductor correctly propagate nans for maximum and minimum (#89612)
Partially fixes https://github.com/pytorch/torchdynamo/issues/594
Also, small cleanup for `where` codegen
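The eager semantics the compiled code should now match (NaN wins the comparison in `torch.maximum`/`torch.minimum`):
```python
import torch

a = torch.tensor([1.0, float("nan")])
b = torch.tensor([2.0, 0.0])
torch.maximum(a, b)   # tensor([2., nan]) -- the NaN element propagates
torch.minimum(a, b)   # tensor([1., nan])
```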

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89612
Approved by: https://github.com/soumith, https://github.com/jansel
2022-11-25 19:42:38 +00:00
70c0a3006e Fix typo in segment_reduction_op_gpu.cu (#89647)
menber -> member

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89647
Approved by: https://github.com/kit1980
2022-11-25 19:26:18 +00:00
2c0bd85c75 complex: register c10::complex with py::cast (#89680)
Fixes #77134

TODO:
* [x] Add test (tested locally with script below) (Are there similar tests in the test-suite?)

```c++
#include <torch/torch.h>
#include <torch/csrc/utils/pybind.h>
#include <iostream>
#include <vector>
#include <pybind11/pybind11.h>
#include <pybind11/embed.h>
#include <cassert>

namespace py = pybind11;

int main() {
    py::scoped_interpreter guard{}; // start the interpreter
    auto casted_cdouble = py::cast(c10::complex<double>(1.0, 2.0));
    assert(
        (c10::complex<double>(1.0, 2.0) ==
         py::cast<c10::complex<double>>(casted_cdouble)));

    auto casted_cfloat = py::cast(c10::complex<float>(1.0, 2.0));
    assert(
        (c10::complex<double>(1.0, 2.0) ==
         py::cast<c10::complex<double>>(casted_cfloat)));

    auto casted_chalf = py::cast(c10::complex<at::Half>(1.0, 2.0));
    assert(
        (c10::complex<double>(1.0, 2.0) ==
         py::cast<c10::complex<double>>(casted_chalf)));
}

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89680
Approved by: https://github.com/ezyang
2022-11-25 14:53:57 +00:00
abb446af8c Implement old windows in Python (#87082)
Relates to #85366

- Bartlett, Blackman, Hamming, Hann.
- Except Kaiser which will be in a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87082
Approved by: https://github.com/mruberry, https://github.com/lezcano
2022-11-25 11:09:28 +00:00
95ea47ef0c torchdynamo to torch._dynamo in aot_autograd.py (#89385)
Test Plan: Run torchbench models

Differential Revision: D41429573

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89385
Approved by: https://github.com/soumith, https://github.com/malfet
2022-11-25 04:28:36 +00:00
6904324781 Remove fake_tensor_propagation (#89646)
You always have to run dynamo with fake tensors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89646
Approved by: https://github.com/soumith
2022-11-25 03:27:32 +00:00
1aa1014b26 xfail maml test, instead of running it without fake tensor prop (#89645)
A previous version of this patch inserted a graph break when torch.tensor fails, but that causes

```
PYTORCH_TEST_WITH_DYNAMO=1 python test/nn/test_embedding.py -k test_embedding_bag_1D_padding_idx_cpu_float32
```

to start failing. Probably another latent bug that needs investigating.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89645
Approved by: https://github.com/albanD
2022-11-25 03:27:32 +00:00
a048913e25 [vision hash update] update the pinned vision hash (#89667)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89667
Approved by: https://github.com/pytorchbot
2022-11-25 03:03:43 +00:00
3b3ebcd031 TorchDynamo: weight prepack for single conv (#89209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89209
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-25 01:23:11 +00:00
0c4f3db7bf TorchDynamo: weight prepack for mkl linear (#89109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89109
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-25 01:20:19 +00:00
07151a6bd6 TorchDynamo: weight prepack for onednn convolution external call (#88988)
This PR enables weight prepacking using the MKLDNN tensor:
1.  enable fake tensor mode for MKLDNN tensor input.
2.  make convolution fusion kernel support MKLDNN tensor input.
3. do the weight prepack at FX fusion step.

For better performance, we always use channels_last for the CPU convolution path, because we measured that channels_last gets better performance than the blocked-input path and also avoids the activation's layout conversions (plain to block, block to plain); currently only plain-to-plain format conversion is needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88988
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-25 01:16:11 +00:00
0884fdaba0 Revert "Dont clone unmutated args in triton autotuning (#89519)" (#89652)
This reverts commit f18f0c70ab10c400947e71be30794e04dcc22acf.

Testing to see if this fixes gmixer_24_224 mixer_b16_224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89652
Approved by: https://github.com/eellison
2022-11-24 22:49:09 +00:00
4a16f8cdb2 Reenable fake_tensor_propagation on test_cudnn_rnn (#89644)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89644
Approved by: https://github.com/anjali411
2022-11-24 22:46:49 +00:00
fc7dcb684a Run optimizer tests with fake tensors (#89643)
This is a slight regression: RAdam and Adagrad don't appear to
trace at all under fake tensors.  But I think this is a more accurate
reflection of the current state of affairs.

Along the way fix some problems on the fake tensor path.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89643
Approved by: https://github.com/anjali411
2022-11-24 22:46:49 +00:00
9b13508ef3 Force test_rng_state to run with fake tensor prop (#89641)
I'm not really sure what desertfire's intended follow up was
on https://github.com/pytorch/pytorch/pull/87490 because when I remove
the unsupported() call, dynamo tests pass.  But the change here is
conservative and I think strictly better than the current situation.
The idea is to force fake tensor prop on for the test, and then just
observe that we are doing a graph break.  Clearly, export doesn't work,
so I manually xfail it.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89641
Approved by: https://github.com/anjali411
2022-11-24 22:46:47 +00:00
c6be06d93a Easy: These tests work with fake_tensor_propagation on (#89640)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89640
Approved by: https://github.com/anjali411, https://github.com/albanD
2022-11-24 22:46:45 +00:00
6fb6eb0a74 Support unspecialized integers with dynamic shapes (#89639)
Previously, we hackily wrapped unspecialized integers into
tensors and treated them as tensor inputs.  Sometimes, downstream
operations would not be able to deal with the tensor input.  Now,
we wrap them into SymInt, so more correct overload selection occurs.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89639
Approved by: https://github.com/anjali411
2022-11-24 22:46:42 +00:00
0c96841a20 Cond capture with fake tensors actually works; don't raise in this case (#89638)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89638
Approved by: https://github.com/anjali411
2022-11-24 22:46:40 +00:00
d3c012f409 [test_nn] split pruning tests from test_nn (#89590)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89590
Approved by: https://github.com/albanD
2022-11-24 21:41:22 +00:00
83666f167d Added vectorized CPU code for uint8_t datatype. (#89284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89284
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2022-11-24 19:58:40 +00:00
9497552771 Update SyncBatchNorm _all_gather_base to all_gather_into_tensor (#89521)
Summary: Fixes https://github.com/pytorch/pytorch/issues/88568

`_all_gather_base` is deprecated. So replacing its usage with `all_gather_into_tensor`
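A minimal sketch of the replacement call (assumes a process group has already been initialized; shapes are illustrative):
```python
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
local = torch.full((4,), float(dist.get_rank()))
out = torch.empty(world_size * 4)
dist.all_gather_into_tensor(out, local)   # replaces the deprecated _all_gather_base
```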

Test Plan: CI

Differential Revision: D41479983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89521
Approved by: https://github.com/wz337
2022-11-24 19:41:17 +00:00
94a88b53ed Remove fake_tensors_available (#89637)
As we are one repo now, they are always available.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89637
Approved by: https://github.com/anjali411
2022-11-24 19:28:10 +00:00
1c8b0779de Fix segfault when swapping custom allocator (#89613)
Just screwed it before merging ...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89613
Approved by: https://github.com/albanD
2022-11-24 18:25:28 +00:00
fd279fe85b Make pytest work again on test/dynamo (#89631)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89631
Approved by: https://github.com/anjali411
2022-11-24 17:24:25 +00:00
c3e85d879c Mention discrepency between original impl and our impl of RAdam (#89575)
Fixes https://github.com/pytorch/pytorch/issues/88836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89575
Approved by: https://github.com/mruberry
2022-11-24 17:11:42 +00:00
860bae49e4 Suppress guards on as_strided call only. (#89569)
See comment in meta_utils.py for the whole story.

This doesn't have a substantive impact yet, but will in the next
PR on the stack.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89569
Approved by: https://github.com/albanD
2022-11-24 14:01:12 +00:00
1588ea0dbf Added log1p for complex in c10 (#89214)
One PR towards #89205.
The content is mostly from PR #38465, but I slightly changed the expression to make it faster.

Here are some benchmarking code:
```c++
#include <complex>
#include <iostream>
#include <chrono>

// main.cc

template<typename T> inline std::complex<T> log1p_v0(const std::complex<T> &z) {
    // this PR
    T x = z.real();
    T y = z.imag();
    T theta = std::atan2(y, x + T(1));
    T r = x * (x + T(2)) + y * y;
    return {T(0.5) * std::log1p(r), theta};
}

template<typename T> inline std::complex<T> log1p_v1(const std::complex<T> &z) {
    // PR #38465
    T x = z.real();
    T y = z.imag();
    std::complex<T> p1 = z + T(1);
    T r = std::abs(p1);
    T a = std::arg(p1);
    T rm1 = (x * x + y * y + x * T(2)) / (r + 1);
    return {std::log1p(rm1), a};
}

template<typename T>
inline std::complex<T> log1p_v2(const std::complex<T> &z) {
    // naive, but numerically inaccurate
    return std::log(T(1) + z);
}

int main() {
    int n = 1000000;
    std::complex<float> res(0.0, 0.0);
    std::complex<float> input(0.5, 2.0);
    auto start = std::chrono::system_clock::now();
    for (int i = 0; i < n; i++) {
        res += log1p_v0(input);
    }
    auto end = std::chrono::system_clock::now();
    auto elapsed = end - start;
    std::cout << "time for v0: " << elapsed.count() << '\n';

    start = std::chrono::system_clock::now();
    for (int i = 0; i < n; i++) {
        res += log1p_v1(input);
    }
    end = std::chrono::system_clock::now();
    elapsed = end - start;
    std::cout << "time for v1: " << elapsed.count() << '\n';

    start = std::chrono::system_clock::now();
    for (int i = 0; i < n; i++) {
        res += log1p_v2(input);
    }
    end = std::chrono::system_clock::now();
    elapsed = end - start;
    std::cout << "time for v2: " << elapsed.count() << '\n';
    std::cout << res << '\n';
}
```

Compiling the script with command `g++ main.cc` produces the following results:
```
time for v0: 237812271
time for v1: 414524941
time for v2: 360585994
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89214
Approved by: https://github.com/lezcano
2022-11-24 11:11:51 +00:00
4f5c4c022a [LTC] Refine MetricsArena::Reset (#89608)
Summary:
After counters are reset, getters' behaviors are inconsistent. To improve that, here I 1) move the validation of CounterData into CounterData::IsValid such that it's better encapsulated, 2) divide getters into two groups: a) MetricsArena::GetCounter() and b) MetricsArena::ForEachCounter(), and route MetricsArena::GetCounterNames() and CreateMetricReport() to use b.

This is paired with pytorch/xla#4217.

Test Plan:
PJRT_DEVICE=CPU python xla/test/test_metrics.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89608
Approved by: https://github.com/JackCaoG
2022-11-24 10:57:03 +00:00
a8629a1c18 Upgrade nightly wheels to ROCm5.3 (#89101)
Dependent on PR https://github.com/pytorch/builder/pull/1193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89101
Approved by: https://github.com/kit1980
2022-11-24 10:53:22 +00:00
c0d81aa70c Use fx.replace_pattern for removing empty_like+fill in nvFuser+PrimTorch execution (#89132)
I learned about `torch.fx.replace_pattern`; it's a cleaner way of removing unnecessary tensor materialization from the graph produced by tracing the C++ code `1 - tensor`.

Test:
```
python -m pytest test/test_prims.py -k "test_silu_backward_no_filled_tensor"
```
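For readers unfamiliar with the API, a generic `torch.fx.replace_pattern` example (the pattern here is illustrative, not the exact empty_like+fill subgraph removed by this PR):
```python
import torch
import torch.fx as fx

class M(torch.nn.Module):
    def forward(self, x):
        ones = torch.full_like(x, 1.0)   # materializes a constant tensor
        return ones - x

def pattern(x):
    return torch.full_like(x, 1.0) - x

def replacement(x):
    return 1 - x                          # no intermediate tensor

gm = fx.symbolic_trace(M())
fx.replace_pattern(gm, pattern, replacement)   # rewrite matching subgraphs in place
print(gm.code)
```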

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89132
Approved by: https://github.com/mruberry, https://github.com/jjsjann123
2022-11-24 09:37:10 +00:00
b515c1d960 [QAT] Check the value of numel to avoid segfault (#81547)
Fixes #78123

### Original Result

Segmentation fault

### Result after fix

RuntimeError: numel is out of the bound of input tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81547
Approved by: https://github.com/kit1980
2022-11-24 08:14:24 +00:00
22a1b5e243 quantization: deprecate observer compute_dtype and replace with is_dynamic (#85431)
Summary:

This PR deprecates the `compute_dtype` field on observers, and replaces
it with the `is_dynamic` field on observers.  This is better aligned
with the reference model spec.

Test plan:

```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85431
Approved by: https://github.com/jerryzh168
2022-11-24 07:07:34 +00:00
e4ccec6eca [Dynamo] Fix bug of using customized torch.autograd.Function (#89397)
Fixes https://github.com/pytorch/torchdynamo/issues/1899
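The general shape of code this targets (a toy custom Function; the actual repro lives in the linked issue):
```python
import torch

class Scale(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * 2

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * 2

@torch._dynamo.optimize("eager")
def f(x):
    return Scale.apply(x).sum()

f(torch.randn(3, requires_grad=True)).backward()
```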

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89397
Approved by: https://github.com/jansel
2022-11-24 05:28:58 +00:00
903ae4570e Disable optimizer tracing, enable for tests only (#89500)
Disabling optimizer tracing before launch until it can be added to the benchmark suites without increasing compile times

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89500
Approved by: https://github.com/anijain2305
2022-11-24 04:15:34 +00:00
c79489c8e6 Expose to python the backward AD view_func (#89586)
This will be useful for other systems (AOTAutograd) that want to replay autograd views.

FYI @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89586
Approved by: https://github.com/soulitzer
2022-11-24 03:39:58 +00:00
4cb6bbbe27 Symintify embedding (#89327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89327
Approved by: https://github.com/ezyang
2022-11-24 03:25:00 +00:00
9c867eae1a nnc: fix Store if value is fp32 while buf is bf16 (#86788)
Fixes https://github.com/pytorch/pytorch/issues/86533.
For the below graph:
```bash
[DUMP kernel.cpp:1690] TensorExprKernel graph:
[DUMP kernel.cpp:1690] graph(%x.1 : BFloat16(10, strides=[1], requires_grad=0, device=cpu)):
[DUMP kernel.cpp:1690]   %1 : int = prim::Constant[value=0]()
[DUMP kernel.cpp:1690]   %2 : BFloat16(10, strides=[1], requires_grad=0, device=cpu) = aten::pow(%x.1, %1) # test/test_tensorexpr.py:1330:29
[DUMP kernel.cpp:1690]   %3 : BFloat16(10, strides=[1], requires_grad=0, device=cpu) = aten::sin(%2) # test/test_tensorexpr.py:1330:19
[DUMP kernel.cpp:1690]   return (%3)
```

**Loop stmt before the fix:**
The store value `0.8414709568023682f` is float while the scalar_type of the store buf `aten_sin` is bf16.
```bash
[DEBUG llvm_codegen.cpp:489] After HalfRewriter {
[DEBUG llvm_codegen.cpp:489]   aten_sin[Ramp(0ll, 1ll, 8)] = Broadcast(0.8414709568023682f, 8);
[DEBUG llvm_codegen.cpp:489]   for (int64_t i_1_tail_tail = 0ll; i_1_tail_tail < 2ll; i_1_tail_tail++) {
[DEBUG llvm_codegen.cpp:489]     aten_sin[i_1_tail_tail + 8ll] = 0.8414709568023682f;
[DEBUG llvm_codegen.cpp:489]   }
[DEBUG llvm_codegen.cpp:489] }
```

**Loop stmt after the fix:**
```bash
[DEBUG llvm_codegen.cpp:489] After HalfRewriter {
[DEBUG llvm_codegen.cpp:489]   aten_sin[Ramp(0ll, 1ll, 8)] = bfloat16(Broadcast(0.8414709568023682f, 8));
[DEBUG llvm_codegen.cpp:489]   for (int64_t i_1_tail_tail = 0ll; i_1_tail_tail < 2ll; i_1_tail_tail++) {
[DEBUG llvm_codegen.cpp:489]     aten_sin[i_1_tail_tail + 8ll] = bfloat16(0.8414709568023682f);
[DEBUG llvm_codegen.cpp:489]   }
[DEBUG llvm_codegen.cpp:489] }
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86788
Approved by: https://github.com/EikanWang, https://github.com/kit1980
2022-11-24 02:52:34 +00:00
f0e5bc4b9f Symintified layer_norm (#89466)
Summary: As titled.

Test Plan:
```
buck2 run mode/opt scripts/wwei6:test_executorch
```

Differential Revision: D41451390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89466
Approved by: https://github.com/frank-wei, https://github.com/ezyang
2022-11-24 02:18:32 +00:00
fdb2dd113d Install missing VSX headers (POWER) (#85547)
E.g. `test_cpp_extensions_aot_ninja` fails as it includes `vec.h`, which requires the vec/vsx/* headers and `sleef.h`. The latter is also required for AVX512 builds on non-MSVC compilers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85547
Approved by: https://github.com/kit1980
2022-11-24 01:52:11 +00:00
e922bd4e52 [ONNX] Move two headers from .h to .cc (#86852)
As title. Header dependency should be as small as possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86852
Approved by: https://github.com/titaiwangms, https://github.com/BowenBao
2022-11-24 01:30:09 +00:00
23fe2ff910 verify the number of outputs of xla graph (#89536)
This PR adds tests to verify the number of outputs returned by an XLA graph. The understanding from this PR will help us fix https://github.com/pytorch/torchdynamo/issues/1908 and eventually enable training for the dynamo/torchxla integration. Sending this PR separately so Jack can help verify whether the behavior is expected and play with it.

Some code snippets are listed here since their behavior is not straightforward at first glance:
```
    def forward(self, a, b, c):
        """
        The XLA graph will only return the first 2 items
        """
        return a + b, a + c, b
```

```
    def forward(self, a, b, c):
        """
        Inplace update on b cause it to be returned in XLA graph
        """
        b.zero_()
        return a + b, a + c, b
```

```
    def forward(self, a, b, c):
        """
        Even if we return b twice, the XLA graph only return b once.
        """
        b.zero_()
        return a + b, a + c, b, b
```

Here is what the added tests observe:

1. XLA does not return outputs that are also inputs -- as long as the tensor is not updated in place. At first glance one may wonder why we should consider this kind of 'unrealistic' corner case, but such graphs indeed show up in AOTAutograd. The main reason is that AOTAutograd lifts all model parameters/buffers as graph inputs and may return some of them. Check ***test_direct_return***
2. If a tensor is updated in place, XLA will still return it as a graph output even if it's also an input. The only difference compared to item 1 is that the in-place update on the tensor causes it to be returned. This happens for BatchNorm2d since the running_mean/variance tensors are updated in place during training. Check ***test_direct_return_with_inplace_update***

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89536
Approved by: https://github.com/jansel
2022-11-24 01:28:13 +00:00
0bde514981 Add c10:: namespace in front of optional (#89605)
Prep change for moving the codebase to the C++17 standard.
Was part of https://github.com/pytorch/pytorch/pull/85969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89605
Approved by: https://github.com/weiwangmeta, https://github.com/kit1980
2022-11-24 00:57:17 +00:00
e19a7165fd [nn] Remove deprecation warning from nn.functional.{tanh, sigmoid} (#86905)
Fixes #65909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86905
Approved by: https://github.com/albanD, https://github.com/kit1980
2022-11-24 00:34:26 +00:00
a00bd6f686 Don't run auto request review on forked PRs (#89583)
tested on https://github.com/pytorch/pytorch/pull/89581
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89583
Approved by: https://github.com/albanD, https://github.com/malfet
2022-11-23 23:48:35 +00:00
0a1a53083e [primTorch] Enable regex error testing for some refs (#87765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87765
Approved by: https://github.com/mruberry
2022-11-23 23:36:27 +00:00
3ad2a032f4 Update default cmake to 3.18 (#89570)
Set `cmake.dir` to `/usr/local` in `.circleci/scripts/build_android_gradle.sh`.
Prep change for raising the compiler standard to C++17: cmake-3.18 is the first version to support the CUDA C++17 language standard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89570
Approved by: https://github.com/atalman
2022-11-23 23:23:26 +00:00
8695f0cced Rectify native_batch_norm schema by splitting it into two legit schemas (#88697)
Using the same repro from the issue (but with BatchNorm2D)

Rectifies native_batch_norm schema by splitting the schema into 2:
1. one will have NON-optional alias-able running_mean and running_var inputs
2. the other will just not have those parameters at all (no_stats variation)

**Calling for name suggestions!**

## test plan
I've added tests in test_functionalization.py as well as an entry in common_method_invocations.py for `native_batch_norm_legit`
CI should pass.

## next steps
Because of bc/fc reasons, we reroute native_batch_norm to call our new schemas ONLY through the python dispatcher, but in 2 weeks or so, we should make `native_batch_norm_legit` the official batch_norm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88697
Approved by: https://github.com/albanD
2022-11-23 23:23:17 +00:00
a00efe55c3 Fix CheckOutputStreamSetting on JitLoggingTest as it failed if logging wasn't enabled. (#82722)
`JIT_LOG` checks whether logging is enabled for that particular file, and when it isn't it doesn't output anything. Since the test checks the size of `test_stream`, it fails. Forcing the file to have logging enabled just to see whether the stream is being set correctly during the test makes no sense, so this patch simply forces output and checks whether it worked.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82722
Approved by: https://github.com/davidberard98
2022-11-23 22:46:29 +00:00
b8d3afd886 Skip upload test stats for test reports from rerun disabled tests workflow (#89548)
I have found the reason why uploading test stats fails for the rerun disabled tests workflow, for example https://github.com/pytorch/pytorch/actions/runs/3522896778/jobs/5917765699.  The problem is that the pytest XML file is now too big to be processed quickly (x50 bigger). Unlike unittest, `pytest-flakefinder`, used by rerun disabled tests for test_ops, includes skipped messages multiple times (50 times by default, retrying and skipping).  This slows down the upload test stats script too much (O(n)) because it tries to gather all the stats. On the other hand, `check_disabled_tests` doesn't suffer from the same issue because it ignores all these skipped messages.

This is a quick fix to skip test reports from rerun disabled tests workflow when trying to upload test stats.

I'll try to fix this properly later by changing the way we use pytest-flakefinder. From what I see, a zipped test report from rerun disabled tests is only a few MB ([example](https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3521687954/1/artifact/test-reports-test-default-1-2-linux.2xlarge_9636028803.zip)), but it balloons into a much bigger XML file after extraction, from a dozen to a few hundred MB of text.  The size of the zipped file is not a big immediate problem.

### Testing

[3521687954](https://github.com/pytorch/pytorch/actions/runs/3521687954) is an example workflow with rerun disabled tests and mem leak check.  The script can now finish when running locally:

* `upload_test_stats` finishes around 3+ minutes
```
time python -m tools.stats.upload_test_stats --workflow-run-id 3521687954 --workflow-run-attempt 1 --head-branch master
...
Writing 8925 documents to S3
Done!
Writing 1760 documents to S3
Done!
Writing 1675249 documents to S3
Done!
python3 -m tools.stats.upload_test_stats --workflow-run-id 3521687954  1    185.69s user 12.89s system 75% cpu 4:22.82 total
```

* `check_disabled_tests` finishes within 3 minutes
```
time python -m tools.stats.check_disabled_tests --workflow-run-id 3521687954 --workflow-run-attempt 1 --repo pytorch/pytorch
...
python -m tools.stats.check_disabled_tests --workflow-run-id 3521687954  1    154.19s user 4.17s system 97% cpu 2:42.50 total
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89548
Approved by: https://github.com/clee2000
2022-11-23 22:39:39 +00:00
f18f0c70ab Dont clone unmutated args in triton autotuning (#89519)
Improves first memory compression on pytorch struct from 0.55 -> 0.73. However, it doesn't totally eliminate the overhead from autotuning. Any other pointers on where the autotuning overhead is coming from would be great.

Edit: I think it's just the triton cache clearing 44f577984d/python/triton/testing.py (L159)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89519
Approved by: https://github.com/ngimel, https://github.com/jansel
2022-11-23 22:00:03 +00:00
ac19c5be82 FFT: disable dimension wrapping for scalar tensors (#89234)
Fixes #88985

By default, `maybe_wrap_dim` allows through `dim=0` or `dim=-1`
for scalar tensors which leads to an invalid dimension being used to
index into `tensor.sizes()` as in the code sample from the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89234
Approved by: https://github.com/mruberry
2022-11-23 21:55:00 +00:00
50e2e4faf3 Sparse CSC/BSR/BSC serialization and pickle support (#89553)
Fixes https://github.com/pytorch/pytorch/issues/89497
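A quick round-trip check of the newly supported behavior (CSC shown; BSR/BSC are analogous):
```python
import io
import pickle
import torch

x = torch.eye(3).to_sparse_csc()

buf = io.BytesIO()
torch.save(x, buf)                  # serialization
buf.seek(0)
y = torch.load(buf)

z = pickle.loads(pickle.dumps(x))   # pickling
assert torch.equal(x.to_dense(), y.to_dense())
assert torch.equal(x.to_dense(), z.to_dense())
```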

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89553
Approved by: https://github.com/cpuhrsch
2022-11-23 20:56:48 +00:00
a8d6b82167 Fix norm decomp when dtype is passed in (#89508)
Fix for https://github.com/pytorch/torchdynamo/issues/1889. The wrapper was doing a downcast even when the dtype was explicitly passed in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89508
Approved by: https://github.com/anijain2305
2022-11-23 20:49:09 +00:00
72110d7833 Fix Upsample Decomp Striding For Small Channels (#89528)
Fix for https://github.com/pytorch/torchdynamo/issues/623.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89528
Approved by: https://github.com/ngimel, https://github.com/anijain2305
2022-11-23 20:47:39 +00:00
b7483be06a [quant][docs] Add docstrings for operators defined in torch.ops.quantized_decomposed namespace (#89547)
Summary:
no functionality changes

Test Plan:
NA

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89547
Approved by: https://github.com/vkuzo
2022-11-23 20:40:53 +00:00
a188f05e8c Reland #89031 Added conv constraint that infers layouts (#89530)
Relands #89031
Per title. We now set strides from the FX graph only for convolutions and mm. This is a hack, but bmm caused an extra copy in some cases and there is no obvious way to fix that; we should rethink strides anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89530
Approved by: https://github.com/Chillee
2022-11-23 20:18:54 +00:00
e800d27b10 [dashboard] Add graphs for all summary metrics, add additional testing flags (#89580)
Title. Test post: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1325572179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89580
Approved by: https://github.com/davidberard98
2022-11-23 20:11:39 +00:00
953f39578a Mark IPU device as not supports_as_strided (#89130)
Currently causes issues in calls to `.to`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89130
Approved by: https://github.com/albanD
2022-11-23 19:51:53 +00:00
37e46a5035 [Dynamo] Fix several bugs & code refactor in RangeVariable (#89322)
Fix bug in [7k github models](https://github.com/pytorch/torchdynamo/issues/1884): https://github.com/jansel/pytorch-jit-paritybench/blob/master/generated/test_clovaai_stargan_v2.py
```
E       TypeError: 'list' object cannot be interpreted as an integer
E
E       from user code:
E          File "/scratch/ybliang/work/repos/pytorch-jit-paritybench/generated/test_clovaai_stargan_v2.py", line 335, in forward
E           idx = torch.LongTensor(range(y.size(0)))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89322
Approved by: https://github.com/jansel
2022-11-23 19:44:48 +00:00
91dcef41ae Thread PG: add allreduce to threaded pg (#89043)
Summary:
Goal
Add `all_reduce` collective  to multi-threaded ProcessGroup added in D40236769 (6663ae5537).

Code Motion
Added `allreduce` collective to ProcessLocalGroup (a subclass of c10d ProcessGroup).

What's Next
Add a DDP test utilizing the new allreduce op.
Generalize `allreduce` to allow other `ReduceOp`s besides `SUM`.

Test Plan:
cd fbcode/caffe2
buck2 test mode/dev //caffe2/test/distributed:multi_threaded

Differential Revision: D41046606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89043
Approved by: https://github.com/wanchaol
2022-11-23 19:43:30 +00:00
27db806888 Handle Tensor.__deepcopy__ via clone(), on IPU (#89129)
Currently it falls through to a call to `storage()`, which the IPU doesn't support.

I've made the minimal change here for ease of merging (this'd help us if it was in for 1.13.1), however...

**QUESTION**: Is there any reason why `not torch._C._has_storage(self)` needs to *also* be guarded on `self.device.type == privateuseone`? in other words, could the condition for using `clone` not be this?

```python
self.is_sparse
or self.device.type
in ["lazy", "xla", "mps", "ort", "meta", "hpu", "ipu"]
or not torch._C._has_storage(self)
or (type(self) is not Tensor and self.data_ptr() == 0)
```

If the condition fails, the very next thing is a call to `self._typed_storage()` which will fail, so it feels to me like *any* case without storage shouldn't fall through to the `storage()` call.

The original PR for adding the 'no storage and device is `PrivateUse1`' condition ([86557](https://github.com/pytorch/pytorch/pull/86557)) doesn't discuss whether this could be broadened.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89129
Approved by: https://github.com/albanD
2022-11-23 19:41:09 +00:00
fa7a963f65 Remove BaseException TODO (#89540)
After discussion in https://github.com/pytorch/pytorch/pull/88461#issuecomment-1318965664
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89540
Approved by: https://github.com/H-Huang
2022-11-23 19:39:49 +00:00
9eed6b7f9a [Dynamo] Several fixes on TensorVariable & TorchVariable (#89486)
This is a group of bug fixes for [7k github models](https://github.com/pytorch/torchdynamo/issues/1884); it fixes 30+ model tests.
* Support ```tensor.type()```.
* Support ```tensor.get_device()```.
* Support ```torch.nn.functional._Reduction.get_enum```.
* Support ```torch._utils._get_device_index()```.
* Fallback ```tensor.data_ptr()```.
  * ```FakeTensor``` always returns 0
  * When fake tensor propagation is off, we ```clone``` the input tensor, so tracking the original ```data_ptr``` makes no sense. And I don't think this is a very popular API anyway.
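A tiny example exercising one of the calls listed above under dynamo (the decorated function is illustrative):
```python
import torch

@torch._dynamo.optimize("eager")
def f(x):
    # tensor.type() is one of the newly supported calls
    if x.type() == "torch.FloatTensor":
        return x + 1
    return x - 1

f(torch.randn(3))
```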

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89486
Approved by: https://github.com/jansel
2022-11-23 19:39:45 +00:00
f03e6672fb [Checkpoint][2D] Minor update for dedup_tensors.py (#89542)
Rename variables for better readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89542
Approved by: https://github.com/H-Huang
2022-11-23 19:39:04 +00:00
74703eb502 [Checkpoint] Add a logger to dedup_tensors (#89503)
Add a logger to dedup_tensors to log the duplicate keys to remove in global plan (List of SavePlan).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89503
Approved by: https://github.com/fduwjj
2022-11-23 19:36:03 +00:00
57353c9608 first draft of input mutation handling for aot autograd (#88817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88817
Approved by: https://github.com/ezyang, https://github.com/wconstab
2022-11-23 19:20:11 +00:00
902e4e3926 Revert "Fix the kineto daemon build condition (#89174)"
This reverts commit 9fd00f194ae4e28948a9a03a6382c20dde04e4fd.

Reverted https://github.com/pytorch/pytorch/pull/89174 on behalf of https://github.com/robieta due to For some reason this is interacting badly with NVFuser. I think it is instability in kineto, but until we figure out what's going on reverting is a necessary evil.
2022-11-23 19:05:14 +00:00
049a0f2cd5 [inductor] Update CI model tests (#89499)
Summary:
1) Add model inference test
2) Switch model training test to use AMP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89499
Approved by: https://github.com/bertmaher
2022-11-23 18:30:51 +00:00
95474e00a9 [quant][be] Remove unused util code (#89272)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89272
Approved by: https://github.com/andrewor14
2022-11-23 18:27:41 +00:00
128faf2b69 [quant][be] Refactor the error checking code for quantize_per_channel op (#89271)
Summary:
at

Test Plan:
make sure it compiles

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89271
Approved by: https://github.com/andrewor14
2022-11-23 18:27:41 +00:00
71c0e84914 Gate leak check and reruns on schedule (#89504)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89504
Approved by: https://github.com/huydhn
2022-11-23 18:27:37 +00:00
c9d4390d13 Add Pluggable CUDA allocator backend (#86786)
Fixes #43144

This uses the Backend system added by [82682](https://github.com/pytorch/pytorch/pull/82682) to change allocators dynamically during code execution. This will allow us to use RMM, to use CUDA managed memory for portions of the code that do not fit in GPU memory, to write static memory allocators that reduce fragmentation while training models, and to improve interoperability with external DL compilers/libraries.

For example, we could have the following allocator in c++

```c++
#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>

extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void *ptr;
   std::cout<<"alloc "<< size<<std::endl;
   cudaMalloc(&ptr, size);
   return ptr;
}

void my_free(void* ptr) {
   std::cout<<"free "<<std::endl;
   cudaFree(ptr);
}
}
```

Compile it as a shared library
```
nvcc allocator.cc -o alloc.so -shared --compiler-options '-fPIC'
```

And use it from PyTorch as follows

```python
import torch

# Init caching
# b = torch.zeros(10, device='cuda')
new_alloc = torch.cuda.memory.CUDAPluggableAllocator('alloc.so', 'my_malloc', 'my_free')
old = torch.cuda.memory.get_current_allocator()
torch.cuda.memory.change_current_allocator(new_alloc)
b = torch.zeros(10, device='cuda')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(old)
```

Things to discuss
- How to test this, needs compiling external code ...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86786
Approved by: https://github.com/albanD
2022-11-23 17:54:36 +00:00
1333fdcff1 [test_nn] split parametrization test from test_nn (#89552)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89552
Approved by: https://github.com/albanD
2022-11-23 17:27:40 +00:00
347a7d97a5 Deprecate decorating classes with torch.no_grad and similar (#89522)
Fixes https://github.com/pytorch/pytorch/issues/89450

I would have completely removed it, but I don't think this is particularly urgent and there is some use of it in the wild: https://github.com/search?q=%2Ftorch%5C.no_grad%5C%28%5C%29%5Cnclass%2F&type=code
So we might as well take one release to do it.
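For clarity, a toy example of the pattern being deprecated versus the still-supported ones:
```python
import torch

# deprecated: decorating a whole class
@torch.no_grad()
class Evaluator:
    def run(self, x):
        return x * 2

# still supported: decorate individual methods/functions, or use the context manager
class Evaluator2:
    @torch.no_grad()
    def run(self, x):
        return x * 2
```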
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89522
Approved by: https://github.com/lezcano, https://github.com/soulitzer, https://github.com/janeyx99
2022-11-23 16:51:42 +00:00
2de38a0714 Add torch._dynamo to docs (#89510)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89510
Approved by: https://github.com/msaroufim
2022-11-23 16:33:13 +00:00
de0dee30d0 [PT-D][3/N] Sync TP API change to Pytorch (#89535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89535
Approved by: https://github.com/wanchaol
2022-11-23 16:13:49 +00:00
795473ff5e Call symint::sizes() instead of sizes() on convolution error messages. (#89549)
This PR fixes convolution when using `torchdynamo` with dynamic shapes.

**Problem:** there are some `tensor.sizes()` calls in a few error messages. As a result, an uninformative error message was being displayed.

```python
@torch._dynamo.optimize("eager")
def foo(inp, w):
    return F.conv2d(inp, w)

inp = torch.rand((1, 1, 32, 32))
w = torch.rand((1, 2, 3, 3))
#                  |
#                  |--------- incorrect shape!

foo(inp, w)
```

-----
**Before this PR:**
```python
Traceback (most recent call last):
  File "torch/_dynamo/utils.py", line 1076, in run_node
    return node.target(*args, **kwargs)
  File "torch/_subclasses/fake_tensor.py", line 867, in __torch_dispatch__
    op_impl_out = op_impl(self, func, *args, **kwargs)
  File "torch/_subclasses/fake_tensor.py", line 445, in conv
    conv_backend = torch._C._select_conv_backend(**kwargs)
RuntimeError: Cannot call sizes() on tensor with symbolic sizes/strides
```

**After this PR:**
```python
Traceback (most recent call last):
  File "torch/_dynamo/utils.py", line 1076, in run_node
    return node.target(*args, **kwargs)
  File "torch/_subclasses/fake_tensor.py", line 867, in __torch_dispatch__
    op_impl_out = op_impl(self, func, *args, **kwargs)
  File "torch/_subclasses/fake_tensor.py", line 445, in conv
    conv_backend = torch._C._select_conv_backend(**kwargs)
RuntimeError: Given groups=1, weight of size [1, s1, s2, s2], expected input[1, 1, s0, s0] to have s1 channels, but got 1 channels instead
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89549
Approved by: https://github.com/ezyang
2022-11-23 15:56:54 +00:00
39772a6a01 [quant] Add support for quantize_per_channel in the reference flow with decomposed tensor (#89270)
Summary:
att, after this PR we can produce quantize_per_channel and dequantize_per_channel ops (typically used for quantizing weights)
in the reference flow using decomposed tensor
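
For context, a minimal numeric sketch of what per-channel quantize/dequantize computes (affine quantization along a chosen axis). This is illustrative math only; the helper names below are made up and are not the decomposed ops added by this PR.

```python
import torch

def quantize_per_channel(x, scales, zero_points, axis, qmin, qmax):
    shape = [1] * x.dim()
    shape[axis] = -1
    s, zp = scales.reshape(shape), zero_points.reshape(shape)
    return torch.clamp(torch.round(x / s) + zp, qmin, qmax).to(torch.int8)

def dequantize_per_channel(q, scales, zero_points, axis):
    shape = [1] * q.dim()
    shape[axis] = -1
    return (q.to(torch.float32) - zero_points.reshape(shape)) * scales.reshape(shape)

w = torch.randn(4, 3)                      # e.g. a weight with 4 output channels
scales = w.abs().amax(dim=1) / 127         # one scale per output channel
zps = torch.zeros(4, dtype=torch.int64)    # symmetric quantization -> zero points of 0
wq = quantize_per_channel(w, scales, zps, axis=0, qmin=-128, qmax=127)
w_dq = dequantize_per_channel(wq, scales, zps, axis=0)
print((w - w_dq).abs().max())              # small per-channel quantization error
```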

Test Plan:
python test/test_quantization.py -k test__convert_to_reference_decomposed_fx_per_channel_quant

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89270
Approved by: https://github.com/vkuzo
2022-11-23 10:57:04 +00:00
c651944f92 [test_nn] split hooks test from test_nn (#89201)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89201
Approved by: https://github.com/albanD
2022-11-23 08:39:45 +00:00
dd140fc351 [test_nn] move init tests from test_nn (#89202)
Ref: https://github.com/pytorch/pytorch/issues/63085

Note: Doesn't need corresponding XLA PR as the migrated tests were not run on XLA (as they weren't in TestNNDeviceType).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89202
Approved by: https://github.com/albanD
2022-11-23 08:30:51 +00:00
7594e043b8 Fix Use-after-Free in qembeddingbag_byte_prepack_out (#84750)
When FBGEMM is not used (either manually disabled or on platforms such as POWER where it isn't supported at all) the fallback code requests a `data_ptr<float>` on a `Tensor` object returned by `to(ScalarType::Float)` in the same line. This object will be destroyed at the end of the line leading to a dangling pointer.

On some platforms this manifests as wrong results being returned as the memory gets overwritten. On other platforms anything may happen because this is undefined behavior, although most likely it will just crash or continue to return semi-random results, which may even happen to be correct (when the memory is not reused yet).

Fix this by binding the temporary object (or initial object) to a const lvalue reference, which extends its lifetime, and getting the `data_ptr` from that.

Fixes #84748

This bug was introduced by a seemingly unrelated change in #64081 hence ccing @d1jang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84750
Approved by: https://github.com/kimishpatel
2022-11-23 06:50:08 +00:00
07dd2fe6c3 Symintify select (#89326)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89326
Approved by: https://github.com/ezyang
2022-11-23 05:00:33 +00:00
29742786f3 [quant] Add dequantize_per_channel in quantized_decomposed op library (#89269)
Summary:
att

Test Plan:
python test/test_quantization.py -k test_decomposed_dequantize_per_channel

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89269
Approved by: https://github.com/vkuzo
2022-11-23 04:25:25 +00:00
5266953443 Add crossref debug mode for functionalization, catches stride errors (#89498)
The idea is to add a custom handler to Functionalize key in Python
dispatcher that runs the functionalized version along side a non
functionalized version, and checks that their outputs agree in the
end.  (Technically, for metadata mutation we should also check the
inputs, but for now we're relying on those functions returning self.)
I turned this on for test_functionalize.py (new TestCrossRefFunctionalize)
and found a bunch of failures that look legit.

This probably doesn't interact that nicely if you're also tracing at
the same time, probably need more special logic for that (directly,
just disabling tracing for when we create the nested fake tensor mode,
but IDK if there's a more principled way to organize this.)

There are some misc fixups which I can split if people really want.

- xfail_inherited_tests moved to test common_utils
- Bindings for _dispatch_tls_set_dispatch_key_included,
  _dispatch_tls_is_dispatch_key_included and _functionalization_reapply_views_tls
- Type stubs for _enable_functionalization, _disable_functionalization
- all_known_overloads utility to let you iterate over all OpOverloads
  in all namespaces.  Iterator support on all torch._ops objects to let
  you iterate over their members.
- suspend_functionalization lets you temporarily disable functionalization mode
  in a context
- check_metadata_matches for easily comparing outputs of functions and see
  if they match (TODO: there are a few copies of this logic, consolidate!)
- _fmt for easily printing the metadata of a tensor without its data
- _uncache_dispatch for removing a particular dispatch key from the cache,
  so that we force it to regenerate
- check_significant_strides new kwarg only_cuda to let you also do stride
  test even when inputs are not CUDA
- Functionalize in torch._C.DispatchKey

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89498
Approved by: https://github.com/malfet
2022-11-23 04:18:25 +00:00
fe990c8db9 [BE] Add more ssh instructions (#89516)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89516
Approved by: https://github.com/huydhn
2022-11-23 03:31:17 +00:00
5b51ca6808 Update CUDA compiler matrix (#86360)
Switch GCC/Clang max versions to be exclusive, as `include/crt/host_config.h` checks only the major version for the upper bound. This makes the constraint less restrictive and matches the checks in the aforementioned header.
Also update the versions using that header in the CUDA SDKs.

Follow up to #82860

I noticed this as PyTorch 1.12.1 with CUDA 11.3.1 and GCC 10.3 was failing in the `test_cpp_extensions*` tests.

Example for CUDA 11.3.1 from the SDK header:

```
#if __GNUC__ > 11
// Error out
...
#if (__clang_major__ >= 12) || (__clang_major__ < 3) || ((__clang_major__ == 3) &&  (__clang_minor__ < 3))
// Error out
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86360
Approved by: https://github.com/ezyang
2022-11-23 03:07:22 +00:00
504570d577 Delete unused variable assignment in _refs/__init__.py (#89538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89538
Approved by: https://github.com/huydhn
2022-11-23 02:59:25 +00:00
ed32511974 Don't use explain() for --explain; instead read it off the counters (#89518)
Fixes huggingface problem where example_inputs is not actually the
args.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89518
Approved by: https://github.com/albanD
2022-11-23 02:43:53 +00:00
f5d18574a3 Allow Module forward-pre and forward hooks to take kwargs (#89389)
closes #35643

This PR is mostly borrowed from #82042. Thanks @Padarn for implementing
the first version and debugging into the errors.

Based on the discussion in #82042 this PR adds a with_kwargs
argument to the register_forward_pre_hook and register_forward_hook
methods. When the arg is set to true, the provided hook must accept
kwargs. Under the hood, this PR adds
`_forward_pre_hooks_with_kwargs` and `_forward_hooks_with_kwargs`
sets to keep track of which hooks accept kwargs.
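
A minimal sketch of how the new argument can be used, assuming the signatures described above (pre-hooks receive `(module, args, kwargs)`, forward hooks receive `(module, args, kwargs, output)`); the toy module is made up for illustration.

```python
import torch
import torch.nn as nn

class Scaler(nn.Module):
    def forward(self, x, scale=1.0):
        return x * scale

def pre_hook(module, args, kwargs):
    kwargs["scale"] = 2.0          # rewrite a keyword argument before forward runs
    return args, kwargs

def post_hook(module, args, kwargs, output):
    print("scale =", kwargs["scale"], "output =", output)

m = Scaler()
m.register_forward_pre_hook(pre_hook, with_kwargs=True)
m.register_forward_hook(post_hook, with_kwargs=True)
m(torch.ones(2), scale=1.0)        # prints scale = 2.0, output = tensor([2., 2.])
```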

Differential Revision: [D41431111](https://our.internmc.facebook.com/intern/diff/D41431111)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89389
Approved by: https://github.com/soulitzer
2022-11-23 02:43:32 +00:00
4935b597ac Added implementation and tests for MPS Hardswish (#87952)
## What?
Fixes issue #86807 by adding MPS backend support for aten::hardswish.

## How?
Registered mps hardswish functions in native_functions.yaml, and added the code implementation to Activations.mm.

Added functions:
- hardswish_mps
- hardswish_mps_
- hardswish_backward_mps
- hardswish_out_mps

## Testing
Added test in test/test_mps.py and tested code using the command `python3 test/test_mps.py -k test_hardswish`
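
For reference, a minimal usage sketch; it only exercises the new kernels on a machine with an MPS-capable GPU and an MPS-enabled build.

```python
import torch
import torch.nn.functional as F

if torch.backends.mps.is_available():
    x = torch.randn(8, device="mps", requires_grad=True)
    y = F.hardswish(x)        # forward runs on MPS
    y.sum().backward()        # backward exercises the new backward kernel
    print(y.cpu(), x.grad.cpu())
```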

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87952
Approved by: https://github.com/kulinseth, https://github.com/kit1980
2022-11-23 02:18:03 +00:00
1cfd3858ac [inductor] Use dense masks for indirect indexing (#89524)
Fixes https://github.com/pytorch/torchdynamo/issues/1654

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89524
Approved by: https://github.com/jansel
2022-11-23 00:48:00 +00:00
26322544b8 Add limited FSDP correctness to torchdynamo benchmark (#89469)
- Does not do recursive wrapping
- Only supports accuracy bench
- Mainly useful for sweeping over models for correctness, in part
  to evaluate whether dynamo support for FSDP is breaking anywhere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89469
Approved by: https://github.com/davidberard98, https://github.com/aazzolini
2022-11-23 00:19:36 +00:00
7f4b4d2827 [Inductor] Limit g++12 installation to Linux (#89472)
According to https://anaconda.org/conda-forge/gxx/ it's only available on Linux

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89472
Approved by: https://github.com/soumith, https://github.com/jgong5
2022-11-23 00:07:59 +00:00
b50699f247 Fix inductor fallback_random for dropout/rand_like (#89515)
- Avoid fx graph rewrite that replaces certain ops with ones using
  triton random
- Keep track of replacement ops using triton random, so it is possible
  to not disable all replacements when using fallback_random

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89515
Approved by: https://github.com/ngimel
2022-11-22 23:53:47 +00:00
8bf8e4d71e [dashboard] Add metric graphs back to dashboard (#89531)
Title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89531
Approved by: https://github.com/davidberard98
2022-11-22 23:42:09 +00:00
ce856cee7e [test_nn] fix missing class attributes for NNTestCase (#89200)
Missed setting these class variables 😓
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89200
Approved by: https://github.com/albanD
2022-11-22 22:55:44 +00:00
391b593ca2 [quant] Add quantize_per_channel in quantized_decomposed op library (#89268)
Summary:
att

Test Plan:
python test/test_quantization.py -k test_decomposed_quantize_per_channel

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89268
Approved by: https://github.com/vkuzo
2022-11-22 22:40:11 +00:00
5bba783d21 [dashboard] Remove aot_cudagraphs and nvprims_nvfuser (#89514)
Helps speed up Dashboard runs.

We will bring these back when the backends are ready to be tested on the full model suite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89514
Approved by: https://github.com/SherlockNoMad
2022-11-22 22:25:30 +00:00
ea920a1115 [Vulkan][TCC] Add tests for quantize_per_tensor and dequantize (#89496)
Summary: Add tests for quantize per tensor and dequantize

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Reviewed By: salilsdesai

Differential Revision: D41047097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89496
Approved by: https://github.com/digantdesai
2022-11-22 22:15:57 +00:00
74e62a1fef [ROCm] Optimize layer norm backward kernel for ROCm (#87635)
We observed that the native PyTorch LayerNormBackwardKernelImplInternal has suboptimal performance on AMD GPUs for certain input sizes in the benchmark script of [PR #68238](https://github.com/pytorch/pytorch/pull/68238#issue-1051621716), especially when `fs` (=`config_m` in our benchmark script) is large and `bs` (=`config_n` in our benchmark script) is small, which is commonly seen in [the CvT model](https://arxiv.org/abs/2103.15808).

This PR is to replace `GammaBetaBackwardCUDAKernel` with the Apex layernorm backward kernel with some ROCm-specific parameter tuning when `fs`  (=`config_m`) is larger than 512 on AMD GPUs.

There are a few PRs for LayerNorm kernel:
- https://github.com/pytorch/pytorch/pull/26201
- https://github.com/pytorch/pytorch/pull/27634
- https://github.com/pytorch/pytorch/pull/68238

Therefore, we have tested and compared the kernel before and at this PR with the input shapes in the last two PRs along with those commonly used in the CvT model on AMD MI100.

---
**Current**

M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float)
-- | -- | -- | -- | -- | --
50432 | 384 | 0.387256 | 1.372758 | 0.378975 | 1.47892
50176 | 384 | 0.38231 | 1.362416 | 0.378084 | 1.473886
200704 | 192 | 0.997859 | 4.315875 | 0.989306 | 4.560827
802816 | 64 | 3.671828 | 16.68013 | 3.613515 | 16.827946
200 | 256 | 0.066503 | 0.332096 | 0.071422 | 0.325349
1000 | 256 | 0.071848 | 0.333355 | 0.073038 | 0.334753
6000 | 256 | 0.086334 | 0.345139 | 0.086834 | 0.347429
6272 | 256 | 0.088601 | 0.347906 | 0.087855 | 0.351245
200 | 512 | 0.071626 | 0.329726 | 0.073798 | 0.326878
1000 | 512 | 0.073975 | 0.330226 | 0.074166 | 0.332751
6000 | 512 | 0.099617 | 0.362367 | 0.100095 | 0.378313
6272 | 512 | 0.100378 | 0.358066 | 0.099857 | 0.395982
200 | 1024 | 0.072954 | 0.326382 | 0.073899 | 0.333007
1000 | 1024 | 0.0743 | 0.325532 | 0.071126 | 0.330991
6000 | 1024 | 0.127025 | 0.390084 | 0.128692 | 0.471504
6272 | 1024 | 0.130704 | 0.403536 | 0.135244 | 0.487133
200 | 1536 | 0.070331 | 0.339169 | 0.070086 | 0.331015
1000 | 1536 | 0.075085 | 0.330042 | 0.076295 | 0.328778
6000 | 1536 | 0.148889 | 0.44949 | 0.155781 | 0.659987
6272 | 1536 | 0.154939 | 0.478871 | 0.17673 | 0.716025
200 | 2048 | 0.070269 | 0.335585 | 0.072804 | 0.334655
1000 | 2048 | 0.080094 | 0.326991 | 0.080426 | 0.32685
6000 | 2048 | 0.187888 | 0.623023 | 0.245762 | 0.981635
6272 | 2048 | 0.195431 | 0.65244 | 0.262574 | 1.008141
200 | 3072 | 0.068205 | 0.339428 | 0.073068 | 0.344034
1000 | 3072 | 0.087554 | 0.328899 | 0.09218 | 0.346433
6000 | 3072 | 0.240352 | 0.905058 | 0.368135 | 1.280462
6272 | 3072 | 0.26179 | 0.959387 | 0.387782 | 1.476524
128 | 2097152 | 5.905976 | 22.724793 | 10.287974 | 30.242092
256 | 1048576 | 4.561596 | 19.554308 | 10.223171 | 29.42371
512 | 524288 | 4.146751 | 22.7247 | 11.404285 | 39.175902
1024 | 262144 | 5.193135 | 23.403325 | 11.334512 | 38.947192
2048 | 131072 | 4.992907 | 23.377801 | 11.400286 | 40.889191
4096 | 65536 | 5.429488 | 24.275701 | 11.196778 | 41.4751
8192 | 32768 | 5.35758 | 21.360312 | 10.535418 | 42.875646
16384 | 16384 | 5.44947 | 20.852605 | 10.357685 | 34.603408
32768 | 8192 | 4.688925 | 17.379392 | 9.635596 | 31.188271


---------
**At this PR**

M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float)
-- | -- | -- | -- | -- | --
50432 | 384 | 0.38797 | 0.93103 | 0.37966 | 1.15283
50176 | 384 | 0.3874 | 0.96417 | 0.38462 | 1.18595
200704 | 192 | 1.00002 | 2.40876 | 0.99224 | 2.55579
802816 | 64 | 3.67348 | 7.98658 | 3.61871 | 7.72404
200 | 256 | 0.07292 | 0.35119 | 0.07195 | 0.32602
1000 | 256 | 0.07354 | 0.33325 | 0.07237 | 0.33742
6000 | 256 | 0.08819 | 0.33283 | 0.08453 | 0.3279
6272 | 256 | 0.0886 | 0.33446 | 0.08774 | 0.33426
200 | 512 | 0.0701 | 0.33505 | 0.07072 | 0.33018
1000 | 512 | 0.07042 | 0.33442 | 0.074 | 0.33206
6000 | 512 | 0.09931 | 0.34956 | 0.09895 | 0.3572
6272 | 512 | 0.10103 | 0.32976 | 0.10041 | 0.36635
200 | 1024 | 0.07144 | 0.33579 | 0.07209 | 0.33216
1000 | 1024 | 0.0736 | 0.32803 | 0.07286 | 0.32936
6000 | 1024 | 0.12584 | 0.38916 | 0.12852 | 0.48273
6272 | 1024 | 0.13053 | 0.38804 | 0.13464 | 0.49545
200 | 1536 | 0.07159 | 0.3396 | 0.07062 | 0.33545
1000 | 1536 | 0.07443 | 0.33239 | 0.07366 | 0.33204
6000 | 1536 | 0.14959 | 0.45043 | 0.15826 | 0.69119
6272 | 1536 | 0.1542 | 0.47644 | 0.18249 | 0.72208
200 | 2048 | 0.07258 | 0.33982 | 0.07412 | 0.33859
1000 | 2048 | 0.0793 | 0.32816 | 0.07864 | 0.32583
6000 | 2048 | 0.18973 | 0.571 | 0.25506 | 0.91796
6272 | 2048 | 0.19719 | 0.64208 | 0.26445 | 0.95055
200 | 3072 | 0.07092 | 0.33867 | 0.07104 | 0.34695
1000 | 3072 | 0.08727 | 0.33144 | 0.09144 | 0.36633
6000 | 3072 | 0.24683 | 0.87275 | 0.37761 | 1.3289
6272 | 3072 | 0.26437 | 0.91178 | 0.38496 | 1.53694
128 | 2097152 | 6.27936 | 23.69425 | 10.40004 | 30.13699
256 | 1048576 | 4.5404 | 19.47675 | 10.28494 | 29.36936
512 | 524288 | 4.13951 | 18.78771 | 10.09557 | 32.67083
1024 | 262144 | 4.47576 | 18.00411 | 9.56488 | 31.47117
2048 | 131072 | 4.28026 | 16.95619 | 9.40297 | 30.82845
4096 | 65536 | 4.2653 | 16.5018 | 9.03315 | 30.08392
8192 | 32768 | 4.25613 | 16.13583 | 8.9258 | 30.75296
16384 | 16384 | 4.20256 | 16.38207 | 9.52587 | 31.31113
32768 | 8192 | 4.20231 | 16.19452 | 9.31478 | 31.03514


---------

**Performance Improvement (%)**

M | N | fwdbwd,   torch.float16 | fwdbwd,   torch.float32
-- | -- | -- | --
50432 | 384 | 32.178 | 22.049
50176 | 384 | 29.231 | 19.536
200704 | 192 | 44.188 | 43.962
802816 | 64 | 52.119 | 54.100
200 | 256 | -5.750 | -0.206
1000 | 256 | 0.031 | -0.797
6000 | 256 | 3.566 | 5.621
6272 | 256 | 3.865 | 4.836
200 | 512 | -1.615 | -1.010
1000 | 512 | -1.270 | 0.208
6000 | 512 | 3.534 | 5.581
6272 | 512 | 7.905 | 7.483
200 | 1024 | -2.883 | 0.254
1000 | 1024 | -0.767 | 0.493
6000 | 1024 | 0.237 | -2.381
6272 | 1024 | 3.840 | -1.707
200 | 1536 | -0.127 | -1.340
1000 | 1536 | -0.711 | -0.992
6000 | 1536 | -0.209 | -4.728
6272 | 1536 | 0.508 | -0.846
200 | 2048 | -1.262 | -1.176
1000 | 2048 | -0.358 | 0.312
6000 | 2048 | 8.350 | 6.487
6272 | 2048 | 1.588 | 5.713
200 | 3072 | 0.223 | -0.848
1000 | 3072 | -0.773 | -5.743
6000 | 3072 | 3.570 | -3.783
6272 | 3072 | 4.962 | -4.092
128 | 2097152 | -4.266 | 0.348
256 | 1048576 | 0.397 | 0.185
512 | 524288 | 17.325 | 16.605
1024 | 262144 | 23.070 | 19.195
2048 | 131072 | 27.469 | 24.605
4096 | 65536 | 32.023 | 27.465
8192 | 32768 | 24.459 | 28.274
16384 | 16384 | 21.439 | 9.514
32768 | 8192 | 6.818 | 0.491


---------
**Benchmark script of this PR**
```
# Ref:
#       1. https://github.com/pytorch/pytorch/pull/26201
#       2. https://github.com/pytorch/pytorch/pull/68238

import torch
from torch.nn import LayerNorm
import timeit

number_runs = 1000  # TODO: Modify this to save time!
def test_forward(layer_norm_cuda, input_cuda):
    layer_norm_cuda(input_cuda); torch.cuda.synchronize()

def test_backward(out_cuda, layer_norm_grad_cuda, create_graph):
    out_cuda.backward(layer_norm_grad_cuda, retain_graph=True, create_graph=create_graph); torch.cuda.synchronize()

def test_fwdbwd(input_cuda, layer_norm_cuda, gO):
    input_cuda.grad = None
    layer_norm_cuda.zero_grad(set_to_none=True)
    out = layer_norm_cuda(input_cuda)
    out.backward(gO)
    torch.cuda.synchronize()

def benchmark(config_m, config_n):

    print("M | N | fwd (half) | fwdbwd (half) | fwd (float) | fwdbwd (float)")
    if len(config_m) != len(config_n):
        print("Please make sure the lengths of config_m and config_m are the same.")

    for i in range(len(config_m)):
        normalized_shape = config_n[i]
        results = [config_m[i], config_n[i]]
        for dtype in (torch.half, torch.float):
            if dtype == torch.half:
                layer_norm_cuda = LayerNorm(normalized_shape).half().cuda()
            else:
                layer_norm_cuda = LayerNorm(normalized_shape).cuda()

            input_cuda = torch.randn(config_m[i], config_n[i], device='cuda', dtype=dtype, requires_grad=True)

            # print("cuda forward:")
            result_fwd = timeit.timeit(lambda: test_forward(layer_norm_cuda, input_cuda), number=number_runs)
            results.append(result_fwd / number_runs * 1000)

            gO = torch.rand_like(input_cuda)

            result_fwdbwd = timeit.timeit(lambda: test_fwdbwd(input_cuda, layer_norm_cuda, gO), number=number_runs)
            results.append(result_fwdbwd / number_runs * 1000)

        print('{:09d}|{:09d}|{:9.5f}|{:9.5f}|{:9.5f}|{:9.5f}'.format(results[0], results[1], results[2], results[3], results[4], results[5]))

    print("Times are in microseconds (us).")

# CVT
config_m_cvt = [50432, 50176, 200704, 802816]
config_n_cvt = [384, 384, 192, 64]

# https://github.com/pytorch/pytorch/pull/68238#issue-1051621716
config_m_68238 = [200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272, 200, 1000, 6000, 6272]
config_n_68238 = [256,256,256,256,512,512,512,512,1024,1024,1024,1024,1536,1536,1536,1536,2048,2048,2048,2048,3072,3072,3072,3072]

# https://github.com/pytorch/pytorch/pull/27634
config_m_27634 = [128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
config_n_27634 = [2097152, 1048576, 524288, 262144, 131072, 65536, 32768, 16384, 8192]

config_m = config_m_cvt + config_m_68238 + config_m_27634
config_n = config_n_cvt + config_n_68238 + config_n_27634

benchmark(config_m, config_n)
```

CC: @jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87635
Approved by: https://github.com/jataylo, https://github.com/jeffdaily, https://github.com/ezyang
2022-11-22 22:15:38 +00:00
00b7d8ef23 Shard windows periodic job more (#89455)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89455
Approved by: https://github.com/huydhn
2022-11-22 21:52:50 +00:00
77d7f2c659 [dashboard] Add commit date & fix date related issues (#89517)
Add commit date to build summary of dashboard. Make the date of the run reflective of when the run started, not when the run ended. Use PST (UTC -8) to determine day, rather than GMT (UTC +0).

Test comment: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1324176119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89517
Approved by: https://github.com/anijain2305
2022-11-22 21:17:36 +00:00
177baf366a Fix vectorized trigonometric functions for VSX (#86453)
Replace the remaining hand-written code in vec256_float_vsx.h by calls to Sleef functions similar to what was done in #59382 & #82646 after #41541

This fixes wrong results for e.g. `sin(1e20)`.
Fixes #85978

To fix #85978 I only needed to do the sin/cos functions to make the test pass but to not encounter the same issue again and again (see the previous PRs and issues) I checked the whole file for similar functions where a Sleef function could be used and changed those too. In the diff I've noticed the faulty whitespace so to make this complete I fixed that too, so it should now be done.
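
A quick sanity check for this class of bug (on an affected VSX build before the fix, the torch value disagreed with the libm reference):

```python
import math
import torch

x = torch.tensor([1e20], dtype=torch.float64)
print(torch.sin(x).item(), math.sin(1e20))  # the two values should agree closely
```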

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86453
Approved by: https://github.com/malfet
2022-11-22 20:29:09 +00:00
ac3004757e Relax tolerance for test_out_addbmm_cpu_float32 (#86365)
The test may fail due to slightly different values caused by a different order of matrices in SGEMM:

> Mismatched elements: 1 / 50 (2.0%)
> Greatest absolute difference: 1.430511474609375e-05 at index (4, 5) (up to 1e-05 allowed)
> Greatest relative difference: 4.65393206065873e-06 at index (4, 5) (up to 1.3e-06 allowed)

Observed on POWER (ppc64le)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86365
Approved by: https://github.com/mruberry, https://github.com/kit1980
2022-11-22 20:27:29 +00:00
d053d51343 (Further) limit world size in test_fsdp_pure_fp16 (#86280)
Test still fails when run on 5 A100 GPUs, although it works with 5 V100s. Using 4 GPUs seems to be fine.

Followup to #85957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86280
Approved by: https://github.com/awgu, https://github.com/kit1980
2022-11-22 20:25:38 +00:00
c2ce79f06e Fix dev-discuss link in the maintainer docs (#89493)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89493
Approved by: https://github.com/H-Huang
2022-11-22 19:33:21 +00:00
ef8b91fec7 enable previously failing UCC distributed_test.py tests (#89023)
Enables previously failing UCC distributed_test.py tests that are now fixed due to either the ProcessGroupUCC barrier blocking fix (https://github.com/pytorch/pytorch/pull/86961) or the UCC-side timeout error handling fix (https://github.com/openucx/ucc/pull/679/files). Bumps the upstream UCC version to build UCC with the timeout error handling fix merged in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89023
Approved by: https://github.com/kwen2501, https://github.com/malfet
2022-11-22 19:05:56 +00:00
f281f435a8 Fix benchmarks - xla tensor test (#89509)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89509
Approved by: https://github.com/ngimel, https://github.com/shunting314
2022-11-22 18:42:13 +00:00
7c0bb61291 Force numpy prod to use 64 bit integers on Windows in some tests (#88089)
This fixes some prod and masked.prod tests on Windows.

np.prod uses int32 on Windows so it overflows.

On Linux it uses int64 by default.
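
A minimal reproduction of the difference (forcing `dtype=np.int32` reproduces the Windows default behavior on any platform):

```python
import numpy as np

a = np.array([2] * 40)
print(np.prod(a, dtype=np.int32))  # wraps around: 2**40 mod 2**32 == 0
print(np.prod(a, dtype=np.int64))  # 1099511627776 == 2**40
```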

Fixes #77305
Fixes #77320
Fixes #77334
Fixes #77335
Fixes #77336
Fixes #77337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88089
Approved by: https://github.com/mruberry
2022-11-22 18:37:14 +00:00
f4898daaee Add cached conda env file for Buck CI workflow (#89422)
Fixes - T137631262

Caching conda dependencies for build workflows.
Conda dependencies have been gathered from the workflow https://github.com/pytorch/pytorch/blob/master/.github/workflows/_buck-build-test.yml

The pull request updates the action from `conda-incubator/setup-miniconda@v2` to `pytorch/test-infra/.github/actions/setup-miniconda@main` as it supports caching.

Test Plan:

Running the `ciflow/periodic` which runs the ci builds `buck-build-test` workflow. Expected output is to have all the conda dependencies cached.

<img width="1227" alt="Screenshot 2022-11-22 at 15 44 20" src="https://user-images.githubusercontent.com/15447437/203343298-e55c384b-01ad-45c3-a5e9-ba5c53149be4.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89422
Approved by: https://github.com/huydhn
2022-11-22 18:00:01 +00:00
9c0bf9387c Meta impl for linalg_cholesky and linalg_cholesky_ex (#89430)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89430
Approved by: https://github.com/ezyang
2022-11-22 17:05:34 +00:00
c4e08387c1 [quant][fx] Support producing reference quantized patterns for dynamic quantization (#89248)
Summary:
split the is_decomposed logic for `_replace_observer_with_quantize_dequantize_node` in a separate function and added support for dynamic quantization in the decomposed version of this function.

In case of dynamic quantization, we'll produce the following reference quantized pattern in decomposed mode:
```
x -> choose_qparams -> quantize_per_tensor -> dequantize_per_tensor -> linear
```
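
For context, a minimal numeric sketch of what that pattern computes (pick qparams from the activation at runtime, quantize, dequantize, then run the floating-point linear). Illustrative math only; the `choose_qparams` helper below is made up and is not the decomposed op referenced above.

```python
import torch
import torch.nn.functional as F

def choose_qparams(x, qmin=-128, qmax=127):
    min_val = x.min().clamp(max=0.0)
    max_val = x.max().clamp(min=0.0)
    scale = (max_val - min_val) / float(qmax - qmin)
    zero_point = (qmin - torch.round(min_val / scale)).clamp(qmin, qmax).to(torch.int64)
    return scale, zero_point

x, w = torch.randn(2, 4), torch.randn(3, 4)
scale, zp = choose_qparams(x)                                  # runtime qparams
xq = torch.clamp(torch.round(x / scale) + zp, -128, 127)       # quantize_per_tensor
x_dq = (xq - zp) * scale                                       # dequantize_per_tensor
print(F.linear(x_dq, w))                                       # linear on the dequantized input
```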

Test Plan:
python test/test_quantization.py -k test__convert_to_reference_decomposed_fx_dynamic_quant

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89248
Approved by: https://github.com/vkuzo
2022-11-22 16:45:13 +00:00
2823fc5e4c [inductor] generate nan in the cpp backend (#89289)
Summary: Fixes https://github.com/pytorch/torchdynamo/issues/1797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89289
Approved by: https://github.com/ngimel, https://github.com/jansel, https://github.com/jgong5
2022-11-22 15:54:04 +00:00
5797f74924 [19/N] Add monitored_barrier custom op with CPU implementation (#89318)
Differential Revision: [D41415324](https://our.internmc.facebook.com/intern/diff/D41415324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89318
Approved by: https://github.com/kwen2501
2022-11-22 14:18:40 +00:00
be22b5d39f [18/N] Add allgather_coalesced custom op with CPU/CUDA implementations (#89317)
Differential Revision: [D41415321](https://our.internmc.facebook.com/intern/diff/D41415321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89317
Approved by: https://github.com/kwen2501
2022-11-22 14:14:17 +00:00
d9cbe7764e Make aten.copy preserve strides (hf_Longformer) (#89464)
Fixes https://github.com/pytorch/torchdynamo/issues/1888

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D41460986](https://our.internmc.facebook.com/intern/diff/D41460986)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89464
Approved by: https://github.com/bdhirsh
2022-11-22 13:06:43 +00:00
2d94fd3b19 [Vulkan][TCC] Fix quantized shaders (#89456)
Summary: Fix rounding issue in quantized shaders

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Reviewed By: salilsdesai

Differential Revision: D41047095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89456
Approved by: https://github.com/kirklandsign, https://github.com/digantdesai
2022-11-22 11:05:58 +00:00
0f7dca1733 Vectorized CPU code implementing right shift operator. (#88990)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88990
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2022-11-22 10:10:38 +00:00
1d6a188d08 Reland Dispatch torch.norm to linalg.vector_norm and linalg.matrix_norm (#81761) (#84624)
Reland https://github.com/pytorch/pytorch/pull/81761

Differential Revision: [D39332292](https://our.internmc.facebook.com/intern/diff/D39332292)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84624
Approved by: https://github.com/kit1980
2022-11-22 07:53:24 +00:00
6b085d5cad [Checkpoint][2D][2/N] Add traverse for distributed checkpoint to core distributed (#89398)
This PR moves traverse and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.

This is used when flatten nested dict and flatten sharded tensors.

Docstring and comments will be added in the following PRs.

Test:
```
python3 test/distributed/_tensor/parallel/test_2d_parallel.py
```
and CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89398
Approved by: https://github.com/wanchaol
2022-11-22 07:49:09 +00:00
7b0650d5cf Back out "[static-runtime] change the backend for permute_copy" (#89463)
Summary: This permute copy change seems to be causing huge regressions on machines without AVX512. Revert to mitigate. This shouldn't be problematic since the improvement from changing it was very small anyway.

Differential Revision: D41450088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89463
Approved by: https://github.com/hlu1
2022-11-22 06:26:10 +00:00
f2cf1b0f5e Revert submodule updates introduced by #89157 (#89449)
Reverts updates that were introduced by https://github.com/pytorch/pytorch/pull/89157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89449
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/clee2000
2022-11-22 05:48:43 +00:00
40cf214f2d Support masked_fill to address the GPT2 performance issue (#89274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89274
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-22 04:12:43 +00:00
e545caa50f dynamo/torchxla integration: trace on xla rather than eager (#88904)
In #87741 we added inference support for the dynamo/torchxla integration. Later on, in #88449, we attempted to add training support. That attempt was not smooth because
- we tried 2 things together:
   1. let dynamo trace the model on xla rather than eager
   2. enable training
- it turns out neither of these two tasks is trivial.

Furthermore, item 2 (enable training) depends on item 1 (tracing on xla). We enable training via AOTAutograd. AOTAutograd lifts all model parameters/buffers as graph inputs. Without item 1 being done, we would need to copy all graph inputs (including model parameters/buffers) from the eager device to xla devices. That hurts performance a lot. Having a cache that maps eager parameters to XLA parameters does not solve the problem since an update to either will not sync automatically to the other. They easily go out of sync.

This PR lets dynamo trace the model on XLA rather than eager. This is a preparation step for enabling training.

Also, tracing on XLA makes the data movement more efficient. We see a 1.5x geomean speedup compared to the previous 1.38x.
```
+-------------------------+--------------------+-------------------------+
| Model                   |   XLA (trace once) |   XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18                |            1.38    |                 1.008   |
+-------------------------+--------------------+-------------------------+
| resnet50                |            1.227   |                 0.998   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |            1.544   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| alexnet                 |            1.085   |                 1.045   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |            2.028   |                 1.013   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |            1.516   |                 0.995   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           |            0.868   |                 1.01    |
+-------------------------+--------------------+-------------------------+
| vgg16                   |            1.099   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |            3.26    |                 1.027   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |            2.182   |                 1.015   |
+-------------------------+--------------------+-------------------------+
| geomean                 |            1.50389 |                 1.01261 |
+-------------------------+--------------------+-------------------------+
```

Example command
```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --only resnet18 --backend=torchxla_trace_once
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88904
Approved by: https://github.com/wconstab, https://github.com/JackCaoG, https://github.com/jansel
2022-11-22 03:57:04 +00:00
1dae59ba16 [Checkpoint][2D][1/N] Add dedup_tensors for distributed checkpoint to core distributed (#89399)
This PR moves dedup_tensors and its test to torch.distributed.checkpoint. This is a pre-req for enabling 2D checkpoint.

This removes duplicated shards in list of SavePlan. It is used when saving DT with replicated placement.

Docstring and comments will be added in the following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89399
Approved by: https://github.com/wanchaol
2022-11-22 03:52:35 +00:00
ce342ed2d3 Fix retrying logic for successful unittest tests under --rerun-disabled-tests mode (#89454)
When looking into Rockset data for disabled unit tests, for example `testAdd`, I see that it's re-run only 3 times instead of 50+ times as expected under rerun-disabled-tests mode:

```
[
  {
    "name": "testAdd",
    "classname": "TestLazyReuseIr",
    "filename": "lazy/test_reuse_ir.py",
    "flaky": false,
    "num_green": 3,
    "num_red": 0
  }
]
```

It turns out that I made a mistake mixing `RERUN_DISABLED_TESTS` and `report_only` into `(RERUN_DISABLED_TESTS or report_only) and num_retries_left < MAX_NUM_RETRIES` in https://github.com/pytorch/pytorch/pull/88646.  The retrying logic for successful tests under rerun-disabled-tests mode is never executed because num_retries_left would be equal to MAX_NUM_RETRIES (not smaller) if the very first run succeeds. Thus, the sample test `testAdd` finishes right away (1 success count). See the sketch after the notes below for the intended behavior.

* `report_only` and `RERUN_DISABLED_TESTS` are 2 different things and shouldn't be mixed together. RERUN_DISABLED_TESTS has the higher priority.
* We also don't want to retry skipped tests under rerun-disabled-tests mode because they are only skipped due to the `check_if_enable` check: `Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run`.
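
The sketch below spells out the intended decision for a passing test, using the names from the description above; it is illustrative only and not the actual test-runner code.

```python
MAX_NUM_RETRIES = 50

def should_keep_rerunning(rerun_disabled_tests, report_only, num_retries_left):
    if rerun_disabled_tests:
        # Highest priority: under --rerun-disabled-tests, keep re-running a
        # passing test until the retry budget is used up, even right after the
        # very first success (when num_retries_left == MAX_NUM_RETRIES).
        return num_retries_left > 0
    # Plain report-only mode: only keep going if a retry was already consumed.
    return report_only and num_retries_left < MAX_NUM_RETRIES
```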

### Testing

* CI https://github.com/pytorch/pytorch/actions/runs/3518228784 generates https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3518228784/1/artifact/test-reports-test-default-4-4-linux.4xlarge.nvidia.gpu_9627285587.zip in which `testAdd` is correctly called multiple times and `TestLazyReuseIr` is skipped correctly
* Locally

```
# export CI=1
# export PYTORCH_RETRY_TEST_CASES=1
# export PYTORCH_OVERRIDE_FLAKY_SIGNAL=1
# export PYTORCH_TEST_RERUN_DISABLED_TESTS=1
$ python test/run_test.py --verbose -i lazy/test_reuse_ir
Ignoring disabled issues:  []
Selected tests:
 lazy/test_reuse_ir
Prioritized test from test file changes.
reordering tests for PR:
prioritized: []
the rest: ['lazy/test_reuse_ir']

Downloading https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/slow-tests.json to /Users/huydo/Storage/mine/pytorch/test/.pytorch-slow-tests.json
Downloading https://raw.githubusercontent.com/pytorch/test-infra/generated-stats/stats/disabled-tests-condensed.json to /Users/huydo/Storage/mine/pytorch/test/.pytorch-disabled-tests.json
parallel (file granularity) tests:
 lazy/test_reuse_ir
serial (file granularity) tests:

Ignoring disabled issues:  []
Ignoring disabled issues:  []
Running lazy/test_reuse_ir ... [2022-11-21 13:21:07.165877]
Executing ['/Users/huydo/miniconda3/envs/py3.9/bin/python', '-bb', 'lazy/test_reuse_ir.py', '-v', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2022-11-21 13:21:07.166279]

Expand the folded group to see the log file of lazy/test_reuse_ir
##[group]PRINTING LOG FILE of lazy/test_reuse_ir (/Users/huydo/Storage/mine/pytorch/test/test-reports/lazy-test_reuse_ir_6cf_dxa1)

Running tests...
----------------------------------------------------------------------
Test results will be stored in test-reports/python-unittest/lazy.test_reuse_ir
  testAdd (__main__.TestLazyReuseIr) ... ok (1.215s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 50
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 49
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 48
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 47
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 46
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 45
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 44
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 43
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 42
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 41
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 40
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 39
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 38
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 37
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 36
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 35
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 34
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 33
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 32
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 31
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 30
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 29
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 28
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 27
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 26
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 25
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 24
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 23
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 22
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 21
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 20
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 19
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 18
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 17
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 16
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 15
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 14
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 13
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 12
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 11
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 10
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 9
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 8
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 7
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 6
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 5
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 4
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 3
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 2
ok (0.001s)
  testAdd (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 1
ok (0.001s)
  testAddSub (__main__.TestLazyReuseIr) ...     testAdd succeeded - num_retries_left: 0
skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)
  testAddSubFallback (__main__.TestLazyReuseIr) ... skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)
  testBatchNorm (__main__.TestLazyReuseIr) ... skip: Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run (0.001s)

----------------------------------------------------------------------
Ran 54 tests in 1.264s

OK (skipped=3)
```

Here is the sample rockset query

```
WITH added_row_number AS (
  SELECT
    *,
    ROW_NUMBER() OVER(PARTITION BY name, classname, filename ORDER BY _event_time DESC) AS row_number
  FROM
    commons.rerun_disabled_tests
)
SELECT
  name,
  classname,
  filename,
  flaky,
  num_green,
  num_red
FROM
  added_row_number
WHERE
  row_number = 1
  AND name = 'testAdd'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89454
Approved by: https://github.com/clee2000
2022-11-22 03:39:17 +00:00
338f619044 [vision hash update] update the pinned vision hash (#89471)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89471
Approved by: https://github.com/pytorchbot
2022-11-22 03:38:56 +00:00
00b9473ad6 [PT-D][Tensor Parallelism][2/N] Sync TP API change to PT prod (#89467)
This is part of TP Beta Release efforts.
ref: https://github.com/pytorch/tau/issues/576
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89467
Approved by: https://github.com/wanchaol
2022-11-22 03:05:53 +00:00
82713a1cc4 [inductor][compilation time] Fallback when kernel size for avg/max pool is large (#89448)
This reduces compilation time for yolov3 from 400 seconds to 48 seconds. yolov3 has a 13x13 max_pool2d kernel, which was generating very large Triton code.
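
For illustration, the kind of op this targets (the exact size threshold for falling back is an implementation detail of this PR):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 32, 104, 104)
y = F.max_pool2d(x, kernel_size=13, stride=1, padding=6)  # yolov3-style 13x13 pooling
print(y.shape)  # torch.Size([1, 32, 104, 104])
```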

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89448
Approved by: https://github.com/ngimel
2022-11-22 02:23:24 +00:00
496c8ae760 [xnnpack][lite-int] Handle Constant Data (#89445)
Handling constant data for xnnpack delegation. This allows us to handle new modules such as:

```
class Module(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self._constant = torch.ones(4, 4, 4)

            def forward(self, x):
                return x + self._constant
```

This is the precursor work to handling convolution, as we need to serialize constant data (weights).

Differential Revision: [D41050349](https://our.internmc.facebook.com/intern/diff/D41050349/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89445
Approved by: https://github.com/digantdesai
2022-11-22 02:20:54 +00:00
120d200620 Revert "Added conv constraint that infers layouts (#89031)" (#89451)
This reverts commit 716f70f19a4b63268da2a753afdbe9b385a831ab.

Fixes performance regression and compilation latency increase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89451
Approved by: https://github.com/soumith, https://github.com/jansel
2022-11-22 02:20:50 +00:00
06dffb3319 dont clone symints, dont clobber symint proxies (#88230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88230
Approved by: https://github.com/albanD
2022-11-22 01:37:43 +00:00
58a74f34f9 [17/N] Add _reduce_scatter_base custom op with CPU/CUDA implementation (#88903)
Differential Revision: [D41415325](https://our.internmc.facebook.com/intern/diff/D41415325)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88903
Approved by: https://github.com/kwen2501
2022-11-22 00:42:11 +00:00
7174572b1e Add torchvis support to dist bench (#89324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89324
Approved by: https://github.com/davidberard98, https://github.com/albanD
2022-11-22 00:41:33 +00:00
57ed94804e Bind DispatchKey.Functionalize in pybind11 (#89452)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89452
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-11-22 00:32:30 +00:00
b189a7444d [fix] tril & tril : out of bound check (#89384)
Fixes #83326

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89384
Approved by: https://github.com/ngimel
2022-11-22 00:15:34 +00:00
dbc354b262 Mitigate flaky test_ops_fwd_gradients on macOS (#89410)
This has been flaky on macOS for a while ([hud](https://hud.pytorch.org/failure/RuntimeError%3A%20test_ops_fwd_gradients%20failed)) and I can reproduce it locally. The issue was raised by https://github.com/pytorch/pytorch/issues/66033 and it seems to point to macOS itself: https://github.com/graphia-app/graphia/issues/33.  So this PR switches to a single thread when running `test_ops_fwd_gradients` on macOS as a mitigation for the flaky tests.
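
A minimal sketch of the mitigation (the exact guard used in the test file may differ; this is an assumption for illustration):

```python
import platform
import torch

if platform.system() == "Darwin":
    torch.set_num_threads(1)  # avoid the flaky multi-threaded path on macOS
```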

### Testing

`pytest test_ops_fwd_gradients.py -k test_fn_fwgrad_bwgrad -vv --flake-finder` to run all `test_fn_fwgrad_bwgrad` tests 50 times to make sure they all pass (no flaky anymore)

https://hud.pytorch.org/tests shows that `test_ops_fwd_gradients` on macOS takes about 15m to finish or 8 minute if using 2 shards like in the test.  There is no obvious difference in the test duration:

```
2022-11-21T21:34:18.6078080Z Running test_ops_fwd_gradients ... [2022-11-21 21:34:18.600663]
2022-11-21T21:34:21.6805770Z Executing ['/Users/runner/work/_temp/conda_environment_3517515737/bin/python', '-bb', 'test_ops_fwd_gradients.py', '-v', '--use-pytest', '-vv', '-rfEX', '-x', '--reruns=2', '--shard-id=0', '--num-shards=2', '-k=not _linalg_cholesky_', '--import-slow-tests', '--import-disabled-tests'] ... [2022-11-21 21:34:21.680156]
2022-11-21T21:34:21.6806380Z Ignoring disabled issues:  []
2022-11-21T21:34:21.6815250Z Executing ['/Users/runner/work/_temp/conda_environment_3517515737/bin/python', '-bb', 'test_ops_fwd_gradients.py', '-v', '--use-pytest', '-vv', '-rfEX', '-x', '--reruns=2', '--shard-id=1', '--num-shards=2', '-k=not _linalg_cholesky_', '--import-slow-tests', '--import-disabled-tests'] ... [2022-11-21 21:34:21.681174]
2022-11-21T21:34:21.6815830Z Ignoring disabled issues:  []
.....
2022-11-21T21:40:42.2422700Z =============================== warnings summary ===============================
.....
2022-11-21T21:40:42.2424670Z - generated xml file: /Users/runner/work/pytorch/pytorch/test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-47b619449ea7db1f.xml -
2022-11-21T21:40:42.2424850Z = 831 passed, 596 skipped, 5 deselected, 17 xfailed, 1 warning in 374.54s (0:06:14) =
.....
2022-11-21T21:42:00.1923310Z =============================== warnings summary ===============================
.....
2022-11-21T21:42:00.1925370Z - generated xml file: /Users/runner/work/pytorch/pytorch/test/test-reports/python-pytest/test_ops_fwd_gradients/test_ops_fwd_gradients-d24ee6419a602a6e.xml -
2022-11-21T21:42:00.1925540Z = 828 passed, 603 skipped, 7 deselected, 20 xfailed, 1 warning in 452.94s (0:07:32) =
....
2022-11-21T21:42:09.9035670Z FINISHED PRINTING LOG FILE of test_ops_fwd_gradients (/Users/runner/work/pytorch/pytorch/test/test-reports/test_ops_fwd_gradients_ha_3rfhb)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89410
Approved by: https://github.com/soulitzer
2022-11-22 00:13:38 +00:00
ea50549ce6 Suppress guards when creating fake tensors (#89349)
When we create fake tensors, we may call operators that introduce
guards, to accurately reconstruct views.  But these guards are spurious:
if a user is able to present a tensor that "looks the same", they have
implicitly fulfilled the contract that the view is creatable.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89349
Approved by: https://github.com/voznesenskym
2022-11-21 23:14:20 +00:00
fa4980cd5e Add commit hash to dynamo dashboard (#89462)
Title - also fix a small bug with dashboard outputs.

Sample: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1322732698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89462
Approved by: https://github.com/anijain2305
2022-11-21 22:56:13 +00:00
186192bb26 [Dynamo] Fix bugs when calling tensor.data and tensor.layout (#89257)
Fix bugs in [7k github models](https://github.com/pytorch/torchdynamo/issues/1884).
* Legacy code still uses ```tensor.data```; I think we can rewrite it with ```tensor.detach``` (see the sketch after this list), though I'm not sure if there is anything I didn't anticipate.
* Support ```tensor.layout```.
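
As mentioned in the first point, a minimal sketch of the rewrite (both return a view that does not track gradients):

```python
import torch

x = torch.randn(3, requires_grad=True)
legacy = x.data         # legacy pattern still common in older code
preferred = x.detach()  # what dynamo can rewrite it to
print(torch.equal(legacy, preferred), legacy.requires_grad, preferred.requires_grad)
```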

The root cause of these issues is that dynamo wraps an unimplemented ```tensor.x``` call into ```GetAttrVariable(TensorVariable, x)```, but this op is not inserted into the FX graph. Hence, during fake tensor propagation, it throws ```KeyError: 'example_value'```.

Dynamo should support these two popular attributes anyway. However, whether dynamo should support ___all___ ```tensor.x``` calls and not fall back to ```GetAttrVariable``` is debatable, I think.
If I turn off fake tensor propagation, it works well even not including this fix. So I'm curious if we should improve the fake propagation to cover similar cases. cc @mlazos @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire @jansel @eellison

```
Traceback (most recent call last):
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/convert_frame.py", line 404, in _compile
    out_code = transform_code_object(code, transform)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/bytecode_transformation.py", line 341, in transform_code_object
    transformations(instructions, code_options)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/convert_frame.py", line 392, in transform
    tracer.run()
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 1523, in run
    super().run()
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 389, in run
    and self.step()
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 359, in step
    getattr(self, inst.opname)(inst)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 193, in wrapper
    return inner_fn(self, inst)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 865, in CALL_FUNCTION_KW
    self.call_function(fn, args, kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 301, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/variables/torch.py", line 407, in call_function
    tensor_variable = wrap_fx_proxy(
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/variables/builder.py", line 636, in wrap_fx_proxy
    return wrap_fx_proxy_cls(
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/variables/builder.py", line 676, in wrap_fx_proxy_cls
    example_value = get_fake_value(proxy.node, tx)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 1024, in get_fake_value
    args, kwargs = torch.fx.node.map_arg((node.args, node.kwargs), visit)
  File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 613, in map_arg
    return map_aggregate(a, lambda x: fn(x) if isinstance(x, Node) else x)
  File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 621, in map_aggregate
    t = tuple(map_aggregate(elem, fn) for elem in a)
  File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 621, in <genexpr>
    t = tuple(map_aggregate(elem, fn) for elem in a)
  File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 627, in map_aggregate
    return immutable_dict((k, map_aggregate(v, fn)) for k, v in a.items())
  File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 627, in <genexpr>
    return immutable_dict((k, map_aggregate(v, fn)) for k, v in a.items())
  File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 631, in map_aggregate
    return fn(a)
  File "/scratch/ybliang/work/repos/pytorch/torch/fx/node.py", line 613, in <lambda>
    return map_aggregate(a, lambda x: fn(x) if isinstance(x, Node) else x)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/utils.py", line 1022, in visit
    return n.meta["example_value"]
KeyError: 'example_value\n\nfrom user code:\n   File "./generated/test_BayesWatch_pytorch_prunes.py", line 108, in forward\n    return torch.zeros([x.size()[0], self.channels, x.size()[2] // self.spatial, x.size()[3] // self.spatial], dtype=x.dtype, layout=x.layout, device=x.device)\n\nSet torch._dynamo.config.verbose=True for more information\n\n\nYou can suppress this exception and fall back to eager by setting:\n    torch._dynamo.config.suppress_errors = True\n'

```
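
As a hedged illustration of the proposed rewrite (this snippet is not from the PR's diff), `x.data` and `x.detach()` both return a tensor that shares storage with `x` and does not require grad, so legacy call sites can usually be rewritten mechanically:

```python
import torch

x = torch.randn(3, requires_grad=True)

legacy = x.data       # shares storage, requires_grad=False, bypasses autograd's version-counter checks
modern = x.detach()   # same semantics for reads, but in-place edits are still checked by autograd

assert not legacy.requires_grad and not modern.requires_grad
assert legacy.data_ptr() == modern.data_ptr() == x.data_ptr()
```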

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89257
Approved by: https://github.com/jansel
2022-11-21 22:44:01 +00:00
821ba6b51b [4/n] Thread PG: add reduce_scatter to threaded pg (#89442)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89442
Approved by: https://github.com/yhcharles, https://github.com/fduwjj
2022-11-21 22:36:44 +00:00
3e99d4db76 [3/n] Thread PG: add scatter to threaded pg (#89441)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89441
Approved by: https://github.com/XilunWu, https://github.com/yhcharles, https://github.com/fduwjj
2022-11-21 22:36:44 +00:00
3876f94c3d [2/n] Thread PG: add test for broadcast (#89440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89440
Approved by: https://github.com/XilunWu, https://github.com/yhcharles, https://github.com/fduwjj
2022-11-21 22:36:42 +00:00
deae450899 [1/n] Thread PG: add test for allgather (#89439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89439
Approved by: https://github.com/XilunWu, https://github.com/yhcharles, https://github.com/fduwjj
2022-11-21 22:36:41 +00:00
047e542a1a [tools] expose selective build library (#89351)
Change the base module and visibility of `tools:gen_oplist_lib` so that it can be reused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89351
Approved by: https://github.com/cccclai
2022-11-21 21:08:13 +00:00
c068fa900f [inductor] Misc division lowering fixes (#88603)
1. `aten.div.Tensor_mode` should allow broadcasting
2. `div` can use `ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT`
3. `prims.div` on integers should be truncating division
4. Add lowering for `true_divide` which is aliased to `div`
5. register lowering for inplace version of `div_mode`
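
As a hedged eager-mode sketch of the behavior these lowerings should match (plain `torch.div` semantics, not Inductor internals):

```python
import torch

a = torch.tensor([[7, -7]])   # int64, shape (1, 2)
b = torch.tensor([2])         # broadcasts against `a`

# div with a rounding mode broadcasts like any other elementwise op (item 1)
trunc = torch.div(a, b, rounding_mode="trunc")   # tensor([[ 3, -3]])

# plain integer / integer division promotes to floating point (items 2 and 4)
true_div = torch.true_divide(a, b)               # tensor([[ 3.5000, -3.5000]])

print(trunc, true_div)
```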

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88603
Approved by: https://github.com/ngimel
2022-11-21 20:56:41 +00:00
1267dcf297 [inductor] Fix nan handling for aten.sign (#88937)
ATen gives `sign(nan) == 0` but inductor's cuda codegen would give
`sign(nan) == 1`.
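
A minimal eager-mode check of the reference behavior (the expected value is the one stated in this commit message, not something verified here):

```python
import torch

nan = torch.tensor(float("nan"))
# Per the commit message, ATen's eager kernel yields sign(nan) == 0,
# while the Inductor CUDA codegen previously produced 1.
print(torch.sign(nan))  # expected: tensor(0.)
```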
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88937
Approved by: https://github.com/ngimel
2022-11-21 20:56:40 +00:00
3d247a8bcd Fix unconvertible_ops as per #89261 (#89299)
Fixes #89261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89299
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2022-11-21 20:40:04 +00:00
1d9e1fca97 Update sdp dispatch logic to enable fused backward (#89154)
# Summary
Reorganizes how the SDP dispatch logic is done in order to enable backward for the fused kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89154
Approved by: https://github.com/cpuhrsch
2022-11-21 20:02:09 +00:00
cf9476554f update kineto pinned commit (#89435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89435
Approved by: https://github.com/malfet
2022-11-21 17:32:29 +00:00
e4d9dbd7d2 Port torchdynamo's torchbench script to userbenchmark (#89239)
Summary:
This Diff ports the torchbench.py script from torchdynamo to torchbench to support the development of internal models.

Currently, it only works with the `--only` option and can only test one model at a time.

Note that the noisy logs are from upstream model code, not the benchmark code.
In the internal environment, `torch._dynamo.config.base_dir` is not writable, so we add an option to specify the output directory.

Test Plan:
```
$ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --performance --only ads_dhen_5x --part over --output-directory /tmp/tb-test/
cuda eval  ads_dhen_5x
  1/  1 +0 frames   2s  1 graphs  1 graph calls  412/ 411 = 100% ops 100% time
```

```
$  buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --performance --only cmf_10x --part over --output-directory /tmp/tb-test/
cuda eval  cmf_10x
  1/  1 +0 frames   1s  1 graphs  1 graph calls  306/ 305 = 100% ops 100% time
```

Reviewed By: jansel

Differential Revision: D41294311

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89239
Approved by: https://github.com/jansel
2022-11-21 17:25:28 +00:00
9d209e7834 Revert "[ao] making _is_activation_post_process private (#87520)"
This reverts commit 45c62a337756ff9db97cd64d2d42d9e65dda0a85.

Reverted https://github.com/pytorch/pytorch/pull/87520 on behalf of https://github.com/bigfootjon due to Diff reverted internally
2022-11-21 16:48:26 +00:00
f3db03612f Revert "[ao] maintain BC for is_activation_post_process (#89260)"
This reverts commit c5fafb4e1694f141d8a1a31142cce4049d9057ed.

Reverted https://github.com/pytorch/pytorch/pull/89260 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2022-11-21 16:38:20 +00:00
6796979ee1 [Inductor] Limit the number of compile threads to the available cpu cores (#89377)
`config.compile_threads` gets the number of compile threads via `min(32,os.cpu_count())` while `os.cpu_count()` is the total number of cpu cores in the system, not the available ones. This would cause compile thread contention when the available cpu cores are less than `min(32,os.cpu_count())`, e.g., available cpu cores are limited with numactl or taskset, making the compilation very slow. This PR tries to use `len(os.sched_getaffinity(0))` if `os.sched_getaffinity` is available which returns the available number of cpu cores.
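
A hedged sketch of the selection logic described above (not the exact Inductor code):

```python
import os

def available_cpu_count() -> int:
    # os.sched_getaffinity reflects limits imposed by numactl/taskset/cgroups,
    # but it is not available on every platform (e.g. macOS), so fall back.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

compile_threads = min(32, available_cpu_count())
```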

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89377
Approved by: https://github.com/soumith
2022-11-21 14:20:36 +00:00
c2cf0bde1f Move the OpInfo same-storage error to the autograd test (#88306)
This check was previously located in the `non_contiguous` test (quite
an odd location). Moreover, at https://github.com/pytorch/pytorch/pull/86378#discussion_r993658395, Kshiteej found that this assert was not really doing anything.

We move it to the autograd test and make it a proper `self.assert`. We also disallow returning 1-tuples from sample_input functions, as they were breaking this assert.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88306
Approved by: https://github.com/mruberry
2022-11-21 13:59:03 +00:00
a80e5e7813 Update ideep for future performance improvement (#87966)
**Summary**
The update includes API changes and optimizations to reduce framework overhead, which will benefit all mkldnn (oneDNN) ops in JIT mode, the inductor CPU backend, etc. These benefits will be seen after switching to the new ideep API in future PRs.

**Test plan**
For correctness, all UTs that call mkldnn ops, including test_ops.py, test_mkldnn*.py, test_quantization.py, etc.
For performance, TorchBench has been run and no regression is found. Results are shown below.
- Intel (R) Xeon (R) IceLake with 40 cores
- Use multi-instance
- Using tcmalloc & Intel OMP

![image](https://user-images.githubusercontent.com/12522207/201631004-bb77468d-953b-4757-a001-94d44615b5f6.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87966
Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper
2022-11-21 09:52:36 +00:00
31708a7310 TorchDynamo: enable conv+silu fusion (#89278)
This PR will improve the tf_efficientnet_b0 performance by fusing conv+silu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89278
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-21 09:35:53 +00:00
bc716383a6 Redefine the simdlen semantic (#89263)
This PR is targeting to automatically enable vectorization optimization for TorchInductor. It refined the semantics of `config.cpp.simdlen`.

Originally, `None` meant to disable vectorization, while a specific value meant the number of elements to be vectorized at a time. But that depends on the data type: regarding a 256-bit SVE/SIMD ISA for ARM and X86, `simdlen` should be 16 for Float while 32 for BFloat. Hence, this PR redefines `simdlen` as the bit width. The detailed semantics are as follows.

- **_simdlen = None_**: Automatically determine the SIMD bit width. Detect HW information and pick the proper vectorization ISA. Specific for X86, the priority of AVX512 is higher than AVX2.
- **_simdlen <=1_**: Explicitly disable SIMD
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. It falls back to the disabled semantics if the bit width does not match a supported ISA width.
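
A hedged usage sketch of the three settings, assuming the `torch._inductor.config.cpp.simdlen` knob:

```python
import torch._inductor.config as inductor_config

inductor_config.cpp.simdlen = None  # auto-detect the widest supported ISA (AVX512 preferred over AVX2 on x86)
inductor_config.cpp.simdlen = 1     # <= 1: explicitly disable vectorization
inductor_config.cpp.simdlen = 256   # > 1: request a specific SIMD bit width; unsupported widths disable vectorization
```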

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89263
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-21 09:08:16 +00:00
79770d3636 TorchDynamo: enable conv+relu6 fusion (#89265)
This PR enables conv+relu6 fusion, which improves MobileNet's performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89265
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-21 08:01:07 +00:00
e0251de42f [Easy] Use prepend arg to register forward hooks in quantize.py (#89391)
Differential Revision: [D41431110](https://our.internmc.facebook.com/intern/diff/D41431110)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89391
Approved by: https://github.com/awgu
2022-11-21 05:19:47 +00:00
1db5ce095f [vision hash update] update the pinned vision hash (#89287)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89287
Approved by: https://github.com/pytorchbot
2022-11-21 03:08:33 +00:00
51e961dd7b use std/libdevice erf in inductor (#89388)
By itself, libdevice version of erf has the same perf as our decomposition, but in real workloads it leads to better fusion groups (due to fewer ops in the fused kernel).
Bonus: a few fp64 test skips removed, because our decomposition wasn't accurate enough for fp64, but libdevice version is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89388
Approved by: https://github.com/jansel
2022-11-21 00:58:03 +00:00
1856fa5df7 Temporary increase ASAN shard 5 to 4xlarge (#89387)
ASAN shard 5 also sees OOMs now (7b0d577c22); maybe we should increase all 5 of them to 4xlarge until https://github.com/pytorch/pytorch/issues/88309 is resolved
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89387
Approved by: https://github.com/kit1980
2022-11-20 23:36:50 +00:00
e1d58b1928 Revert "Update sdp dispatch logic to enable fused backward (#89154)"
This reverts commit 2e72ec79823111e8dd8c5e82c5d1b56197cd52d3.

Reverted https://github.com/pytorch/pytorch/pull/89154 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but the new test_sdp_math_gradcheck test breaks periodic slow gradcheck, i.e. 419ef2cdcf
2022-11-20 22:14:38 +00:00
c09929659c Also include MKL_THREAD_LIB in link libraries for caffe2::mkl (#89378)
Actually fixes https://github.com/pytorch/audio/issues/2784 for
real; in my previous testing I didn't check if I could import
torchaudio; now torchaudio successfully imports.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89378
Approved by: https://github.com/soumith
2022-11-20 19:47:25 +00:00
7b0d577c22 Set INTERFACE_LINK_DIRECTORIES on caffe2::mkl (#89359)
This ensures that subsequent link commands involving mkl libraries
know where to find the libraries if they are in a non-standard
location (which is the case if you installed mkl via conda, which
is what our standard instructions recommend.)

This is kind of a hack, because the MKL libraries are not actually
guaranteed to be in $MKL_ROOT/lib (they are for the conda install
though).  The real fix is to properly use the MKL targets from
FindMKL.cmake, but that's its own can of worms. See
https://github.com/pytorch/pytorch/issues/73008

This fixes https://github.com/pytorch/audio/issues/2784

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89359
Approved by: https://github.com/soumith
2022-11-20 13:34:30 +00:00
dbeacf1182 Fix cat striding in PrimTorch (#89332)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89332
Approved by: https://github.com/ngimel
2022-11-20 04:05:33 +00:00
caf3d5319f Symintify numel(), infer_size, prims.elementwise_meta (#88956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88956
Approved by: https://github.com/ezyang
2022-11-20 00:42:03 +00:00
7c811efab7 Add support for dynamic kwarg to torch._dynamo.optimize (#89290)
This is an easier way to enable dynamic shapes for a region.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89290
Approved by: https://github.com/soumith, https://github.com/jansel, https://github.com/voznesenskym
2022-11-19 23:51:02 +00:00
8ad39536d7 Revert "Symintify numel(), infer_size, prims.elementwise_meta (#88956)"
This reverts commit ce2f8700bafcf44850402a39188ec121ba8b5486.

Reverted https://github.com/pytorch/pytorch/pull/88956 on behalf of https://github.com/ezyang due to somehow breaks torch.numel
2022-11-19 21:47:55 +00:00
8ac58bc2e3 Add nullptr_t overload to c10::intrusive_ptr (#89196)
__What?__

Fixes #82413
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89196
Approved by: https://github.com/ezyang
2022-11-19 21:40:07 +00:00
5582001bd5 Reland 2 "Towards unifying symbolic and non symbolic fake tensor (#89038) (#89143)" (#89346)
This reverts commit 8e4c9828f4c990f439179912159086aaed790493.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89346
Approved by: https://github.com/wconstab
2022-11-19 21:14:31 +00:00
6afe341276 [PT-D][1/N] Sync TP Beta change to prod (#89242)
This is part of TP Beta Release efforts.

ref: https://github.com/pytorch/tau/issues/576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89242
Approved by: https://github.com/wanchaol
2022-11-19 18:01:25 +00:00
6b8c1b19b5 RM expectedFailure UnspecReproTests.test_batch_norm_act_unspec (#89340)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89340
Approved by: https://github.com/bertmaher
2022-11-19 17:49:39 +00:00
6daf60be5a [ONNX] Add setType from user into InferredType and Reliable in ConstantValueMap (#88622)
The `setType` API is not respected in the current exporter because graph-level shape type inference simply overrides every non-ONNX op shape we got from node-level shape type inference. To address this issue, this PR (1) makes a custom op with `setType` **reliable** in ConstantValueMap to secure its shape/type information in the pass _C._jit_pass_onnx, and (2) in the graph-level pass _C._jit_pass_onnx_graph_shape_type_inference, if an invalid op already carries shape/type information, we recognize it as reliable.

1. In #62856, the refactor in onnx.cpp caused a regression on custom ops, as that was the step where we should update custom op shape/type information into ConstantValueMap for the remaining ops.

2. Add another condition besides IsValidONNXNode for custom op setType in shape_type_inference.cpp: if every node output has a shape (not all dynamic), we treat the type as custom-set.

3. ~However, this PR won't solve the [issue](https://github.com/pytorch/pytorch/issues/87738#issuecomment-1292831219) that, in node-level shape type inference, the exporter warns about the unknown custom op (we process its symbolic_fn after this warning), even though it would have shape/type if setType were used correctly. That is left for another issue to solve. #84661~ Add `no_type_warning` in UpdateReliable(); it only warns if a non-ONNX node with no given type appears.
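
A hedged sketch of the user-side `setType` pattern this PR makes reliable (the domain, op name, and sizes below are made-up placeholders, not taken from this PR):

```python
import torch
from torch.onnx import register_custom_op_symbolic

def my_op_symbolic(g, x):
    out = g.op("custom_domain::MyOp", x)
    # Tell the exporter the output dtype/shape, since ONNX shape inference
    # knows nothing about this custom op.
    out.setType(x.type().with_sizes([2, 3, 4]))
    return out

register_custom_op_symbolic("custom_namespace::my_op", my_op_symbolic, opset_version=13)
```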

Fixes #81693
Fixes #87738

NOTE: not confident of this not breaking anything. Please share your thoughts if there is a robust test on your mind.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88622
Approved by: https://github.com/BowenBao
2022-11-19 17:16:59 +00:00
940959ebbf [quant][fix] Add quant_min/quant_max for default dynamic quantization observer (#89267)
Summary:
This is needed for choose_qparams, but previously it was not configurable; in the reference quantization flow
with decomposed Tensors, we make this explicit.

Test Plan:
tested in future PR

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89267
Approved by: https://github.com/vkuzo
2022-11-19 16:08:31 +00:00
808bdbab89 Fix try/except flow where DataDependentOutputException is getting wrapped in a RuntimeError (#89314)
Repro fixed

```
def fn(a):
    return a.repeat_interleave(14, dim=0).repeat_interleave(14, dim=1)

x = torch.ones(14, 14).to(dtype=torch.int64)
opt_fn = torch._dynamo.optimize("eager")(fn)
opt_fn(x)
```

Fixes [#1886](https://github.com/pytorch/torchdynamo/issues/1886)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89314
Approved by: https://github.com/anijain2305, https://github.com/eellison
2022-11-19 07:16:29 +00:00
419ef2cdcf Added utility to count memory reads/written in Inductor (#89203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89203
Approved by: https://github.com/jansel, https://github.com/ngimel
2022-11-19 04:18:26 +00:00
7a2930b357 add jvp test with non-contig inputs (#89131)
Ref: https://github.com/pytorch/functorch/issues/1029

We update `test_jvp` to do contiguous and non-contiguous testing in a single test.

Prev time for `test_jvp`: ~28s
New time for `test_jvp`: ~45s
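
A hedged illustration of a forward-mode JVP on a non-contiguous input, using the stable `torch.autograd.functional.jvp` API rather than the functorch test harness itself:

```python
import torch

def f(x):
    return (x * x).sum()

x = torch.randn(4, 3).t()      # transpose => non-contiguous view
v = torch.randn_like(x)
out, jvp_val = torch.autograd.functional.jvp(f, (x,), (v,))
# d/dt sum((x + t*v)**2) at t=0 equals sum(2*x*v)
assert torch.allclose(jvp_val, (2 * x * v).sum())
```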

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89131
Approved by: https://github.com/zou3519
2022-11-19 04:09:29 +00:00
631baecbcd Add --explain flag to bench (#89316)
TORCHDYNAMO_DYNAMIC_SHAPES=1 AOT_DYNAMIC_SHAPES=1 time python benchmarks/dynamo/torchbench.py  --accuracy --explain  --backend aot_eager --train --only BERT_pytorch

Dynamo produced 76 graphs with 75 graph break and 198 ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89316
Approved by: https://github.com/ezyang
2022-11-19 03:35:09 +00:00
e6996ea172 Don't redefine __STDC_FORMAT_MACROS (#89310)
Similar to https://github.com/pytorch/pytorch/pull/39608 and https://github.com/pytorch/pytorch/pull/6676

This causes a compile error in our internal build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89310
Approved by: https://github.com/kit1980
2022-11-19 02:24:21 +00:00
8c0515dbff cast C++ py-bound SymNode to SymInt correctly (#89295)
Unfortunately, it's a bit hard to test purely on the Pytorch core side, but it passes the XLA tests which are currently disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89295
Approved by: https://github.com/ezyang
2022-11-19 02:18:05 +00:00
2e72ec7982 Update sdp dispatch logic to enable fused backward (#89154)
# Summary
Reorganizes how the SDP dispatch logic is done in order to enable backward for the fused kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89154
Approved by: https://github.com/cpuhrsch
2022-11-19 02:06:27 +00:00
85a87e635c [dynamo] mutable local caching to make dynamo faster at tracing mutation (#89170)
Make mutation faster to speed up tracing optimizers; helps with https://github.com/pytorch/torchdynamo/issues/1803

`replace_all` no longer iterates over the entire variable tracker data structure  every time a mutation is performed

Each variable tracker internally keeps a set of contained mutable variable trackers, to provide a hint to `replace_all`. This is populated with a call to `apply` from `__post_init__` in the base `VariableTracker`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89170
Approved by: https://github.com/jansel
2022-11-19 01:47:48 +00:00
ea58955dda Move bazel to c++17 (#89297)
Splitting out various smaller pieces from https://github.com/pytorch/pytorch/pull/85969
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89297
Approved by: https://github.com/huydhn
2022-11-19 01:13:08 +00:00
cad5772c2c [dashboard][huggingface] skip accuracy checks for really large models… (#89273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89273
Approved by: https://github.com/desertfire
2022-11-19 00:22:45 +00:00
ee907375fa [small] Update error message (#89294)
Summary:
`RuntimeError: Invalid function argument. Expected parameter "tensor_list" to be of type List[torch.Tensor].`

to

`RuntimeError: Invalid function argument. Expected parameter "input_tensor_list" to be of type List[torch.Tensor].`

Test Plan: sandcastle

Differential Revision: D41405238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89294
Approved by: https://github.com/awgu
2022-11-19 00:21:14 +00:00
c3938bb97a [functorch] introduce an experimental map() op. (#88767)
Summary:
We want to introduce an experimental control flow op: map() to export some models as FX graphs correctly.

Some clarification on the basic requirements we have in mind:
1. This op can nest cond() and other control flow primitives internally.
2. We don't necessarily need loop carried dependencies for the models we've seen.
3. This map() op can handle dynamically shaped tensor as input and return dynamically shaped output based on input shapes.
4. We should be able to pass through additional arguments to the loop body as extra arguments.

In this diff we introduce a new control flow op `map()` which has the following semantics:
```
def map(f: Callable, xs: Tensor, *args):
    # one possible implementation:
    return torch.stack([f(x, *args) for x in xs])
```

Test Plan:
pytest functorch/test_control_flow.py
CI

Differential Revision: D41165796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88767
Approved by: https://github.com/zou3519
2022-11-19 00:19:50 +00:00
94b5c807fd Detach fake tensors into val, so they aren't affected by metadata mutation (#89140)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89140
Approved by: https://github.com/bdhirsh
2022-11-19 00:08:14 +00:00
885f8a56d4 [BE] Print backtraces from coredumps (#89309)
By simply invoking `gdb python core -ex "bt" -ex "q"`

Test plan:
 See: [linux-focal-py3.7-gcc7 / test (default, 1, 2, linux.2xlarge)](https://github.com/pytorch/pytorch/actions/runs/3500498821/jobs/5863369649#step:14:39)
Not sure why multiprocessing tests SEGFAULT, but they do
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89309
Approved by: https://github.com/clee2000, https://github.com/huydhn
2022-11-18 23:44:57 +00:00
0e1fcc8aa8 [FX] Add type annotation to getitem node before split_module (#88510)
Summary: Some nodes lost their type annotations during `split_module`, causing the submodules to be un-scriptable. This is because the compiler always infers Tensor type, which is wrong for non-Tensor types. We attempt to infer the type annotation for `getitem` nodes to improve scriptability.

Test Plan:
```
buck2 test //caffe2/test:fx_experimental
```

Differential Revision: D41037819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88510
Approved by: https://github.com/xush6528
2022-11-18 23:19:14 +00:00
ecfb4e064c [Inductor CI] Use string format for cuda-arch-list input to prevent 8.0/9.0/10.0 etc from being interpreted as 8/9/10 (#89279)
Currently, or in the future, whenever we change cuda-arch-list to num.0, GitHub Actions or some other agent may pass just num to TORCH_CUDA_ARCH_LIST.

This num is not regex matched during cuda arch analysis phase. (here: c5fafb4e16/cmake/Modules_CUDA_fix/upstream/FindCUDA/select_compute_arch.cmake (L229))
Example failure: https://github.com/weiwangmeta/pytorch/actions/runs/3495656108/jobs/5852735299
  Unknown CUDA Architecture Name 8 in CUDA_SELECT_NVCC_ARCH_FLAGS
This change reminds us to use e.g. '8.0', '9.0', '10.0' etc instead of 8.0, 9.0, 10.0 as GHA or some other agent may erroneously truncate it to pure numbers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89279
Approved by: https://github.com/desertfire, https://github.com/atalman
2022-11-18 23:05:50 +00:00
7551136b81 Add NVTX markers that dump additional information for nvprim_nvfuser Dynamo graphs (#88259)
dump information on graphs that NVFuser JIT compiles:
- the markers show the list of ops, args, and inputs that make up the graph

also dumps information on FX nodes that are not touched by NVFuser:
- the markers show the op, name, and arg list of the node

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88259
Approved by: https://github.com/IvanYashchuk, https://github.com/jjsjann123, https://github.com/mruberry
2022-11-18 22:36:08 +00:00
35d5fc52f0 [Profiler] Don't raise SOFT_ASSERT in debug builds. (#89240)
Enough people are hitting this issue that we need to turn off hard failures until the firing rate is zero in steady state (as measured via Scuba logging).

Differential Revision: [D41382914](https://our.internmc.facebook.com/intern/diff/D41382914/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89240
Approved by: https://github.com/aaronenyeshi
2022-11-18 22:24:24 +00:00
bfffc8d8ef [DDP][Docs] Add warning that no_sync() should include forward (#89244)
The issue where the user only includes `loss.backward()` inside `no_sync()` but not the forward pass has arisen several times now. I think adding an explicit warning in the docs is worthwhile.
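
A hedged sketch of correct `no_sync()` usage with both forward and backward inside the context (process-group/DDP setup is omitted and the names are placeholders):

```python
def accumulate_then_sync(model, micro_batches):
    """`model` is assumed to be a DistributedDataParallel instance."""
    # All but the last micro-batch: run forward AND backward inside no_sync(),
    # so gradients accumulate locally without an all-reduce.
    with model.no_sync():
        for inputs in micro_batches[:-1]:
            model(inputs).sum().backward()
    # Last micro-batch outside the context: this backward all-reduces
    # the accumulated gradients.
    model(micro_batches[-1]).sum().backward()
```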

Rendered doc:
<img width="769" alt="Screen Shot 2022-11-17 at 9 21 32 PM" src="https://user-images.githubusercontent.com/31054793/202602005-22c000b7-1093-4eaf-ba66-9c929a66906b.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89244
Approved by: https://github.com/zhaojuanmao
2022-11-18 22:06:24 +00:00
304b5de1b0 Re-enable test_hf_bert_fsdp (#89223)
It looks like this failure was actually caused by https://github.com/pytorch/pytorch/pull/88629, see the revert message on that PR. It probably just looked like a flaky test on CI because of how quickly the PR was reverted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89223
Approved by: https://github.com/voznesenskym
2022-11-18 21:40:27 +00:00
ba605c3b04 Don't trace when we track_tensor_tree (#89139)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89139
Approved by: https://github.com/bdhirsh
2022-11-18 20:15:20 +00:00
e04dc35a6a Symintify obeys_layout_contract (#89138)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89138
Approved by: https://github.com/bdhirsh
2022-11-18 20:15:20 +00:00
837ca8f344 Remove --retry-all-errors from environment with old curl (#89298)
The version of curl on the `ubuntu-latest` box doesn't support the `--retry-all-errors` param and is breaking periodic builds

Example: https://github.com/pytorch/pytorch/actions/runs/3495466804/jobs/5852265880
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89298
Approved by: https://github.com/huydhn
2022-11-18 19:36:09 +00:00
ee2ce3fef6 Set make max load when building libtorch (#89237)
The nccl build still OOMs sometimes when using `$(MAKE)`:

```
virtual memory exhausted: Cannot allocate memory
Makefile:73: recipe for target '/var/lib/jenkins/cpp-build/caffe2/build/nccl/obj/collectives/device/devlink.o' failed
make[5]: *** [/var/lib/jenkins/cpp-build/caffe2/build/nccl/obj/collectives/device/devlink.o] Error 1
make[5]: Leaving directory '/var/lib/jenkins/workspace/third_party/nccl/nccl/src/collectives/device'
```

* https://github.com/pytorch/pytorch/actions/runs/3476485191/jobs/5811758058
* https://github.com/pytorch/pytorch/actions/runs/3422228421/jobs/5702153639

So we try to set the same limit here as when building with ninja.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89237
Approved by: https://github.com/malfet
2022-11-18 18:55:33 +00:00
7ec8a4d2a2 Vectorized horizontal flip implementation (#88989)
When we benchmarked image processing transforms in torchvision (tensor vs Pillow), we saw that horizontal flip on uint8 data `(3, X, X)` is 2-3x slower.

Because the output's first stride is negative, the implementation does a simple data copy using [`basic_loop`](8371bb8a3d/aten/src/ATen/native/cpu/Loops.h (L286)). In this PR, a vectorized path is added for the horizontal flip op for dtypes uint8, int, float32, long, and double, and the resulting speed-up reduces the gap between PIL and tensor ops.
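
For reference, the operation being benchmarked is simply a flip along the last (width) dimension; a minimal way to exercise the new path might be:

```python
import torch

img = torch.randint(0, 256, (3, 520, 520), dtype=torch.uint8)  # CHW uint8 image
flipped = torch.flip(img, dims=[-1])                            # horizontal flip
```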

```
CPU capability usage: AVX2

[----------------------------------------------------------------- Horizontal flip -----------------------------------------------------------------]
                                                 |  torch (1.14.0a0+git2ed1d29) PR  |    Pillow (9.3.0)   |  torch (1.14.0.dev20221116+cu116) nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------
      channels=3, size=256, dtype=torch.int64    |        101.307 (+-0.904)         |                     |             111.364 (+-0.328)
      channels=3, size=520, dtype=torch.int64    |        462.369 (+-2.184)         |                     |             505.602 (+-0.541)
      channels=3, size=712, dtype=torch.int64    |        1855.441 (+-6.528)        |                     |             1828.370 (+-8.600)

      channels=1, size=256, dtype=torch.int32    |         22.282 (+-0.130)         |   44.218 (+-0.936)  |              34.651 (+-0.162)
      channels=1, size=520, dtype=torch.int32    |         72.180 (+-0.076)         |  166.639 (+-1.180)  |             118.820 (+-0.210)
      channels=1, size=712, dtype=torch.int32    |        129.621 (+-0.649)         |  307.140 (+-2.221)  |             216.104 (+-0.793)

      channels=3, size=256, dtype=torch.uint8    |         51.685 (+-0.200)         |   44.171 (+-0.818)  |             361.611 (+-0.276)
      channels=3, size=520, dtype=torch.uint8    |        223.320 (+-0.726)         |  166.607 (+-2.256)  |             1462.012 (+-4.917)
      channels=3, size=712, dtype=torch.uint8    |        423.298 (+-1.156)         |  307.067 (+-1.999)  |             2738.481 (+-1.715)

      channels=1, size=256, dtype=torch.float32  |         22.281 (+-0.056)         |   44.149 (+-0.808)  |              35.316 (+-0.028)
      channels=1, size=520, dtype=torch.float32  |         72.268 (+-0.106)         |  166.631 (+-1.212)  |             119.504 (+-0.340)
      channels=1, size=712, dtype=torch.float32  |        129.777 (+-0.632)         |  307.078 (+-1.909)  |             216.987 (+-0.185)

      channels=1, size=256, dtype=torch.float16  |         32.789 (+-0.081)         |                     |              34.044 (+-0.039)
      channels=1, size=520, dtype=torch.float16  |        112.693 (+-0.478)         |                     |             117.445 (+-0.125)
      channels=1, size=712, dtype=torch.float16  |        203.644 (+-0.791)         |                     |             213.283 (+-0.397)

      channels=3, size=256, dtype=torch.float64  |        102.058 (+-0.333)         |                     |             108.404 (+-0.346)
      channels=3, size=520, dtype=torch.float64  |        473.139 (+-1.327)         |                     |             503.265 (+-0.365)
      channels=3, size=712, dtype=torch.float64  |        1854.489 (+-9.513)        |                     |             1844.345 (+-1.371)

      channels=1, size=256, dtype=torch.int16    |         11.927 (+-0.056)         |                     |              33.993 (+-0.037)
      channels=1, size=520, dtype=torch.int16    |         39.724 (+-0.148)         |                     |             117.577 (+-0.153)
      channels=1, size=712, dtype=torch.int16    |         68.264 (+-0.133)         |                     |             213.118 (+-0.157)

Times are in microseconds (us).

```

```
CPU capability usage: AVX512

[----------------------------------------------------------------- Horizontal flip ------------------------------------------------------------------]
                                                 |  torch (1.14.0a0+git2ed1d29) PR  |    Pillow (9.3.0)    |  torch (1.14.0.dev20221118+cu116) nightly
1 threads: -------------------------------------------------------------------------------------------------------------------------------------------
      channels=3, size=256, dtype=torch.int64    |        131.244 (+-1.954)         |                      |             135.649 (+-4.066)
      channels=3, size=520, dtype=torch.int64    |        522.032 (+-4.660)         |                      |             539.822 (+-10.420)
      channels=3, size=712, dtype=torch.int64    |       1041.111 (+-53.575)        |                      |            1322.411 (+-80.017)

      channels=1, size=256, dtype=torch.int32    |         10.108 (+-0.414)         |   49.164 (+-1.000)   |              34.606 (+-0.865)
      channels=1, size=520, dtype=torch.int32    |         93.218 (+-1.417)         |  191.985 (+-5.047)   |             133.664 (+-5.372)
      channels=1, size=712, dtype=torch.int32    |        167.919 (+-2.854)         |  353.574 (+-6.568)   |             246.162 (+-5.753)

      channels=3, size=256, dtype=torch.uint8    |         34.710 (+-0.541)         |   49.005 (+-0.923)   |             136.603 (+-2.339)
      channels=3, size=520, dtype=torch.uint8    |        154.873 (+-3.049)         |  191.729 (+-4.997)   |             534.329 (+-10.754)
      channels=3, size=712, dtype=torch.uint8    |        290.319 (+-4.819)         |  351.619 (+-6.978)   |             997.119 (+-33.086)

      channels=1, size=256, dtype=torch.float32  |         10.345 (+-0.338)         |   49.105 (+-0.942)   |              35.478 (+-0.733)
      channels=1, size=520, dtype=torch.float32  |         81.131 (+-5.281)         |  191.697 (+-4.555)   |             133.554 (+-4.193)
      channels=1, size=712, dtype=torch.float32  |        169.581 (+-3.476)         |  352.995 (+-10.792)  |             251.089 (+-7.485)

      channels=1, size=256, dtype=torch.float16  |         35.259 (+-0.612)         |                      |              35.154 (+-0.924)
      channels=1, size=520, dtype=torch.float16  |        132.407 (+-1.980)         |                      |             131.850 (+-5.611)
      channels=1, size=712, dtype=torch.float16  |        240.192 (+-5.479)         |                      |             239.555 (+-7.273)

      channels=3, size=256, dtype=torch.float64  |        129.649 (+-2.349)         |                      |             130.429 (+-6.240)
      channels=3, size=520, dtype=torch.float64  |        548.534 (+-5.179)         |                      |             622.568 (+-25.720)
      channels=3, size=712, dtype=torch.float64  |       1208.091 (+-77.095)        |                      |            1679.204 (+-316.292)

      channels=1, size=256, dtype=torch.int16    |         7.801 (+-0.115)          |                      |              34.517 (+-0.482)
      channels=1, size=520, dtype=torch.int16    |         36.010 (+-0.855)         |                      |             131.001 (+-1.686)
      channels=1, size=712, dtype=torch.int16    |         87.395 (+-1.355)         |                      |             237.731 (+-4.181)

Times are in microseconds (us).
```

[Source](https://gist.github.com/vfdev-5/c0421f54c8aed655b042dd1ce4cb621e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88989
Approved by: https://github.com/lezcano, https://github.com/datumbox, https://github.com/peterbell10, https://github.com/ngimel
2022-11-18 18:46:53 +00:00
81a4aeabdf [Dynamo] Support Tensor.nelement & torch.cuda.is_available (#89164)
Fix several errors in [7k github models](https://github.com/pytorch/torchdynamo/issues/1198).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89164
Approved by: https://github.com/soumith
2022-11-18 18:43:15 +00:00
8a419cbffb Added partial decomposition of conv_backward and grad_bias computation (#89128)
`convolution_backward` often just kicks off the `sum` as a separate kernel. Splitting it off in a decomp allows us to fuse it into other ops: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Convolution.cpp#L2150

Improves `convnext_base` from 373 img/s => 383 img/s

Not sure what other models use convolution with bias haha.
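
A hedged sketch of the grad_bias piece being split out: for an NCHW conv2d, the bias gradient is just a reduction of `grad_output` over every dimension except the channel one, which is exactly the kind of `sum` that can now fuse with neighboring ops.

```python
import torch

grad_output = torch.randn(8, 16, 32, 32)     # (N, C_out, H, W)
grad_bias = grad_output.sum(dim=(0, 2, 3))   # shape (C_out,), the conv bias gradient
```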
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89128
Approved by: https://github.com/ezyang
2022-11-18 17:33:17 +00:00
38ccd08f9b [quant][fx][be] Refactor replace observer with q/dq op code (#89247)
Summary:
This is a refactor to prepare for future extensions, no functionality changes

Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89247
Approved by: https://github.com/vkuzo, https://github.com/andrewor14
2022-11-18 17:29:36 +00:00
c219b55b5f Use standard __func__ macro in symbolic shape. (#89264)
Summary:
I saw the following issue only on Windows build in PR #88767:
```
RuntimeError: AttributeError: 'SymNode' object has no attribute 'torch::impl::PythonSymNodeImpl::ge'
```
It's only on Windows because we get the attributes of SymNode in C++ with
the `__FUNCTION__` macro, which is not in the C++ standard and therefore has platform-specific behavior.
In this case, MSVC will include a function's namespace and class name, which is not intended here.

Instead we should use `__func__`. see: https://en.cppreference.com/w/cpp/language/function#Function_definition

godbolt example to show the difference: https://godbolt.org/z/PGfvecxPx

Test Plan:
CI

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89264
Approved by: https://github.com/ezyang
2022-11-18 17:03:53 +00:00
12a97444c3 [xplat] remove -weak_framework (#89233)
Summary: The `-weak_framework` flag is no longer necessary, Buck will weakly link frameworks depending on the `target_sdk_version` of the binary being linked.

Test Plan:
Compare IG load commands before and after change with P553208168
```
load command difference in Instagram.app/Frameworks/InstagramXplatFramework.framework/InstagramXplatFramework
 --- /tmp/tmpvd97s2v0    2022-11-16 12:13:54.082910598 -0800
+++ /tmp/tmpj20r_4ca    2022-11-16 12:13:54.082910598 -0800
@@ -9,7 +9,7 @@
        /System/Library/Frameworks/CoreHaptics.framework/CoreHaptics (compatibility version 1.0.0, current version 1.0.0, weak)
        /System/Library/Frameworks/CoreImage.framework/CoreImage (compatibility version 1.0.0, current version 5.0.0)
        /System/Library/Frameworks/CoreLocation.framework/CoreLocation (compatibility version 1.0.0, current version 2780.0.17)
-       /System/Library/Frameworks/CoreML.framework/CoreML (compatibility version 1.0.0, current version 1.0.0, weak)
+       /System/Library/Frameworks/CoreML.framework/CoreML (compatibility version 1.0.0, current version 1.0.0)
        /System/Library/Frameworks/CoreMedia.framework/CoreMedia (compatibility version 1.0.0, current version 1.0.0)
        /System/Library/Frameworks/CoreServices.framework/CoreServices (compatibility version 1.0.0, current version 1226.0.0)
        /System/Library/Frameworks/CoreTelephony.framework/CoreTelephony (compatibility version 1.0.0, current version 0.0.0)
@@ -33,9 +33,9 @@
        /System/Library/Frameworks/Security.framework/Security (compatibility version 1.0.0, current version 60420.40.34)
        /System/Library/Frameworks/SystemConfiguration.framework/SystemConfiguration (compatibility version 1.0.0, current version 1241.40.2)
        /System/Library/Frameworks/UIKit.framework/UIKit (compatibility version 1.0.0, current version 6109.1.108)
-       /System/Library/Frameworks/UserNotifications.framework/UserNotifications (compatibility version 1.0.0, current version 1.0.0, weak)
+       /System/Library/Frameworks/UserNotifications.framework/UserNotifications (compatibility version 1.0.0, current version 1.0.0)
        /System/Library/Frameworks/VideoToolbox.framework/VideoToolbox (compatibility version 1.0.0, current version 1.0.0)
-       /System/Library/Frameworks/WebKit.framework/WebKit (compatibility version 1.0.0, current version 614.2.9, weak)
+       /System/Library/Frameworks/WebKit.framework/WebKit (compatibility version 1.0.0, current version 614.2.9)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.0.0)
        /usr/lib/libbz2.1.0.dylib (compatibility version 1.0.0, current version 1.0.8)
        /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1300.32.0)
```
Both these changes are correct: WebKit is available from iOS 8.0, UserNotifications from iOS 10.0, and CoreML from iOS 11.0. Instagram has a deployment target of 12.4.

Reviewed By: ebgraham

Differential Revision: D41348639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89233
Approved by: https://github.com/malfet
2022-11-18 16:30:53 +00:00
19e66fcec2 [Quant] Allow setting fixed qparams for inner LSTM ops (#88456)
Summary: In both eager and FX graph mode quantization,
`torch.ao.nn.quantizable.LSTM` is used as an observed custom module,
which is responsible for inserting its own observers. By default,
the user specifies a single QConfig for the custom module (either
through QConfigMapping or by setting the "qconfig" attribute"),
and all inner ops will [inherit this
QConfig](dc00bb51b8/torch/ao/nn/quantizable/modules/rnn.py (L366-L378))
and use the same observer/fake_quantize constructors.

Today, users who wish to override this behavior must extend
`torch.ao.nn.quantizable.LSTM` and write a lot of custom code
to manually assign the QConfigs to the inner ops. This commit
alleviates this burden on the user by providing a helper function
to assign QConfigs with custom observers. An example use case of
this is providing a reference implementation for a backend kernel
that hardcodes qparams for efficiency.

Example usage:
```
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.fx.custom_config import (
    PrepareCustomConfig,
    ConvertCustomConfig,
)

class MyModel(torch.nn.Module):
    ...

class UserLSTM(torch.ao.nn.quantizable.LSTM):
    @classmethod
    def from_float(cls, other):
        assert isinstance(other, cls._FLOAT_MODULE)
        linear_output_obs_ctr = FixedQParamsObserver.with_args(
            scale=2 ** -11, zero_point=2 ** 15, dtype=torch.qint32)
        sigmoid_obs_ctr = FixedQParamsObserver.with_args(
            scale=2 ** -16, zero_point=0, dtype=torch.qint32)
        tanh_obs_ctr = FixedQParamsObserver.with_args(
            scale=2 ** -15, zero_point=2 ** 15, dtype=torch.qint32)
        cell_state_obs_ctr = FixedQParamsObserver.with_args(
            scale=2 ** -11, zero_point=0, dtype=torch.qint32)
        hidden_state_obs_ctr = FixedQParamsObserver.with_args(
            scale=2 ** -7, zero_point=2 ** 7, dtype=torch.quint8)
        return torch.ao.quantization.utils._get_lstm_with_individually_observed_parts(
            float_lstm=other,
            linear_output_obs_ctr=linear_output_obs_ctr,
            sigmoid_obs_ctr=sigmoid_obs_ctr,
            tanh_obs_ctr=tanh_obs_ctr,
            cell_state_obs_ctr=cell_state_obs_ctr,
            hidden_state_obs_ctr=hidden_state_obs_ctr,
        )

qconfig_mapping = get_default_qconfig_mapping()
example_inputs = (torch.rand(5, 3, 50), torch.rand(1, 3, 50), torch.randn(1, 3, 50))
prepare_custom_config = PrepareCustomConfig() \
    .set_float_to_observed_mapping(torch.nn.LSTM, UserLSTM)
convert_custom_config = ConvertCustomConfig() \
    .set_observed_to_quantized_mapping(UserLSTM, torch.ao.nn.quantized.LSTM)
model = MyModel()
model = prepare_fx(model, qconfig_mapping, example_inputs, prepare_custom_config=prepare_custom_config)
model(*example_inputs)  # calibrate
model = convert_fx(model, convert_custom_config=convert_custom_config)
model(*example_inputs)
```

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_static_lstm_with_custom_fixed_qparams

Reviewers: jerryzh168, vkuzo

Subscribers: jerryzh168, vkuzo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88456
Approved by: https://github.com/jerryzh168, https://github.com/vkuzo
2022-11-18 16:27:12 +00:00
19fcb80551 [inductor] Skip DALLE2_pytorch in torchbench (#89288)
Summary: DALLE2_pytorch fails in eager as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89288
Approved by: https://github.com/Krovatkin
2022-11-18 16:21:17 +00:00
1f7c0ff6e7 [inductor] Temporarily disable functorch_dp_cifar10 test in TorchBench (#89281)
Summary: The failure wasn't caught because of a land race. Skip the test
for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89281
Approved by: https://github.com/Krovatkin
2022-11-18 16:07:44 +00:00
55e55d95ea Update torch.distributed.DistBackendError type (#89235)
Summary: Update torch.distributed.DistBackendError type based on https://fb.workplace.com/groups/pyreqa/posts/5753993921357059

Test Plan:
Pyre tests should pass?

let sandcastle run

Reviewed By: markkm

Differential Revision: D41384130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89235
Approved by: https://github.com/awgu
2022-11-18 15:27:15 +00:00
154e58c032 Add most in-place references/decompositions (#88117)
We add most in-place references in a generic way. We also implement a
wrapper to implement the annoying interface that `nn.functional`
nonlinearities have.
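
A hedged sketch of what such a generic in-place wrapper can look like (`_make_inplace` is a hypothetical helper name, not necessarily the one used in this PR):

```python
import torch

def _make_inplace(fn):
    # Derive an in-place reference from the out-of-place one by computing the
    # result and copying it back into the first argument.
    def inplace(a, *args, **kwargs):
        return a.copy_(fn(a, *args, **kwargs))
    inplace.__name__ = fn.__name__ + "_"
    return inplace

relu_ = _make_inplace(torch.relu)
x = torch.randn(4)
relu_(x)   # x is now clamped to >= 0, in place
```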

We fix along the way a couple decompositions for some non-linearities by
extending the arguments that the references have.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88117
Approved by: https://github.com/mruberry
2022-11-18 14:59:46 +00:00
6741443c7c Simplify maybe_resize_out (#88116)
The previous behaviour would call `resize_` on 0-sized elements even
when their size was correct. This would make some tests fail, as `resize_`
may be an in-place operation and is not supported by some subsystems.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88116
Approved by: https://github.com/mruberry
2022-11-18 14:59:45 +00:00
ce0e22a81a Fix names of some reference functions (#88115)
The `__name__` field of some binary reference functions was wrong. We
fix this to be consistent with unary reference functions. In the future,
we should probably make the binary reference wrapper return a wrapper
itself to avoid all those calls to `partial`.

This change helps performing some homogeneous treatment of functions by
their name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88115
Approved by: https://github.com/mruberry
2022-11-18 14:59:43 +00:00
2e358cc98f Add platform markers for linux only extra_install_requires (#88826)
Fixes #88049

https://github.com/pytorch/pytorch/pull/85097 added new extra dependencies on `nvidia-*`. They are linux (GPU) only packages, but were not marked as such, causing issues installing pytorch 1.13 via Poetry (and possibly other tools that follow PyPI's metadata API) on non-Linux systems. This "fixes" the issue by adding the `; platform_system = 'Linux'` marker on these dependencies, but the main problem of different metadata for different wheels is a [somewhat larger issue](https://github.com/pytorch/pytorch/issues/88049#issuecomment-1302555269).

https://github.com/pytorch/pytorch/pull/85097 used `;` as a delimiter for splitting the different deps, but that is the delimiter used in markers, so I changed it to split on `|`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88826
Approved by: https://github.com/neersighted, https://github.com/lalmei, https://github.com/malfet
2022-11-18 14:09:21 +00:00
5654fed23e Export c10/[macros|util] headers to be used by internal inductor builds (#89249)
Summary: Fixes package boundary violation that existed in previous implementation

Test Plan: CI

Differential Revision: D41391862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89249
Approved by: https://github.com/izaitsevfb
2022-11-18 10:51:07 +00:00
4c6724985d [PT-D][Checkpoint] Update import and update docstring for distributed checkpoint (#89256)
Update test import and docstring as we have moved distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (https://github.com/pytorch/pytorch/pull/88698).

Test: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89256
Approved by: https://github.com/fduwjj
2022-11-18 09:49:39 +00:00
2dcacc6b99 [LTC] Upstream short_metrics (#89186)
Summary:
This pull request upstreams pytorch/xla#4148.

Test Plan:
xla CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89186
Approved by: https://github.com/JackCaoG
2022-11-18 09:28:48 +00:00
c5fafb4e16 [ao] maintain BC for is_activation_post_process (#89260)
Summary: Tests are failing because code packaged with trained models calls the now-defunct function name (is_activation_post_process).

This diff maintains BC temporarily until the cached code can be refreshed.

Test Plan: no functional change

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89260
Approved by: https://github.com/jerryzh168
2022-11-18 07:58:51 +00:00
30c3e5afb0 Disable tracing zero_grad() (#88731)
Tracing through zero grad is slow, and doesn't provide any benefits.

Helps https://github.com/pytorch/torchdynamo/issues/1803

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88731
Approved by: https://github.com/anijain2305
2022-11-18 07:46:38 +00:00
afdc48f843 Gate CUDA-only inductor tests by HAS_CUDA (#89251)
This is to prevent these tests from running on platforms where CUDA doesn't exist, such as macOS. They are also quite flaky (https://hud.pytorch.org/failure/test_linear_permute_fusion_cpu), failing the CI from time to time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89251
Approved by: https://github.com/soumith, https://github.com/desertfire
2022-11-18 07:39:18 +00:00
6a964c16e5 [flaky] relax tolerance conv1d_vs_scipy (#89193)
Fixes https://github.com/pytorch/pytorch/issues/89087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89193
Approved by: https://github.com/kit1980
2022-11-18 07:31:10 +00:00
fc1c0cd3ef Add support trace on MPS backend (#87910)
Fixes [#87221](https://github.com/pytorch/pytorch/issues/87221)
`trace` now supported on MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87910
Approved by: https://github.com/kulinseth, https://github.com/malfet
2022-11-18 07:24:33 +00:00
7beb151889 [xnnpack][executorch] remove unordered_set from xnn_compiler (#89231)
Removing unordered_set from XNNCompiler for ExecuTorch.

While some STL libraries are unavoidable, and I think it should be OK for the delegate to pull in these libraries, unordered_set wasn't really needed here, and we should be serializing the number of external ids anyway.

After this, the backend classes should be good to hg copy into executorch

Differential Revision: [D41227391](https://our.internmc.facebook.com/intern/diff/D41227391/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89231
Approved by: https://github.com/salilsdesai, https://github.com/cccclai
2022-11-18 07:07:19 +00:00
ab75982d3a Always retry curl downloads (#89157)
Modify our curl commands so that they always retry downloads.

By default, curl only retries what it considers to be "transient" errors, based on the server's response. However, curl's estimate of what's transient is very conservative.  By adding the --retry-all-errors parameter we'll always retry curl commands.

In particular, I'm hoping this mitigates errors where curl fails with the below error ([logs](https://github.com/pytorch/pytorch/actions/runs/3468758110/jobs/5794939941))
`curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to ossci-linux.s3.amazonaws.com:443`

Some of the modified downloads didn't even have retries, so I added them in

More details: https://everything.curl.dev/usingcurl/downloads/retry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89157
Approved by: https://github.com/kit1980, https://github.com/malfet
2022-11-18 07:03:24 +00:00
3bc78295c2 Fix consistentcy of histc on CPU and CUDA (#87832)
Fixes #87657

The main reason why `histc` returns slightly different outputs is the difference in how the bin position is calculated.
The CPU calculates it as 449778a939/aten/src/ATen/native/cpu/HistogramKernel.cpp (L168-L170),
which is basically `(i - a) / (b - a) * N`, while the CUDA code 449778a939/aten/src/ATen/native/cuda/SummaryOps.cu (L41)
computes `(i - a) * N / (b - a)`.

For some cases like in #87657 the order of arithmetic operations matters due to the floating point round-off.
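
A hedged numeric sketch of how the two evaluation orders can disagree (the range, bin count, and sample count below are arbitrary; the snippet merely searches for inputs whose computed bin differs between the two orderings):

```python
import numpy as np

a, b, nbins = np.float32(0.0), np.float32(3.0), np.float32(10)
xs = np.random.default_rng(0).uniform(0.0, 3.0, 1_000_000).astype(np.float32)

cpu_bins = ((xs - a) / (b - a) * nbins).astype(np.int64)    # divide, then scale (CPU order)
cuda_bins = ((xs - a) * nbins / (b - a)).astype(np.int64)   # scale, then divide (CUDA order)
print("inputs whose bin differs:", int((cpu_bins != cuda_bins).sum()))
```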

________________

Not sure where would be the most appropriate place to put the unit test. Hope `test_reductions::test_histc` will do.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87832
Approved by: https://github.com/soumith
2022-11-18 05:08:47 +00:00
f1fb586bc6 Symintify repeat_interleave.self_int (#89111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89111
Approved by: https://github.com/ezyang
2022-11-18 05:04:02 +00:00
ba5e39e106 Fix tol for test_nvfuser_correctness__softmax_backward_data_cuda (#89178)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89178
Approved by: https://github.com/kit1980
2022-11-18 05:03:51 +00:00
6f609dd0e0 docs: conv2d padding attribute- add int option (#85004)
`padding: int` already exists but isn't mentioned in the generated docs
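
A minimal example of the integer form (standard `nn.Conv2d` usage):

```python
import torch.nn as nn

# A single int applies the same padding to both spatial dimensions,
# i.e. padding=1 here is equivalent to padding=(1, 1).
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
```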

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85004
Approved by: https://github.com/albanD, https://github.com/kit1980
2022-11-18 04:29:02 +00:00
6f4f69f54d [Executorch] [Quantization] New pattern for dynamic dequant (#89236)
Summary: The op exposed should be qparams, and since we have concerns about prims not being supported, we make q and dq ops that take in tensors.

Test Plan: unit test

Differential Revision: D41382580

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89236
Approved by: https://github.com/jerryzh168
2022-11-18 04:13:05 +00:00
f4efc5e821 [quant][be] Move some helper functions to the top level to reduce function length (#89246)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89246
Approved by: https://github.com/vkuzo
2022-11-18 04:05:27 +00:00
6ed14c7dcf [vision hash update] update the pinned vision hash (#89102)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89102
Approved by: https://github.com/pytorchbot
2022-11-18 03:45:56 +00:00
3c2676de3d [LTC] Restore GetPythonFrames (#89122)
Summary:
pytorch/pytorch@936e930 deleted the registration of GetPythonFramesFunction. Restore it and add a test case to prevent regression.

Test Plan:
python test/lazy/test_debug_util.py

Fixes pytorch/xla#4206.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89122
Approved by: https://github.com/JackCaoG
2022-11-18 03:37:14 +00:00
65bcd1f880 Add previously deleted circleci readme back to repo (#85598)
This readme was deleted here: https://github.com/pytorch/pytorch/pull/73224. I chatted with the author, who doesn't remember exactly why it was deleted but suspects it was either because its contents were out of date or because of the upcoming migration to GitHub Actions.

With that said, we have references to this readme through our circleci directory, and since we do still have a lot of circleci workflows I feel this readme still adds a lot of value. (I recently did some CI tasks that required me to dig this readme up in order to solve a problem).

I recommend we restore this file with a warning that its contents may be out of date, until our CircleCI workflows are entirely migrated to GitHub Actions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85598
Approved by: https://github.com/clee2000, https://github.com/malfet
2022-11-18 03:17:37 +00:00
92f9214a31 add -Wnarrowing as error to cmake builds (#89207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89207
Approved by: https://github.com/wconstab, https://github.com/malfet
2022-11-18 03:16:18 +00:00
fd0efb01a7 [MPS] Support for median with dim (#88807)
## Summary 

**Aim**: Add support for aten::median for MPS backend (Fixes #87220)

This is fresh clean PR from the previous [PR](https://github.com/pytorch/pytorch/pull/88554)

- Implementing the new median function in aten/src/ATen/native/mps/operations/ReduceOps.mm
- Adding it to aten/src/ATen/native/native_functions.yaml
- Adding it to existing test_median

### **It works like this** 🪶
Median of the entire input tensor on MPS:
`torch.median(mps_inputTensor)`
Median along a dim:
`torch.median(mps_inputTensor, dim=[int], keepdim=[Bool])`
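A minimal usage sketch (guarded so it only runs where an MPS build is available):

```python
import torch

# Minimal usage sketch; requires a PyTorch build with MPS support.
if torch.backends.mps.is_available():
    x = torch.randn(4, 5, device="mps")
    print(torch.median(x))                    # median of the entire tensor
    values, indices = torch.median(x, dim=1)  # median along a dim
    print(values, indices)
```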
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88807
Approved by: https://github.com/kulinseth
2022-11-18 02:53:42 +00:00
9fd00f194a Fix the kineto daemon build condition (#89174)
If we're not building the lite interpreter we shouldn't be disabling Kineto. This eliminates a step from https://github.com/facebookincubator/dynolog/blob/main/docs/pytorch_profiler.md
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89174
Approved by: https://github.com/kimishpatel, https://github.com/malfet
2022-11-18 02:42:45 +00:00
b652fbc57a Fix torch.nn.functional.gelu docstring formatting (#89061)
The docstring of `torch.nn.functional.gelu` is formatted incorrectly, so that part of the math isn't rendered and there are extra blocks where there shouldn't be any: https://pytorch.org/docs/stable/generated/torch.nn.functional.gelu.html

I didn't build the docs, so I am not 100% sure that I got the formatting right, but I am confident.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89061
Approved by: https://github.com/bdhirsh, https://github.com/kit1980
2022-11-18 01:57:41 +00:00
177621a0b2 Use pytest-flakefinder to rerun tests multiple times (#89106)
Per title. The way re-run is handled in https://github.com/pytorch/pytorch/pull/88646 only applies to unittest.

### Testing

* https://github.com/pytorch/pytorch/actions/runs/3484930558
* https://github.com/pytorch/pytorch/actions/runs/3484930319

Manually download the test report artifacts and verify that pytest test_ops is called multiple times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89106
Approved by: https://github.com/clee2000
2022-11-18 00:11:44 +00:00
57e05e822d Issue 68576 prefetch factor (#88972)
Fixes #68576
This PR allows setting `prefetch_factor=None`, making it truly optional, as described in the documentation.
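A small usage sketch of what this enables (illustrative only):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative sketch: after this change, prefetch_factor may be left as None
# (e.g. with num_workers=0) instead of being forced to an integer value.
ds = TensorDataset(torch.arange(10, dtype=torch.float32))
loader = DataLoader(ds, batch_size=2, num_workers=0, prefetch_factor=None)
for (batch,) in loader:
    print(batch)
```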
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88972
Approved by: https://github.com/kit1980
2022-11-18 00:10:50 +00:00
2b3ac879a7 feat: adding view_copy_batch_rule and opinfo for view_copy (#88150)
Adds view_copy to vmap dispatch and adds an OpInfo for it.

part of https://github.com/pytorch/functorch/issues/825

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88150
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2022-11-17 23:36:18 +00:00
31b10e7d40 Enable inductor CI for TorchBench (#87465)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87465
Approved by: https://github.com/malfet
2022-11-17 23:16:21 +00:00
3d8a853a87 [DataPipe] Add container template for _Fork and _Demux (#89216)
- This removes the hard-coded check within `_ChildDataPipe`.
- Add `get_length_by_instance` to the parent class so that child DataPipes can have different lengths.
- Prevent an error when `__del__` executes after the object has already been removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89216
Approved by: https://github.com/NivekT
2022-11-17 23:06:41 +00:00
e2229a89b0 Fix typo in aten/src/README.md (#89175)
remove redundant "have to"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89175
Approved by: https://github.com/kit1980
2022-11-17 22:28:23 +00:00
a695fcf201 Add tests for replicate multiple modules (#89099)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89099
Approved by: https://github.com/zhaojuanmao
2022-11-17 22:27:15 +00:00
767f6aa49f [JIT][Security] Do not blindly eval input string (#89189)
Introduce an `_eval_no_call` method that evaluates a statement only if it
does not contain any calls (determined by examining the bytecode), thus preventing a command-injection exploit.

Added a simple unit test to check that
`torch.jit.annotations.get_signature` does not result in calling arbitrary
code.

This code path exists only for Python-2 compatibility, though, and perhaps
should simply be removed.

Fixes https://github.com/pytorch/pytorch/issues/88868
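A minimal sketch of the bytecode-inspection idea (simplified and with a hypothetical name; see the actual implementation in torch.jit.annotations for details):

```python
import dis

def eval_no_call_sketch(stmt, glob=None, loc=None):
    # Simplified sketch of the approach: compile the statement, scan its
    # bytecode for call instructions, and only evaluate it if none are found.
    code = compile(stmt, "<string>", mode="eval")
    for insn in dis.get_instructions(code):
        if "CALL" in insn.opname:
            raise RuntimeError(f"statement is not allowed to contain calls: {stmt!r}")
    return eval(code, glob, loc)

print(eval_no_call_sketch("1 + 2"))        # 3
# eval_no_call_sketch("__import__('os')")  # raises RuntimeError
```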

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89189
Approved by: https://github.com/suo
2022-11-17 22:05:30 +00:00
fbbf368745 Fix distributed test paths when running periodic multigpu job (#89225)
Some distributed tests were moved to a new location by https://github.com/pytorch/pytorch/pull/88698. This is currently failing the periodic multigpu job:

* https://github.com/pytorch/pytorch/actions/runs/3484486207/jobs/5829301159
* https://github.com/pytorch/pytorch/actions/runs/3484486207/jobs/5829301093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89225
Approved by: https://github.com/clee2000
2022-11-17 21:33:59 +00:00
f057a45faf reland "support running test_mobile_profiler with buck1/buck2 and OSS (#89001)" (#89091)
We modify this to no longer use std::experimental::filesystem::path
and use our own custom type instead.

This reverts commit c53a5ac6cca7e2e7d7c47b1a816c7eaa2e7a7704.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89091
Approved by: https://github.com/r-barnes, https://github.com/malfet
2022-11-17 21:04:23 +00:00
e856a4d66b Add an env var to skip cudnn version compatibility check (#89184)
skip the check by setting `PYTORCH_SKIP_CUDNN_COMPATIBILITY_CHECK=1`
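Usage note (the variable just needs to be set in the process environment; shown here via os.environ purely as an illustration):

```python
import os

# Set before torch initializes cuDNN, e.g. at the very top of the script.
os.environ["PYTORCH_SKIP_CUDNN_COMPATIBILITY_CHECK"] = "1"

import torch  # noqa: E402
```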

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89184
Approved by: https://github.com/ngimel
2022-11-17 20:10:52 +00:00
04169c5b6e Rewrite assert statement with torch._assert under config (#88246)
This diff rewrites Python assert statements with torch._assert under a config flag. The resulting graph looks something like:
```
SOURCE CODE:
def f(x):
      assert x[0] == 3
      return x.cos()

CAPTURED GRAPH:
graph():
    %arg0 : [#users=2] = placeholder[target=arg0]
    %getitem : [#users=1] = call_function[target=operator.getitem](args = (%arg0, 0), kwargs = {})
    %eq : [#users=1] = call_function[target=operator.eq](args = (%getitem, 3), kwargs = {})
    %_assert : [#users=0] = call_function[target=torch._assert](args = (%eq, "assertion_error"), kwargs = {})
    %cos : [#users=1] = call_method[target=cos](args = (%arg0,), kwargs = {})
    return cos
 ```
Note that this introduces a side effect, as it could error out while executing the graph, but the assertion can be eliminated via DCE if we choose to ignore it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88246
Approved by: https://github.com/jansel
2022-11-17 19:49:31 +00:00
af448e84eb Fix bug in dynamo dashboard summary stats diff (#89226)
Fixes issue where a suite may not be present in one of the logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89226
Approved by: https://github.com/anijain2305
2022-11-17 19:20:49 +00:00
706f791a19 Revert "Support masked_fill (#88736)"
This reverts commit 2b131b1d43b10a2a005f3f042f920a62501e4e2d.

Reverted https://github.com/pytorch/pytorch/pull/88736 on behalf of https://github.com/kit1980 due to Inductor tests are failing with AttributeError: module 'torch._inductor.codecache' has no attribute 'valid_vec_isa_list'
2022-11-17 18:27:08 +00:00
8e4c9828f4 Revert "Reland "Towards unifying symbolic and non symbolic fake tensor (#89038)" (#89143)"
This reverts commit e686b8c3ba93cb7caa314c78bf84dbd2d7df9683.

Reverted https://github.com/pytorch/pytorch/pull/89143 on behalf of https://github.com/ZainRizvi due to This seems to be causing the test_make_fx_symbolic_exhaustive_rad2deg_cpu_float32 and test_make_fx_symbolic_exhaustive_inplace_rad2deg_cpu_float32 test to fail across multiple jobs
2022-11-17 17:02:36 +00:00
cd81a700ec Fix buffer overflow from AddressSanitizer checks due to inaccurate bfloat16 representation of large integer (#89210)
Fixes #88939

The root cause of the issue is that BF16 cannot accurately represent large integer values. In the test case below, `539`, one of the corner pixel indices, is wrongly represented as `540` (from fc60a1865e/aten/src/ATen/native/UpSample.h (L271)), and the access then goes out of range with this index. Thanks to @malfet for the investigation and initial fix. I also reported an issue, https://github.com/pytorch/pytorch/issues/89212, to track the inaccurate integer representation of bf16 that needs to be addressed in other places in PyTorch.
```python
import torch

def test():
    arg_1 = torch.rand([1, 10, 540, 540], dtype=torch.bfloat16).clone()
    res = torch.nn.functional.interpolate(arg_1,2,mode='bilinear',align_corners=True)

test()
```
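A small illustration of the underlying representation problem (separate from the repro above):

```python
import torch

# bfloat16 has only an 8-bit significand, so the integer 539 is not exactly
# representable and rounds to 540 -- which is how an index can run past the
# end of a 540-element dimension.
x = torch.tensor(539.0, dtype=torch.bfloat16)
print(x.item())                                          # 540.0
print(torch.tensor(539.0).to(torch.bfloat16) == 540.0)   # tensor(True)
```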

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89210
Approved by: https://github.com/malfet
2022-11-17 16:43:16 +00:00
2b131b1d43 Support masked_fill (#88736)
Support `masked_fill` to address the GPT2 performance issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88736
Approved by: https://github.com/jansel, https://github.com/jgong5
2022-11-17 15:18:29 +00:00
e686b8c3ba Reland "Towards unifying symbolic and non symbolic fake tensor (#89038)" (#89143)
This reverts commit cf6003f0469ae1440d4a8585860c2c5f4c738707.

Differential Revision: [D41363992](https://our.internmc.facebook.com/intern/diff/D41363992)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89143
Approved by: https://github.com/albanD
2022-11-17 13:55:06 +00:00
bdc9911575 Fix typo in dist_util.py (#89167)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89167
Approved by: https://github.com/davidberard98
2022-11-17 08:45:27 +00:00
3beccbc299 Add BFloat16 support and optimization for mish, hardtanh backward, and silu on CPU (#82460)
### Description
* Add BFloat16 support for mish and hardtanh backward on CPU.
* Optimize the performance of silu.

### Testing

- optimize the performance for silu: bfloat16

single socket (28 cores):
```
before: 1x128x1024  forward 0.090 s  backward  0.218 s
        10x128x1024 forward 0.146 s  backward  0.314 s

after:  1x128x1024   forward  0.064 s backward  0.100 s
        10x128x1024  forward  0.085 s backward  0.133 s
```
single core:
```
before: 1x128x1024   forward 0.300 s  backward  0.606 s
        10x128x1024  forward 2.825 s  backward  5.834 s

after:  1x128x1024   forward 0.156 s backward   0.239 s
        10x128x1024  forward 1.447 s backward   2.165 s
```

- Add BFloat16 support for mish and backward of hardtanh on CPU.

single socket (20 cores):
op | shape | fp32 forward (s) | fp32 backward (s) | bf16 forward (s) | bf16 backward (s)
-- | -- | -- | -- | -- | --
silu | [10, 128, 10, 10] | 4.41E-05 | 7.67E-05 | 5.32E-05 | 9.38E-05
  | [10, 128, 80, 80] | 0.0008 | 0.001788 | 0.00067 | 0.001031
mish | [10, 128, 10, 10] | 0.000356 | 0.000427 | 0.000367 | 0.000436
  | [10, 128, 80, 80] | 0.004527 | 0.005807 | 0.004757 | 0.005393
hardtanh | [10, 128, 10, 10] | / | 3.97E-05 | / | 4.45E-05
  | [10, 128, 80, 80] | / | 0.001748 | / | 0.000645

single core:
op | shape | fp32 forward (s) | fp32 backward (s) | bf16 forward (s) | bf16 backward (s)
-- | -- | -- | -- | -- | --
silu | [10, 128, 10, 10] | 1.17E-04 | 1.91E-04 | 1.35E-04 | 2.23E-04
  | [10, 128, 80, 80] | 0.007434 | 0.013141 | 0.008464 | 0.013044
mish | [10, 128, 10, 10] | 0.00103 | 0.00122 | 0.00106 | 0.001227
  | [10, 128, 80, 80] | 0.065629 | 0.078418 | 0.067779 | 0.077214
hardtanh | [10, 128, 10, 10] | / | 1.18E-04 | / | 9.30E-05
  | [10, 128, 80, 80] | / | 0.010773 | / | 0.005834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82460
Approved by: https://github.com/mingfeima, https://github.com/malfet
2022-11-17 08:15:52 +00:00
37c85cf5f2 Add warning if tensor cores are not used (#88844)
Fixes https://github.com/pytorch/torchdynamo/issues/1839

Should I do this for all backends or just inductor?

## Test
On a V100 I got from AWS

```python
from torch._dynamo import optimize
import torch

def fn(x, y):
    a = torch.cos(x)
    b = torch.sin(y)
    return a + b

new_fn = optimize("inductor")(fn)

a = new_fn(torch.Tensor(1),torch.Tensor(1))
print(a)
```

## New logs
```
(sourcetorch) ubuntu@ip-172-31-31-152:~/test$ python test.py
/home/ubuntu/pytorch/torch/_dynamo/eval_frame.py:318: UserWarning: Tensor cores are available but not enabled. Consider setting torch.backends.cuda.matmul.allow_tf32 == True in your python script for speedups
  warnings.warn(
tensor([1.3717])
```
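For reference, the flags the warning points at can be flipped like this (standard PyTorch settings, shown here only as a usage note):

```python
import torch

# Opt in to TF32 matmuls/convolutions on Ampere+ GPUs, which silences the warning above.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```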

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88844
Approved by: https://github.com/ngimel, https://github.com/mlazos, https://github.com/anijain2305
2022-11-17 07:24:58 +00:00
b72f5b9ae3 [Dynamo] Support typing.Mapping & Support function as argument (#88963)
These missing features come from https://github.com/pytorch/benchmark/pull/1302, where we'd like to enable E2E hf_bert dynamo train/eval. The dependent [HuggingFace accelerate library](https://huggingface.co/docs/accelerate/index) requires these improvements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88963
Approved by: https://github.com/jansel
2022-11-17 06:57:42 +00:00
126e44173d [ONNX] Add onnx-script into ONNX docs (#89078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89078
Approved by: https://github.com/BowenBao
2022-11-17 06:27:17 +00:00
74610a1ced [dynamo][benchmarks] HF - Fix seq len and batch sizes (#89165)
Fixes many models in https://github.com/pytorch/torchdynamo/issues/1842
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89165
Approved by: https://github.com/ngimel
2022-11-17 06:14:24 +00:00
a41f70603a Round out rad2deg sparse support (#88442)
- Add sparse coo dispatch
- Modify backward to work with sparse compressed layouts
- Enable sparse_compressed autograd testing
- Correct layout support attributes on OpInfo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88442
Approved by: https://github.com/cpuhrsch
2022-11-17 06:00:23 +00:00
70fb673e51 Use software approach to catch overflow ( c10/utils/safe_numerics.h ) on ARM devices (#89042)
Fixes #89040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89042
Approved by: https://github.com/malfet
2022-11-17 05:55:28 +00:00
54fca6a9da Fix: prefer .is_none() over .is(py::none()) for pybind11 in caffe2 (#88199)
Follow-up to #88051. I noticed that I missed a few spots in the caffe2 folder. Prefer `.is_none()` over `.is(py::none())`, as `.is_none()` is more efficient since it avoids reference-counting increments and decrements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88199
Approved by: https://github.com/albanD, https://github.com/kit1980
2022-11-17 05:01:11 +00:00
4e1d19c5a5 Revert "Redefine the simdlen semantic: (#88482)"
This reverts commit fce6d6b3dcc879720bc45143426b86232106818a.

Reverted https://github.com/pytorch/pytorch/pull/88482 on behalf of https://github.com/kit1980 due to Broke multiple tests in several trunk workflows, for example https://github.com/pytorch/pytorch/actions/runs/3485086792/jobs/5830429554
2022-11-17 04:58:53 +00:00
81a8fdc40d [MPS] Add binary operations dtype precedence test case (#87545)
See https://github.com/pytorch/pytorch/pull/84742 and https://github.com/pytorch/pytorch/pull/78319.

The test case tests that
- for the binary operations (add, sub, mul, div),
- for all data types (dtypes),
- for a range of representative values and their combinations,
- for various shapes and ways of creating the test tensors,

the contents and dtype of the result tensor are identical for the MPS and CPU backends.

It adds about 15-18s runtime to `test_mps.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87545
Approved by: https://github.com/kit1980
2022-11-17 04:54:27 +00:00
44c9185f91 Fix empty input issue of convolution for channels last memory format (#86521)
Fixes the empty-input convolution issue: when the input is empty, e.g. with shape (0, 3, 3, 4), and the weight is in channels-last format, at::_unsafe_view will raise "view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead."

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86521
Approved by: https://github.com/jgong5, https://github.com/malfet
2022-11-17 04:47:45 +00:00
637e764ec5 [xnnpack][executorch] Pass xnnexecutor pointer to compileModel() (#89090)
Here we pass an XNNExecutor* to compileModel() so that the XNNExecutor can be allocated by the runtime. This is the signature change for executorch:

```
XNNExecutor compileModel(void* buffer) --> void compileModel(void* buffer, XNNExecutor* executor)
```

The intended use case for allocating the Executor and compiling the serialized flatbuffer:

```
XNNExecutor* executor = runtime_allocator->allocateList<jit::xnnpack::delegate::XNNExecutor>(1);
XNNCompiler::compileModel(processed.buffer, executor);

```

Differential Revision: [D41208387](https://our.internmc.facebook.com/intern/diff/D41208387/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89090
Approved by: https://github.com/digantdesai
2022-11-17 04:29:25 +00:00
24b9890f03 [torchrec] [composable] update ShardedEmbeddingBagCollection to use registered EBCs with shardedTensors as registered modules (#758) (#88026)
Summary:
X-link: https://github.com/pytorch/torchrec/pull/758

This PR fixes a bug in FSDP/DDP where ShardedTensors are not supported even if passed in as params to ignore.
This is important for composability because TorchRec named_parameters() will return FQNs of ShardedTensors (as defined in the goals).
It defines the device of a ShardedTensor to be None when local_tensor() does not exist on the rank.

update ShardedEmbeddingBagCollection to be composable according to https://docs.google.com/document/d/1TBJSd5zgEg6cRcXv3Okuj7bBkqQwGS2IPh4TLWNNzFI/edit

Differential Revision: D40458625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88026
Approved by: https://github.com/wanchaol, https://github.com/rohan-varma
2022-11-17 04:26:13 +00:00
1cd6ebe095 Fix typos in messages under torch (#89049)
This PR fixes typos in messages in `.py` files under the torch directory.
In `torch/onnx/symbolic_opset16.py` only, it also fixes a typo in a comment to make the operator name correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89049
Approved by: https://github.com/lezcano
2022-11-17 04:18:14 +00:00
d1f48f05ce [xnnpack][Bug Fix] Pass serialized model by reference (#89089)
Two changes
- Remove XNNCompiler's dependence on std::string by passing void*
- Grab ser_model by reference: this bug was causing data pointers given to xnn_runtime to be freed, because ser_model was on the stack.

Differential Revision: [D41208380](https://our.internmc.facebook.com/intern/diff/D41208380/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89089
Approved by: https://github.com/digantdesai
2022-11-17 04:17:23 +00:00
366f1b2c2f [xnnpack][lite-int] Freeze/Inline module to remove reference to self (#88863)
We need to inline the graph before converting from TorchScript to the XNNPACK flatbuffer. Remove the graph's dependence on self.

This will later help us work with constant data.

Differential Revision: [D41049858](https://our.internmc.facebook.com/intern/diff/D41049858/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88863
Approved by: https://github.com/digantdesai
2022-11-17 04:14:57 +00:00
1adb7b9b84 [nn][utils] Preserve requires_grad from original weight and bias in fuse conv/linear bn weights (#89100)
Summary:
att; previously we just called nn.Parameter, which has requires_grad=True by default. After
this PR we preserve requires_grad.

Test Plan:
python test/test_nn.py TestFusionUtils

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D41343694](https://our.internmc.facebook.com/intern/diff/D41343694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89100
Approved by: https://github.com/ngimel
2022-11-17 03:58:16 +00:00
a5f04e9a91 Fix typos in .md and .rst files (#88962)
This PR fixes typos `Github` in `.md` and `.rst` files.
`Github` -> `GitHub`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88962
Approved by: https://github.com/kit1980
2022-11-17 03:37:02 +00:00
573eaf1225 Analyze and upload disabled tests rerun to S3 (#89083)
Analyze and upload disabled-test rerun results to S3. Note that this only picks up `test-reports` from `rerun_disable_tests` workflows.

### Testing

Running the script manually `python -m tools.stats.check_disabled_tests --workflow-run-id 3473068035 --workflow-run-attempt 1 --repo pytorch/pytorch` and see the files successfully uploaded to s3://ossci-raw-job-status/rerun_disabled_tests/3473068035/1

Rockset collection created https://console.rockset.com/collections/details/commons.rerun_disabled_tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89083
Approved by: https://github.com/clee2000
2022-11-17 03:36:58 +00:00
fce6d6b3dc Redefine the simdlen semantic: (#88482)
This PR aims to automatically enable vectorization optimization for TorchInductor. It refines the semantics of `config.cpp.simdlen`.

Originally, `None` meant to disable vectorization, while a specific value meant the number of elements to be vectorized at a time. But that depends on the data type: regarding a 256-bit SVE/SIMD ISA for ARM and X86, `simdlen` should be 16 for Float while 32 for BFloat. Hence, this PR redefines `simdlen` as the bit width. The detailed semantics are as follows.

- **_simdlen = None_**: Automatically determine the SIMD bit width. Detect HW information and pick the proper vectorization ISA. Specifically for X86, AVX512 takes priority over AVX2.
- **_simdlen <= 1_**: Explicitly disable SIMD.
- **_simdlen > 1_**: Explicitly specify the SIMD bit width. It is equivalent to disabled if the bit width does not match the ISA width.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88482
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-17 03:27:54 +00:00
c3acb9c885 [ONNX] Add Internal Utils: onnx_proto_utils.py for onnx/onnx-script/onnx_proto (#88376)
Added `onnx_proto_utils.py` for onnx/onnx-script related processing. The idea is similar to jit_utils.py, and it simplifies what we have in `torch/onnx/utils.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88376
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2022-11-17 03:08:09 +00:00
f3af5ba48e [WIP] Composable API: replicate and DistributedState (#87649)
This PR adds the first version of the `replicate()` composable API. For this prototype version, I try to reuse as much code from existing `DistributedDataParallel` as possible, and iterate on it in later changes. The basic idea of this prototype is:
- create a `ReplicateState` object. It internally uses a `ParameterList` module to hold all parameters of modules marked by `replicate()` API.
- create an internal `_ddp` object, which reuses existing `DistributedDataParallel` implementation, and wraps the `ParameterList` object
- install pre-forward and after-forward hooks on the root module, which calls methods of `_ddp` to run initialization and forward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87649
Approved by: https://github.com/zhaojuanmao
2022-11-17 03:06:31 +00:00
f73d9a79fe [torch][fx] Fix PassManager to not use a class variable mutable list (#89108)
Summary:
I found a confusing bug in the PassManager that only happens
when you instantiate one multiple times: it will use old passes and
constraints!

This occurs because the class-level declarations initialize it to an empty list,
but class-level initializers only run once and create class variables. This means
the same empty list was being reused every time, except that after the first time
it isn't empty.

The empty list has to be created fresh in `__init__` each time, or else it will be shared.
Note that this is the same type of bug as using an empty list as a default parameter, where
the same list object is reused rather than a new empty one being created each time.

The better way to do this is with either:
* An immutable default parameter like an empty tuple, that you create a new list from: `self.passes = list(passes)`
* Use None and then create the empty list inside `__init__`

I chose the latter as it's less likely to cause a behavior change due to the changed default.

Note that for immutable values like `False` and `1` this doesn't apply as you can't mutate that
value for everyone.
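A generic illustration of the pitfall (not the actual fx PassManager code; names are made up):

```python
# Generic illustration of the shared-class-attribute pitfall described above.
class BuggyManager:
    passes = []  # class-level list: shared by every instance

    def add_pass(self, p):
        self.passes.append(p)  # mutates the single shared list


class FixedManager:
    def __init__(self, passes=None):
        # a fresh list is created per instance inside __init__
        self.passes = list(passes) if passes is not None else []


m1, m2 = BuggyManager(), BuggyManager()
m1.add_pass("double")
print(m2.passes)  # ['double'] -- state leaked across instances

f1, f2 = FixedManager(), FixedManager()
f1.passes.append("double")
print(f2.passes)  # [] -- instances stay independent
```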

Test Plan:
Added a test to ensure that the pass state is not saved.
Without my change, this test would fail as it would run all of the `2 * x` passes first,
then all of the `3 * x` passes.

Differential Revision: D41327056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89108
Approved by: https://github.com/angelayi
2022-11-17 02:43:33 +00:00
ac0a6f381d [dtensor] disable op db tests for now (#89162)
context: https://github.com/pytorch/pytorch/issues/89160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89162
Approved by: https://github.com/fduwjj
2022-11-17 02:31:23 +00:00
30d9fb9157 [dynamo][reland] API Support for nn.Module (#89113)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89113
Approved by: https://github.com/ezyang
2022-11-17 02:03:48 +00:00
f5e2cb5249 Add comprehensive minifier tests (#88022)
Adds tests for https://github.com/pytorch/torchdynamo/issues/1241.

To run: `pytest test/dynamo/test_minifier.py`.

Actually runs minifier launcher script and repro scripts, rather than just checking for existence of the minifier launcher script.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88022
Approved by: https://github.com/mlazos, https://github.com/anijain2305
2022-11-17 02:02:29 +00:00
088f2fa567 Fix typos in messages under test (#89121)
This PR fixes typos of messages in `.cpp` and `.py` files under test directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89121
Approved by: https://github.com/mruberry, https://github.com/kit1980
2022-11-17 01:55:03 +00:00
716f70f19a Added conv constraint that infers layouts (#89031)
The core problem that we often have with contiguous/channels-last layouts and convolutions is that Inductor often doesn't do a great job of "preserving" the eager-mode layouts.

So, for example, we'll often have something like
```
a: channels-last
b = foo(a)
c = convolution(a)
```

In eager-mode, `a` would stay channels-last, and we would avoid two transpose copies (one into NHWC and one back into NCHW) within the convolution kernel.

However, Inductor currently sometimes loses the "correct" layout of `b` (not in this simple example, but in others). Then, not only will we do a transpose within `foo`, but we'll then immediately transpose it back to do the convolution (and then again once the convolution is done).

This is particularly egregious in `convnext_base`, where there's a lot of mixing of non-channels last tensors and channels-last tensors.

The solution in this PR is to constrain the inputs to `aten.convolution`/`aten.convolution_backward` to match the layouts from eager-mode. This ensures that we'll never do extra transposes *within* `aten.convolution`, which are particularly bad (since Inductor can't fuse them).
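A small eager-mode illustration of the layouts being discussed (a toy example, not Inductor internals):

```python
import torch
import torch.nn as nn

# Toy eager-mode example: a channels-last input flows through convolution
# without extra transposes, and the output stays channels-last.
a = torch.randn(8, 3, 32, 32).to(memory_format=torch.channels_last)
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1).to(memory_format=torch.channels_last)
c = conv(a)
print(a.is_contiguous(memory_format=torch.channels_last))  # True
print(c.is_contiguous(memory_format=torch.channels_last))  # True on backends that preserve channels-last
```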

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89031
Approved by: https://github.com/ngimel, https://github.com/jansel
2022-11-17 01:52:35 +00:00
251fdda77b Add pytest-flakefinder as a test dependency (#89103)
This is used to re-run tests multiple times to determine their flakiness status. The way re-runs are handled in https://github.com/pytorch/pytorch/pull/88646 only applies to unittest.

Per their documentation, `pytest-repeat` doesn't seem to work with `unittest.TestCase`, so trying https://github.com/dropbox/pytest-flakefinder instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89103
Approved by: https://github.com/clee2000
2022-11-17 01:45:50 +00:00
0d87a4fec8 Fix typo in Dispatcher.h (#89045)
Fix typo in Dispatcher.h: hamespace -> namespace
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89045
Approved by: https://github.com/bdhirsh, https://github.com/kit1980
2022-11-17 01:09:55 +00:00
80b6761863 Update README.md (#85534)
Our Jenkins builds are gone, so this badge is broken and should be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85534
Approved by: https://github.com/ngimel, https://github.com/kit1980
2022-11-17 01:06:15 +00:00
3af5cf4de1 doc(typo): memroy -> memory (#89126)
Minor typo in comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89126
Approved by: https://github.com/kit1980
2022-11-17 01:03:34 +00:00
cfd552547f Use the Python frame safely in _pythonCallstack (#88993)
Currently, the result of `PyEval_GetFrame()` is piped straight to `Py_INCREF`. However, `PyEval_GetFrame` [may return null](https://docs.python.org/3/c-api/reflection.html#c.PyEval_GetFrame), which seems to be the case sometimes, when calling `_pythonCallstack` from another thread. This is handled in the subsequent `while (nullptr != frame)` block, but `Py_INCREF`, called before it, [doesn't handle this case](https://docs.python.org/3/c-api/refcounting.html#c.Py_INCREF), so the program segfaults. The safe form of `Py_INCREF` is `Py_XINCREF`, so use that instead ([docs](https://docs.python.org/3/c-api/refcounting.html#c.Py_XINCREF)).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88993
Approved by: https://github.com/albanD
2022-11-17 00:59:15 +00:00
8506b305df handle scatter(Scalar) overload in inductor (#88894)
Relanding https://github.com/pytorch/pytorch/pull/88210

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88894
Approved by: https://github.com/desertfire
2022-11-17 00:38:47 +00:00
0c835e25bb Fix nightly build binary errors (#89153)
This is a pretty much self-explanatory issue.
Two typos in the generate-binary script caused workflows to be generated with invalid parameters:

1. .generated-linux-binary-libtorch-pre-cxx11-master.yml
2. .generated-macos-arm64-binary-wheel-nightly.yml
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89153
Approved by: https://github.com/malfet
2022-11-17 00:30:12 +00:00
98379a3949 [ONNX] Add onnx-script test cases (#86907)
The test cases for #86906
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86907
Approved by: https://github.com/BowenBao
2022-11-16 23:57:25 +00:00
f920bfaf2a Use torchrun for dynamo/distributed.py (#89149)
Mainly wanted to confirm torchrun works fine with dynamo/ddp,
but it is also a better system than manually launching processes.

Partially addresses issue #1779

New run commands
------------

single process:
python benchmarks/dynamo/distributed.py [args]

multi-gpu (e.g. 2 gpu on one host):
torchrun --nproc_per_node 2 benchmarks/dynamo/distributed.py [args]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89149
Approved by: https://github.com/aazzolini
2022-11-16 23:05:34 +00:00
8ba62bdff5 add test_c10d_spawn_ucc.py (#86508)
Initial PR to create UCC equivalent of https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_spawn_gloo.py and
https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_spawn_nccl.py. Currently only added common ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86508
Approved by: https://github.com/kwen2501
2022-11-16 22:50:11 +00:00
ec61951f07 Fix inaccuracy in nt constructor documentation + broken rendering (#89152)
Rendering was broken and the docstring seemed to be inaccurate.

![Screen Shot 2022-11-16 at 2 16 28 PM](https://user-images.githubusercontent.com/35276741/202273588-a2da5b7b-1a6d-46bb-a74e-c0de9a0fd064.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89152
Approved by: https://github.com/cpuhrsch
2022-11-16 22:32:46 +00:00
5848704ef8 Removed unecessary check in select_nested (#89150)
The implementation in #88585 should work for all dimensions. Removed the unnecessary check that constrained select to dims 0 and 1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89150
Approved by: https://github.com/cpuhrsch
2022-11-16 22:11:37 +00:00
ee1d375bf9 [FSDP] Add fast path for NO_SHARD clip_grad_norm_() (#89137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89137
Approved by: https://github.com/rohan-varma
2022-11-16 22:08:50 +00:00
e70f446a16 [Dynamo] Fix bug in NamedTupleVariable (#89110)
Fixes https://github.com/pytorch/torchdynamo/issues/1866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89110
Approved by: https://github.com/jansel
2022-11-16 21:59:31 +00:00
640af8d70a More dynamo dashboard improvements (#89155)
A number of dashboard improvements:
- Add accuracy failures to warnings section
- Add regression detection to all metrics (speedup, compile time, peak memory), not just accuracy
- Add testing flag to update-dashboard to prevent image/comment uploads
- Add section for comparing summary statistics (passrate, speedup) between 2 most recent reports
- Show names of reports for summary stats diff and regression detection sections
- Remove metric graphs from the comment (they can still be found in the generated text file)

Sample comment: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1317565972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89155
Approved by: https://github.com/anijain2305
2022-11-16 21:54:27 +00:00
305b9b1f0e Fix XLASymNode.str() no str() attribute error (#89093)
This fixes https://github.com/pytorch/xla/issues/4199
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89093
Approved by: https://github.com/ezyang
2022-11-16 21:54:20 +00:00
4908a12542 Reland "SymIntify convolution backend calculation (#89069)"" (#89142)
This reverts commit 90db86be108184a6c86c73e1b01012352c72e66b.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89142
Approved by: https://github.com/albanD, https://github.com/malfet
2022-11-16 21:41:47 +00:00
45c62a3377 [ao] making _is_activation_post_process private (#87520)
Summary: the same function existed in observer and quantize; consolidated into a
single function. Note the definitions were slightly different; I've
changed the definition to be maximally inclusive so that the name of the
function is more accurate.

Test Plan: python test/test_public_bindings.py
python test/test_quantization.py

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D40709276](https://our.internmc.facebook.com/intern/diff/D40709276)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87520
Approved by: https://github.com/jcaip
2022-11-16 21:31:57 +00:00
aee96bbf5a [PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (#88698)
Context in RFC: https://github.com/pytorch/pytorch/issues/86620

.rst file will be finalized in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88698
Approved by: https://github.com/wanchaol
2022-11-16 21:06:38 +00:00
6b521bbf35 Prevent module full_backward_hook from erroring in double backward (#88357)
Also clarifies documentation to say "execute if and only if gradients wrt outputs are computed" (previously, "execute every time gradients wrt inputs are computed")

See https://docs.google.com/document/d/1tFZKYdsSzRBJ7Di7SWt8X8fSg-E3eiUPwomMF10UyhM/edit for more details regarding the question: 'should module full_backward_hooks be called every time the gradients wrt module inputs are called, or should module full_backward_hooks only be called when the "backward for the module" have been computed?'

Fixes https://github.com/pytorch/pytorch/issues/88312

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88357
Approved by: https://github.com/albanD
2022-11-16 19:27:30 +00:00
0581331963 [ONNX] Document ONNX diagnostics (#88371)
Reference pages:
- Landing page: https://docs-preview.pytorch.org/88371/onnx_diagnostics.html
- Individual rule: https://docs-preview.pytorch.org/88371/generated/onnx_diagnostics_rules/POE0004%3Aoperator-supported-in-newer-opset-version.html

An initial PR to setup the document generation for ONNX diagnostics.
* Add document page for ONNX diagnostics.
* Add document generation for diagnostics rules from `rules.yaml`.
* Add dependency on `myst-parser` for markdown to rst parsing.

More content to be added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88371
Approved by: https://github.com/abock, https://github.com/justinchuby, https://github.com/malfet, https://github.com/kit1980
2022-11-16 19:21:46 +00:00
848e7240a1 [Dynamo] Add a dummy profiler to avoid activating real profiler (#88930)
See context at https://github.com/pytorch/torchdynamo/issues/1721#issuecomment-1312396059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88930
Approved by: https://github.com/jansel
2022-11-16 19:08:49 +00:00
61801799a0 [Quant][bc-breaking] Remove overwrite_output_observer (#88620)
Summary: When the BackendConfig was first introduced,
`overwrite_output_observer` and `overwrite_output_fake_quantize`
were added to ensure fixed qparams ops like `torch.nn.Sigmoid`
and `torch.nn.Tanh` used the correct observers and fake quantizes.
However, this is hacky because the BackendConfig should not set
the observer constructors themselves, but should instead specify
only requirements on the observers.

Later, https://github.com/pytorch/pytorch/pull/80184 added the
correct observers to `get_default_qconfig_mapping` along with
validation logic that throws an error if incorrect observers
were specified. With this change, we no longer need to overwrite
the observers from the BackendConfig, since we expect the user to
pass in the correct observers for these ops.

This commit removes these overwrite observer settings in the
BackendConfig. Instead, we represent the observer constraints for
fixed qparams ops through the existing DTypeWithConstraints
mechanism. Note that, however, to be consistent with other
DTypeWithConstraints checks, we no longer throw an error if an
incorrect observer is specified, but simply ignore the offending
QConfig and log a warning instead. This is the BC-breaking part
of the change.

BC-breaking notes:

```
from torch.ao.quantization.qconfig import default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx

model = ModelWithFixedQParamsOps()
qconfig_mapping = QConfigMapping().set_global(default_qconfig)
example_inputs = ...
prepare_fx(model, qconfig_mapping, example_inputs)
```

Before this commit, running the above leads to an exception
because the wrong observers are used for fixed qparams ops.
After this commit, the above will only encounter a warning,
and the fixed qparams ops will not be quantized. In both cases,
switching to `get_default_qconfig_mapping` will cause the
fixed qparams ops to be quantized.
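For completeness, the recommended form after this change looks roughly like this (a sketch mirroring the snippet above, with the same placeholder model and example inputs):

```python
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx

# Rough sketch: with the default QConfigMapping helper, the fixed qparams ops
# get the observers they require and are quantized as before.
model = ModelWithFixedQParamsOps()
qconfig_mapping = get_default_qconfig_mapping()
example_inputs = ...
prepared = prepare_fx(model, qconfig_mapping, example_inputs)
```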

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Reviewers: jerryzh168, vkuzo

Subscribers: jerryzh168, vkuzo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88620
Approved by: https://github.com/jerryzh168
2022-11-16 18:44:12 +00:00
a6ef2c7634 Support test-config filter logic for rocm (#89046)
The logic used by `mem_leak_check` https://github.com/pytorch/pytorch/pull/88373 is currently not applied to rocm, i.e. 06486cd008, because its workflows don't have the test-config filtering logic yet (linux, mac, and windows all have it already). In other words, rocm tests always run with mem leak check disabled at the moment. We want that, but we also want to run the tests with mem leak check enabled periodically, once per day. This PR closes that gap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89046
Approved by: https://github.com/clee2000
2022-11-16 18:25:38 +00:00
7b0adc290a Run tests from test/inductor in inductor CI job (#88957)
CUDA inductor tests are currently not run in CI because the only jobs
that have triton installed don't actually run these tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88957
Approved by: https://github.com/ngimel, https://github.com/seemethere
2022-11-16 17:54:13 +00:00
58ebf92cf0 Add bfloat16 support to torch.prod to align with torch.cumprod (#87205)
As per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87205
Approved by: https://github.com/mruberry
2022-11-16 17:46:54 +00:00
3320915303 Fix decomp for embedding_backward and simplify the decomposition of embedding_dense and embedding_dense_backward (#87204)
See the title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87204
Approved by: https://github.com/Chillee
2022-11-16 17:46:54 +00:00
e1ecf53d84 Simplify linspace decomp and increase its tolerance (#87203)
This is an interesting one

Since this is an operation that's intrinsically defined on the reals,
we should perform the ops on that dtype always, and just cast to
the desired dtype at the end. This simplifies the decomposition.

Now, I started looking at this one when I started seeing failures on a
test that's added in a later PR. What's going on here is that, by doing
an upcast to a higher dtype and then cast down to integers, sometimes
there's an off-by-one error. I think this is fine, as the decomposition
is more accurate than the original function, which goes in line with
the whole PrimTorch effort.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87203
Approved by: https://github.com/mruberry
2022-11-16 17:46:54 +00:00
d2d22d89d9 test_unary_ufuncs few tests enabled on rocm which are passing (#89007)
This PR enables tests that are skipped on ROCm from the test package test_unary_ufuncs.py::TestUnaryUfuncsCUDA


test_file | test_name | test_class
-- | -- | --
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_2_cuda_float16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_2_cuda_float32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_2_cuda_float64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_2_cuda_int16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_2_cuda_int32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_2_cuda_int64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_4_cuda_float16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_4_cuda_float32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_4_cuda_float64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_4_cuda_int16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_4_cuda_int32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_polygamma_polygamma_n_4_cuda_int64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_large_tan_cuda_float64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_bfloat16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_float16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_float32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_float64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_int16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_int32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_int64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_int8 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_atan_cuda_uint8 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_float16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_float32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_float64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_int16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_int32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_int64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_int8 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_2_cuda_uint8 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_float16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_float32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_float64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_int16 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_int32 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_int64 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_int8 | (__main__.TestUnaryUfuncsCUDA)
test_unary_ufuncs | test_reference_numerics_small_polygamma_polygamma_n_4_cuda_uint8 | (__main__.TestUnaryUfuncsCUDA)


Pull Request resolved: https://github.com/pytorch/pytorch/pull/89007
Approved by: https://github.com/mruberry
2022-11-16 17:42:26 +00:00
7f55db4fb0 add quantize_decomposed_dynamic to op lib (#88855)
Summary: Needed for dynamic quant reference pattern graphs.

Test Plan: added unittest

Differential Revision: D41205030

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88855
Approved by: https://github.com/jerryzh168
2022-11-16 16:59:36 +00:00
cf6003f046 Revert "Towards unifying symbolic and non symbolic fake tensor (#89038)"
This reverts commit 37d54239c7ea88fd9c98dcac3fcc9b98a6f9e9d1.

Reverted https://github.com/pytorch/pytorch/pull/89038 on behalf of https://github.com/ezyang due to executorch segfaults
2022-11-16 16:52:47 +00:00
fe276ea0f9 [UCC] Add pre & post processing for CPU collectives (#89030)
Summary: The CPU block in `collective_post` was missing pre & post processing. The reduce-scatter implementation expects use of the pre-processing callback to flatten the input tensors; however, the missing invocation meant garbage values were being passed.

Test Plan: Tested the reduce-scatter collective using PARAM

Reviewed By: eastzone

Differential Revision: D41291592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89030
Approved by: https://github.com/kingchc, https://github.com/kwen2501
2022-11-16 16:40:24 +00:00
90db86be10 Revert "SymIntify convolution backend calculation (#89069)"
This reverts commit 09ed8b67e24cfe29f3fa7b5dd28eaa7749229f12.

Reverted https://github.com/pytorch/pytorch/pull/89069 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2022-11-16 16:36:27 +00:00
cf4b4b1b06 Fix python types in pybind function signatures (#89115)
Fixes #88958

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89115
Approved by: https://github.com/ezyang
2022-11-16 16:30:56 +00:00
abe41aee77 [ONNX] Support custom Op with onnx-script local function (#86906)
Extend `register_custom_op` to support onnx-script local functions. The FunctionProto from onnx-script is represented as a custom op and inserted into the ModelProto for op execution.

NOTE: I did experiments on the >2GB case with a simple model that has large initializers:

```python
import torch

class Net(torch.nn.Module):
    def __init__(self, B, C):
        super().__init__()
        self.layer_norm = torch.nn.LayerNorm((B, C), eps=1e-3)
    def forward(self, x):
        return self.layer_norm(x)

N, B, C = 3, 25000, 25000
model = Net(B, C)
x = torch.randn(N, B, C)

torch.onnx.export(model, x, "large_model.onnx", opset_version=12)
```

And it turns out we won't get model_bytes > 2GB after the `_export_onnx` pybind C++ function, as we split initializers into external files in that function and serialize before returning the model bytes; a protobuf is not allowed to be larger than 2GB under any circumstances.

The test cases can be found in the next PR #86907 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86906
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2022-11-16 15:08:55 +00:00
9fe36a0214 [ONNX] Extra support for bernoulli export (#88655)
* Add opset 15 support for `bernoulli`.
* Add extra export options for different `bernoulli` cases: `x.bernoulli(p)` where `p` is a tensor or float.

Fixes #88299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88655
Approved by: https://github.com/BowenBao
2022-11-16 15:08:41 +00:00
37d54239c7 Towards unifying symbolic and non symbolic fake tensor (#89038)
Fake tensor behaves pretty differently depending on if you have
symbolic shapes or not.  This leads to bugs; for example, we
weren't getting correct convolution_backward strides because we
bypassed the correct stride logic in fake tensor on symbolic
shapes.

This PR attempts to unify the two codepaths.  I don't manage to
unify everything, but I get most of it.  The algorithm is delicate
and I'm still hosing down test failures.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89038
Approved by: https://github.com/anjali411
2022-11-16 14:02:43 +00:00
09ed8b67e2 SymIntify convolution backend calculation (#89069)
We will need this to implement a convolution meta function that
is SymInt aware.  I use templates so that regular convolution code
is not affected by the change.  No tests for symbolic ints directly; that will
come in a subsequent PR which also needs to refactor fake tensors.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89069
Approved by: https://github.com/SherlockNoMad
2022-11-16 14:02:43 +00:00
5e0c01330c SymIntArrayRef type caster (#89074)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89074
Approved by: https://github.com/SherlockNoMad
2022-11-16 14:02:39 +00:00
57af0c8245 Bug fix: make sure copy_impl doesn't read out of bounds (#88544)
Fixes #88543.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88544
Approved by: https://github.com/lezcano
2022-11-16 13:23:38 +00:00
dc40d3f93f Add meta impl for grid_sampler_2d_backward (#88745)
TODO: add an OpInfo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88745
Approved by: https://github.com/ezyang
2022-11-16 13:01:47 +00:00
5270122773 [Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#89118)
Summary:
Build fx-based linear/matmul/bmm + permute/transpose vertical fusion in Inductor

For an internal Ads model: **1.15x -> 1.36x speedup**

Test Plan: CI

Reviewed By: bertmaher, jansel, jianyuh

Differential Revision: D41071665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89118
Approved by: https://github.com/jianyuh
2022-11-16 10:37:30 +00:00
9d28775c1d Revert "Rewrite assert statement with torch._assert under config (#88246)"
This reverts commit 62ba15e10e875ce088dff26e872605ee70c8c04a.

Reverted https://github.com/pytorch/pytorch/pull/88246 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2022-11-16 09:45:49 +00:00
9d2f5a2784 [dynamo] Support if cond on NNModuleVariable (#89095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89095
Approved by: https://github.com/yanboliang, https://github.com/mlazos
2022-11-16 08:51:30 +00:00
f20b3f2e57 [dtensor] PART 8: move tensor parallel api and tests to core distributed (#88180)
This PR moves tensor/parallel folder and tests to torch.distributed.

part of https://github.com/pytorch/pytorch/issues/88838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88180
Approved by: https://github.com/aazzolini
2022-11-16 08:07:50 +00:00
0230e52b54 [dtensor] PART 7: move remaining DTensor tests to core distributed (#88179)
This PR moves the remaining tests, i.e. tensor_ops and op db tests, to core distributed

part of https://github.com/pytorch/pytorch/issues/88838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88179
Approved by: https://github.com/aazzolini
2022-11-16 08:07:49 +00:00
550a019fb8 [dtensor] PART 6: move DTensor op tests to core distributed (#88551)
This PR moves DTensor op tests to core distributed, including
prop_rule, pointwise op, matrix op tests, etc.

part of https://github.com/pytorch/pytorch/issues/88838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88551
Approved by: https://github.com/aazzolini
2022-11-16 08:07:48 +00:00
527c5bdb45 [dtensor] PART 5: move DTensor basic tests to core distributed (#88178)
This PR moves DTensor basic tests to torch.distributed, including
dtensor, device_mesh tests

part of https://github.com/pytorch/pytorch/issues/88838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88178
Approved by: https://github.com/fduwjj
2022-11-16 08:07:46 +00:00
1b88476320 [dtensor] PART 4: move remaining DTensor ops to core distributed (#88550)
This PR moves the view-related DTensor ops to core distributed;
tests will be added in follow-up PRs

part of https://github.com/pytorch/pytorch/issues/88838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88550
Approved by: https://github.com/fduwjj
2022-11-16 08:07:44 +00:00
2dcf0978a2 [dtensor] PART 3: move most DTensor ops to core distributed (#88177)
This PR moves most DTensor ops to torch.distributed._tensor. We will
add all tests in the following PRs.

part of https://github.com/pytorch/pytorch/issues/88838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88177
Approved by: https://github.com/fduwjj
2022-11-16 08:07:42 +00:00
4b945967de [dtensor] PART 2: move DTensor abstraction and APIs to core distributed (#88176)
This PR moves the core DTensor abstraction and high level APIs to
torch.distributed._tensor folder, which includes the following:
1. DTensor class
2. high level APIs (distribute_tensor/module)
3. dispatching logic
4. redistribute logic

part of https://github.com/pytorch/pytorch/issues/88838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88176
Approved by: https://github.com/fduwjj
2022-11-16 08:07:41 +00:00
370fc5cb42 [dtensor] PART 1: move DeviceMesh and placement to core distributed (#88549)
This PR creates the `torch.distributed._tensor` package and moves
DeviceMesh and PlacementTypes to it

part of https://github.com/pytorch/pytorch/issues/88838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88549
Approved by: https://github.com/fduwjj
2022-11-16 08:07:39 +00:00
59ba15f374 Upload CSV test reports from inductor (#89112)
Inductor test report artifacts are now on HUD, but their files are in CSV format instead of the default XML files from pytest or unittest that we expect, so this PR uploads both suffixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89112
Approved by: https://github.com/desertfire
2022-11-16 07:44:43 +00:00
7e66d1d6cd [Inductor] Support Shape Padding for aten.mm in Inductor (#89086)
Summary: Support shape padding for aten.mm in Inductor (originally from [#88709](https://github.com/pytorch/pytorch/pull/88709))

Differential Revision: D41315078

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89086
Approved by: https://github.com/jianyuh
2022-11-16 06:27:13 +00:00
e2f0648750 Add an option to include actual license terms to the output (#85624)
When building products using PyTorch, it is often required to display license terms for all dependencies.
The feature itself was implemented in #81500, but it seems there is no option to enable it.
This PR implements that option.

cc/ @mattip @rgommers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85624
Approved by: https://github.com/rgommers, https://github.com/seemethere
2022-11-16 05:07:53 +00:00
8ebbd5a89a Easier to understand event_dim computation (#81396)
Fixes #81254
Only easier to understand, not a real fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81396
Approved by: https://github.com/fritzo, https://github.com/kit1980
2022-11-16 04:38:32 +00:00
ce2f8700ba Symintify numel(), infer_size, prims.elementwise_meta (#88956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88956
Approved by: https://github.com/ezyang
2022-11-16 03:36:00 +00:00
b291c1213a Create native function for determining which implementation of SDP to call (#89029)
# Summary
Creates a callable native function that can determine which implementation of scaled dot product attention will get called. This allows us to re-order the runtime dispatch of SDP so that autograd can be enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89029
Approved by: https://github.com/cpuhrsch
2022-11-16 03:07:54 +00:00
397f100672 [FSDP] Test named_parameters() in forward (use_orig_params=True) (#89066)
This adds a unit test following the FSDP change in https://github.com/pytorch/pytorch/pull/88781.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89066
Approved by: https://github.com/fegin
2022-11-16 03:01:16 +00:00
46ba0150cb Increase slow grad check timeout (#89079)
Now that periodic jobs run under `mem_leak_check` mode with parallelization turned off, it's very easy for `linux-bionic-cuda11.6-py3-gcc7-slow-gradcheck / test` to time out because one of the shards is very close to the 4h mark.

* 2452e3f99a
* 35e668b5ce

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89079
Approved by: https://github.com/clee2000
2022-11-16 02:39:22 +00:00
9f0b2c73f3 Revert "[Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#88859)"
This reverts commit d60abe4b9521e235c0e9beb00cda0d6c5673f4e0.

Reverted https://github.com/pytorch/pytorch/pull/88859 on behalf of https://github.com/kit1980 due to Broke Mac OS testing, which was clearly shown in CI
2022-11-16 01:13:00 +00:00
d96dd8ff09 Add int64_t, SymInt overloads for all binary operators in C++ (#89063)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89063
Approved by: https://github.com/SherlockNoMad
2022-11-16 01:08:31 +00:00
431642111f Move ConvParams methods directly on struct (#89062)
This reduces boilerplate.  Also, I plan to add a template
parameter to ConvParams; without moving the methods onto the
struct, I would have to manually template every method.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89062
Approved by: https://github.com/SherlockNoMad
2022-11-16 01:08:31 +00:00
49f0be0762 Hide ConvParams struct from ConvUtils.h (#89059)
It isn't actually used outside of Convolution.cpp, so no reason
to publish it.  I intend to turn this into a template, so moving
it with the method definitions is very convenient.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89059
Approved by: https://github.com/SherlockNoMad
2022-11-16 01:08:27 +00:00
19cacecf34 Fix and Re-enable test_quantize_fx_lite_script_module.py (#88897)
Summary: After D35984526 (416899d1a9), ```torch.ao.quantization.quantize_fx.prepare_fx``` requires passing in  ```example_args```. This diff fixes the calls to ```prepare_fx``` in this test by adding in ```example_args``` as necessary.

Test Plan:
```
buck test caffe2/test:fx_quantization_lite
```

```
  ✓ ListingSuccess: caffe2/test:fx_quantization_lite : 3 tests discovered (39.689)
    ✓ Pass: caffe2/test:fx_quantization_lite - test_conv2d (mobile.test_quantize_fx_lite_script_module.TestLiteFuseFx) (44.451)
    ✓ Pass: caffe2/test:fx_quantization_lite - test_embedding (mobile.test_quantize_fx_lite_script_module.TestLiteFuseFx) (45.462)
    ✓ Pass: caffe2/test:fx_quantization_lite - test_submodule (mobile.test_quantize_fx_lite_script_module.TestLiteFuseFx) (45.933)
Summary
  Pass: 3
  ListingSuccess: 1
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/3096224827259146
```

Differential Revision: D41227335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88897
Approved by: https://github.com/dagitses
2022-11-16 00:56:12 +00:00
3bc327993f PyDispatcher integration with functorch (#88785)
This PR teaches PyDispatcher and PyOperator about functorch transforms.
It is important that PyDispatcher/PyOperator dispatch with functorch
transforms, because this is our plan for higher-order operators
(operators that accept functions as arguments). Examples of these
include:
- functorch transforms over the existing cond operator (control flow)
- autograd.Function support for functorch (which I am working towards),
- AOTDispatcher (should be a higher order operator)

Concretely, the problem with teaching PyDispatcher/PyOperator about
functorch is that the stack-based dispatching logic (DynamicLayerStack)
is hidden inside the fallbacks for two dispatch keys
(DynamicLayer{Front, Back}). PyDispatcher doesn't know about C++ boxed
fallbacks; our plan of record is that we need to reimplement
all of them in Python (but they can call helper functions in C++ to make our
lives easier).

Instead of exposing all of what DynamicLayer{Front, Back} do to python,
this PR takes the approach of re-implementing part of the stack-based
dispatching in Python. The motivation is that this is more sane and
follows what the "ideal" implementation of functorch would have been:
- each transform should be a "mode"
- there should be no TLS dispatch key set hackery. functorch needs to do
this hackery today to re-use VariableType implementations.

This PR:
- exposes the DynamicLayerStack to Python
- The DynamicLayerStack is a stack of Interpreters.
These get exposed to Python as well.
- Interpreters can run operations (Interpreter.process) or lower them to
the next interpreter in the stack (Interpreter.lower)
- To use a PyOperator with functorch transforms, a developer needs to
register a rule for each transform (vmap, grad, jvp, ...).
- The PyOperator API is NOT user-facing. Things like autograd.Function
support for functorch will end up going through the autograd.Function
API.

Question for reviewers:
- Does this design make sense?
- I'm trying to split up the "functorch support for autograd.Function"
work into logical pieces. Would it be better if I didn't? (the full
thing is a bit long - 1000-2000 LOC).

Test Plan:
- new tests that construct PyOperator and compose them with functorch
transforms
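
A rough, conceptual sketch of the stack-based dispatch idea described above. The names `PyOperator`, `Interpreter`, and the rule-registration shape here are illustrative only, not the actual torch internals: each transform is an interpreter on a stack, and an operator either runs a per-transform rule or lowers to the next interpreter.

```python
# Conceptual sketch only -- illustrative names, not the real PyDispatcher/functorch APIs.
class Interpreter:
    def __init__(self, kind):
        self.kind = kind                      # e.g. "vmap", "grad", "jvp"

class PyOperator:
    def __init__(self, base_impl):
        self.base_impl = base_impl
        self.rules = {}                       # transform kind -> rule

    def impl(self, kind, rule):
        self.rules[kind] = rule               # register a rule per transform

    def __call__(self, stack, *args):
        if not stack:                         # no transforms left: run the base implementation
            return self.base_impl(*args)
        top, rest = stack[-1], stack[:-1]
        rule = self.rules.get(top.kind)
        if rule is None:                      # no rule: "lower" to the next interpreter
            return self(rest, *args)
        lower = lambda *a: self(rest, *a)     # rules can re-dispatch below themselves
        return rule(lower, *args)

# usage: a toy op with a rule registered only for the "grad" interpreter
double = PyOperator(lambda x: x * 2)
double.impl("grad", lambda lower, x: lower(x) + 100)
print(double([Interpreter("vmap"), Interpreter("grad")], 3))  # 106
```
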
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88785
Approved by: https://github.com/samdow, https://github.com/soulitzer
2022-11-16 00:46:59 +00:00
2268a3215c [functorch] add switch to enable autograd.Function (#88784)
This is mostly a debug or "if you know what you're doing" switch for
now. It is not public API.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88784
Approved by: https://github.com/samdow, https://github.com/soulitzer
2022-11-16 00:46:59 +00:00
0ce22574b1 Revert "Enable correct supported activities for kineto on rocm (#88207)"
This reverts commit 35093fc1ab9749e6b763acead007e56b54c6375b.

Reverted https://github.com/pytorch/pytorch/pull/88207 on behalf of https://github.com/kit1980 due to Broke test_kineto on trunk / win-vs2019-cuda11.6-py3 / test (default, 4, 5, windows.8xlarge.nvidia.gpu)
2022-11-16 00:45:41 +00:00
a13433940c allow loading model from a path in torchbench (#89028)
Sometimes it's really convenient to run simple models through the torchbench.py script rather than those from pytorch/benchmark. This PR adds the ability to run any model from a specified path by overloading the --only argument.

This PR is split out from #88904

Here is the usage:

        Specify the path and class name of the model in format like:
        --only=path:<MODEL_FILE_PATH>,class:<CLASS_NAME>

        Due to the fact that dynamo changes current working directory,
        the path should be an absolute path.

        The class should have a method get_example_inputs to return the inputs
        for the model. An example looks like
        ```
        class LinearModel(nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = nn.Linear(10, 10)

            def forward(self, x):
                return self.linear(x)

            def get_example_inputs(self):
                return (torch.randn(2, 10),)
        ```

Test command:
```
# python benchmarks/dynamo/torchbench.py --performance --only=path:/pytorch/myscripts/model_collection.py,class:LinearModel --backend=eager
WARNING:common:torch.cuda.is_available() == False, using CPU
cpu  eval  LinearModel                        0.824x p=0.00
```

Content of model_collection.py
```
from torch import nn
import torch

class LinearModel(nn.Module):
    """
    AotAutogradStrategy.compile_fn ignore graph with at most 1 call nodes.
    Make sure this model calls 2 linear layers to avoid being skipped.
    """
    def __init__(self, nlayer=2):
        super().__init__()
        layers = []
        for _ in range(nlayer):
            layers.append(nn.Linear(10, 10))
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

    def get_example_inputs(self):
        return (torch.randn(2, 10),)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89028
Approved by: https://github.com/jansel
2022-11-16 00:29:08 +00:00
60ffeb9866 Don't iterate over graph when adding graph input (#89084)
helps with https://github.com/pytorch/torchdynamo/issues/1803

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89084
Approved by: https://github.com/jansel
2022-11-16 00:08:34 +00:00
ee05f47bdd Rebase and re-land thread PG (#88795)
The previous PR (https://github.com/pytorch/pytorch/pull/88627) has been reverted due to a failed check. After rebasing and rerun, all checks passed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88795
Approved by: https://github.com/huydhn, https://github.com/wanchaol
2022-11-15 21:58:58 +00:00
35093fc1ab Enable correct supported activities for kineto on rocm (#88207)
A compile-time guard was preventing ActivityType::CUDA from being available on ROCm. This caused both the GPU_FALLBACK and CUDA modes to be active at the same time, so operators were being charged GPU time for both the hipEventRecord ranges and the actual kernel execution times. This caused incorrect (and often negative) CUDA times in, e.g., table().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88207
Approved by: https://github.com/malfet, https://github.com/jeffdaily
2022-11-15 21:40:47 +00:00
d0130cd21e Enable test_ops for inductor (#88994)
Summary: skip several unsupported test cases
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88994
Approved by: https://github.com/Krovatkin
2022-11-15 21:40:36 +00:00
67af734ade skip test that is broken in head (#88759)
Test Plan: Rely on CI.

Differential Revision: D41156351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88759
Approved by: https://github.com/zou3519
2022-11-15 21:33:38 +00:00
175b7e1cde print xpass (#89020)
Print unexpected successes as XPASS. I will submit a PR to test-infra so that the log classifier can find these.

Ex: https://github.com/pytorch/pytorch/actions/runs/3466368885/jobs/5790424173
```
  test_import_hipify (__main__.TestHipify) ... ok (0.000s)
  test_check_onnx_broadcast (__main__.TestONNXUtils) ... ok (0.000s)
  test_prepare_onnx_paddings (__main__.TestONNXUtils) ... ok (0.000s)
  test_load_standalone (__main__.TestStandaloneCPPJIT) ... ok (16.512s)

======================================================================
XPASS [4.072s]: test_smoke (__main__.TestCollectEnv)
----------------------------------------------------------------------

----------------------------------------------------------------------
Ran 31 tests in 24.594s

FAILED (skipped=7, unexpected successes=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89020
Approved by: https://github.com/huydhn, https://github.com/seemethere
2022-11-15 21:27:14 +00:00
8dc3353b0b add to(dtype) support for all sparse compressed formats (#89055)
Fixes [#88419](https://github.com/pytorch/pytorch/issues/88419)
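
A minimal example of the added behavior (values are illustrative):

```python
import torch

dense = torch.randn(4, 4)
csr = dense.to_sparse_csr()        # a sparse compressed (CSR) tensor
csr64 = csr.to(torch.float64)      # dtype conversion now supported for compressed formats
print(csr64.dtype, csr64.layout)   # torch.float64 torch.sparse_csr
```
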

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89055
Approved by: https://github.com/cpuhrsch
2022-11-15 21:16:18 +00:00
da2afcb1e0 Add test for out-of-bounds Tensor access on GPU (#39211)
Since the CUDA context cannot recover safely from an on-device assert, use `torch.multiprocessing.spawn` to execute the method in another process and verify that it raises an unrecoverable error.

As those types of tests are pretty slow (6 seconds on a powerful Linux box with one GPU), run it only in the slow shard.

Closes https://github.com/pytorch/pytorch/issues/38944
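
A hedged sketch of the testing approach (not the actual test; the out-of-bounds indexing used to trigger the device-side assert is just one possible repro):

```python
import torch
import torch.multiprocessing as mp

def _oob_access(rank):
    t = torch.zeros(4, device="cuda")
    t[torch.tensor([10], device="cuda")] = 1.0   # out-of-bounds index -> device-side assert
    torch.cuda.synchronize()                      # surfaces the unrecoverable CUDA error

if __name__ == "__main__":
    # Run the faulting code in a child process so the parent's CUDA context stays usable.
    mp.spawn(_oob_access, nprocs=1)
```
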

Pull Request resolved: https://github.com/pytorch/pytorch/pull/39211
Approved by: https://github.com/ezyang
2022-11-15 21:06:02 +00:00
d47b94fa8e [inductor] Added bucketize to decomp table (#88348)
These are the benchmark results vs eager

```
[--------------------------- bucketize ----------------------------]
                                                 |  eager  |  decomp
32 threads: --------------------------------------------------------
      ((16384, 1024), (16,)), (True, True)       |    600  |    464
      ((16384, 1024), (16,)), (True, False)      |    542  |    464
      ((16384, 1024), (16,)), (False, True)      |    780  |    731
      ((16384, 1024), (16,)), (False, False)     |    777  |    731
      ((16384, 1024), (64,)), (True, True)       |    624  |    515
      ((16384, 1024), (64,)), (True, False)      |    603  |    515
      ((16384, 1024), (64,)), (False, True)      |    789  |    718
      ((16384, 1024), (64,)), (False, False)     |    786  |    718
      ((16384, 1024), (256,)), (True, True)      |    878  |    820
      ((16384, 1024), (256,)), (True, False)     |    891  |    830
      ((16384, 1024), (256,)), (False, True)     |    897  |    900
      ((16384, 1024), (256,)), (False, False)    |    900  |    900
      ((16384, 1024), (1024,)), (True, True)     |   2000  |   1890
      ((16384, 1024), (1024,)), (True, False)    |   1950  |   1892
      ((16384, 1024), (1024,)), (False, True)    |   1990  |   1962
      ((16384, 1024), (1024,)), (False, False)   |   1990  |   2060
      ((16384, 1024), (4096,)), (True, True)     |   3405  |   3155
      ((16384, 1024), (4096,)), (True, False)    |   3244  |   3154
      ((16384, 1024), (4096,)), (False, True)    |   3282  |   3219
      ((16384, 1024), (4096,)), (False, False)   |   3278  |   3220
      ((16384, 1024), (16384,)), (True, True)    |   4626  |   4672
      ((16384, 1024), (16384,)), (True, False)   |   4629  |   4671
      ((16384, 1024), (16384,)), (False, True)   |   4662  |   4829
      ((16384, 1024), (16384,)), (False, False)  |   4665  |   4824
```
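
For reference, a short usage sketch of the op whose decomposition is benchmarked above:

```python
import torch

boundaries = torch.tensor([1, 3, 5, 7, 9])
values = torch.tensor([[0, 4, 8], [2, 6, 10]])
print(torch.bucketize(values, boundaries))              # bucket index for each element
print(torch.bucketize(values, boundaries, right=True))  # ties on a boundary fall to the right
```
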

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88348
Approved by: https://github.com/ngimel
2022-11-15 21:03:28 +00:00
9262d18e1b [inductor] Introduce CSEVariable type and use it to track if Triton variables are scalar (#88347)
This fixes https://github.com/pytorch/torchdynamo/issues/1515

To fix it, we need to keep track of whether a Triton variable is a scalar (so that we do not use a mask when doing indirect loads through it). This requires a way of annotating variable names generated by CSE with properties.

So now CSE uses a CSEVariable class to keep track of variables and lets backends subclass it so they can annotate variables with whatever information they want. TritonCSEVariable is such a subclass that tracks the `is_scalar` property.
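
A conceptual sketch of that design (illustrative classes, not Inductor's actual implementation): the CSE layer hands out variable objects instead of bare strings, and a backend subclass attaches extra properties.

```python
# Illustrative sketch only -- not Inductor's real CSEVariable/TritonCSEVariable classes.
class CSEVariable:
    """A CSE-generated variable name that backends may annotate with properties."""
    def __init__(self, name):
        self.name = name
    def __str__(self):
        return self.name

class TritonCSEVariable(CSEVariable):
    def __init__(self, name, is_scalar=False):
        super().__init__(name)
        self.is_scalar = is_scalar  # a scalar must not get a mask on indirect loads

tmp = TritonCSEVariable("tmp0", is_scalar=True)
print(f"{tmp} is_scalar={tmp.is_scalar}")
```
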

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88347
Approved by: https://github.com/jgong5, https://github.com/ngimel
2022-11-15 20:52:37 +00:00
edd2dea859 [torch] [analytics] add dynamo to analytics (#88915)
Summary: as title.

Differential Revision: D41237602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88915
Approved by: https://github.com/jansel
2022-11-15 20:46:03 +00:00
3e2ba60ac0 [torch] [analytics] add pytorch event logger callsites to torch.save and torch.load (#89003)
Summary: as title.

Differential Revision: D41239419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89003
Approved by: https://github.com/ezyang, https://github.com/dzhulgakov
2022-11-15 20:36:16 +00:00
d8466964b3 Add range check to multi margin loss target (#89008)
Fixes https://github.com/pytorch/pytorch/issues/88724
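
A hypothetical repro sketch of the kind of input the new range check guards against (sizes are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(3, 5)                 # 5 classes
bad_target = torch.tensor([0, 2, 7])  # 7 is outside the valid range [0, 5)
F.multi_margin_loss(x, bad_target)    # should now fail with a clear range error
```
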

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89008
Approved by: https://github.com/ngimel
2022-11-15 20:35:51 +00:00
18c1f2f82e [torch] [analytics] add pytorch event logger callsites to transformers and encoder/decoders (#88896)
Differential Revision: D41227275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88896
Approved by: https://github.com/mikekgfb
2022-11-15 20:35:36 +00:00
ff6d2a6d1b Add mem efficient backward (#88856)
# Registers the derivative for mem efficient backward

- Use gradcheck to test correctness. The kernel is not implemented for fp64 so run checks with bumped tolerances in fp32
- I also made updates based off of Xformer main branch and flash-attention cutlass branch.
- This will enable the fused backward to be called for scaled dot product attention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88856
Approved by: https://github.com/cpuhrsch
2022-11-15 20:22:57 +00:00
d60abe4b95 [Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#88859)
Summary:
Build fx-based linear/matmul/bmm + permute/transpose vertical fusion in Inductor

For an internal Ads model: **1.15x -> 1.36x speedup**

Differential Revision: D41071665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88859
Approved by: https://github.com/jianyuh, https://github.com/jansel
2022-11-15 19:34:38 +00:00
f5df685090 Enable channels_last_3d on SyncBatchNorm (#88401)
This PR enabled the use of fast channels_last kernels on SyncBatchNorm with channels_last_3d memory format.

With a small benchmark script here https://github.com/pytorch/pytorch/issues/88021#issuecomment-1299059859, on V100, I got

master:
```
DDP channels_last=False, run_forward_backward, time: 0.8945400714874268 sec
DDP channels_last=True, run_forward_backward, time: 1.4736433029174805 sec
```

This PR:
```
DDP channels_last=False, run_forward_backward, time: 0.8927242755889893 sec
DDP channels_last=True, run_forward_backward, time: 0.48697471618652344 sec
```

This PR is a follow-up of https://github.com/pytorch/pytorch/pull/46906

Close https://github.com/pytorch/pytorch/issues/88021
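
A hedged sketch of opting into the channels_last_3d path (shapes are illustrative; the DDP/SyncBatchNorm training setup from the linked benchmark is assumed and omitted here):

```python
import torch

bn = torch.nn.BatchNorm3d(16).cuda()
sync_bn = torch.nn.SyncBatchNorm.convert_sync_batchnorm(bn)   # swap in SyncBatchNorm
sync_bn = sync_bn.to(memory_format=torch.channels_last_3d)
x = torch.randn(8, 16, 4, 32, 32, device="cuda").to(memory_format=torch.channels_last_3d)
# forward/backward would run inside DDP as in the benchmark script linked above
```
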
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88401
Approved by: https://github.com/ngimel
2022-11-15 19:25:53 +00:00
8023c9dc64 [Profiler] Memory profiler part 3: Schema parsing and mutable arguments (#86854)
The appropriate annotation for a block of memory is a function of time: an input can be mutated in-place to become an activation, a clever kernel might steal the memory of a detached input (such as a mask) to use as output memory, etc.

We could pessimistically assume that all ops mutate all of their inputs; however, inspecting schemas allows us to significantly narrow that assumption with minimal effort. Checking schemas also allows us to distinguish between dispatcher ops (which have load-bearing semantics) and user annotations with reasonably high precision.

Differential Revision: [D40220390](https://our.internmc.facebook.com/intern/diff/D40220390/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86854
Approved by: https://github.com/chaekit
2022-11-15 19:17:57 +00:00
2439bc1e9b [Profiler] Memory profiler part 2: Config validation (#86853)
Memory profiling requires `record_shapes`, `profile_memory`, and `with_stack`. This PR just adds a skeleton endpoint with a good error message if certain flags are missing.
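
A minimal sketch of a profiler session configured with the three flags the memory profiler needs:

```python
import torch
from torch.profiler import profile

with profile(record_shapes=True, profile_memory=True, with_stack=True) as prof:
    a = torch.randn(128, 128)
    b = a @ a
```
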

Differential Revision: [D39920801](https://our.internmc.facebook.com/intern/diff/D39920801/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86853
Approved by: https://github.com/chaekit
2022-11-15 19:17:57 +00:00
279dcce702 disable test that fails in fbcode (#88786)
Summary:
caffe2/test:torch_cuda - test_advanced_indexing_assignment_lazy (test_view_ops.TestViewOpsLAZY)
RuntimeError: TorchScript backend not yet supported in FBCODE/OVRSOURCE builds
  File "/usr/local/fbcode/platform010/lib/python3.8/unittest/suite.py", line 163, in _handleClassSetUp
    setUpClass()
  File "/re_cwd/fbcode/buck-out/opt/gen/caffe2/test/torch_cuda#binary,link-tree/torch/testing/_internal/common_device_type.py", line 506, in setUpClass
    torch._lazy.ts_backend.init()
  File "/re_cwd/fbcode/buck-out/opt/gen/caffe2/test/torch_cuda#binary,link-tree/torch/_lazy/ts_backend.py", line 6, in init
    torch._C._lazy_ts_backend._init()

Test Plan: Rely on CI.

Differential Revision: D41170545

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88786
Approved by: https://github.com/zou3519
2022-11-15 19:08:31 +00:00
1db0f735e8 [Profiler] Account for caching when assigning IDs (#88917)
The Python tracer caches information about module and optimizer state. That means that for subsequent calls, the presence of a Tensor in these fields does not imply that the Tensor is still live, just that it was live during the first call. (I should perhaps rename the fields to something like `stale_parameters` to convey this.) Unless we discard subsequent calls, ID assignment gets tripped up when it sees a Tensor that was already released.

Differential Revision: [D41226827](https://our.internmc.facebook.com/intern/diff/D41226827/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88917
Approved by: https://github.com/chaekit
2022-11-15 18:24:15 +00:00
ee4412381e Allow ROCm runners to have 2 or more gpus (#89011)
[This run](https://github.com/pytorch/pytorch/actions/runs/3432340660/jobs/5721731207) failed claiming that it couldn't detect GPUs on the runner. Inspecting the rocminfo output (higher up in the logs) shows that it in fact had three GPUs, but the workflow is currently set up to expect either 2 or 4 GPUs.

The workflow files currently have no way of specifying whether they'll get a 2-GPU or a 4-GPU machine, so really 2 is all any test can expect to get. [This old PR](https://github.com/pytorch/pytorch/pull/72142/files) shows that historically ROCm runners only had 4 GPUs; later the logic was extended to expect 2-GPU runners as well.

It's not clear how the ROCm runner ended up with 3 GPUs instead of 2 or 4 (something for ROCm folks to look into), but there doesn't seem to be a good reason for ROCm workflows to fail if 3 (or 5) GPUs ever show up on a machine. This PR makes the workflows resilient to ROCm runners having these alternate GPU counts.

Also filed https://github.com/pytorch/pytorch/issues/89012 against the ROCm team to explore why the runner only had 3 gpus

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89011
Approved by: https://github.com/huydhn
2022-11-15 17:55:29 +00:00
2819df9a19 [ROCm] Enable python ref executor UTs for ROCm (#88981)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88981
Approved by: https://github.com/mruberry
2022-11-15 17:49:00 +00:00
62ba15e10e Rewrite assert statement with torch._assert under config (#88246)
This diff rewrites assert statements in Python with torch._assert under a config flag. The resulting graph looks something like:
```
SOURCE CODE:
def f(x):
      assert x[0] == 3
      return x.cos()

CAPTURED GRAPH:
graph():
    %arg0 : [#users=2] = placeholder[target=arg0]
    %getitem : [#users=1] = call_function[target=operator.getitem](args = (%arg0, 0), kwargs = {})
    %eq : [#users=1] = call_function[target=operator.eq](args = (%getitem, 3), kwargs = {})
    %_assert : [#users=0] = call_function[target=torch._assert](args = (%eq, "assertion_error"), kwargs = {})
    %cos : [#users=1] = call_method[target=cos](args = (%arg0,), kwargs = {})
    return cos
 ```
Note that this introduces a side effect, as the graph could error out while executing, but the assertion can be eliminated via DCE if we choose to ignore it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88246
Approved by: https://github.com/jansel
2022-11-15 17:14:59 +00:00
b815f1fc50 Symintify view_as_complex and view_as_real (#89052)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #89052
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89052
Approved by: https://github.com/ezyang
2022-11-15 16:28:36 +00:00
b9029fc449 [ao] quant_type.py fixing public v private (#87519)
Summary: made _get_quant_type_to_str private

Test Plan: python test/test_public_bindings.py

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D40709282](https://our.internmc.facebook.com/intern/diff/D40709282)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87519
Approved by: https://github.com/jcaip
2022-11-15 15:42:31 +00:00
5faa2792fa Symintify decomps for split and upsample_bilinear; Fix decomp for _softmax_backward_data and native_dropout_backward (#88761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88761
Approved by: https://github.com/ezyang
2022-11-15 13:34:45 +00:00
63e16216d8 [c10d] Implement __instancecheck__ for c10d::ReduceOp (#88275)
Summary:
- Customize the metaclass of `torch.distributed.distributed_c10d.ReduceOp` for the sake of custom `__instancecheck__`
- Add `copy.copy`, `copy.deepcopy`, and `pickle` support with tests

Rel:
- #81272
- #84243
- #87191
- #87303
- #87555

Ref:
- https://github.com/pybind/pybind11/issues/2696
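
A short sketch of what the change enables (assumes a build with torch.distributed available):

```python
import copy
import pickle
import torch.distributed as dist

op = dist.ReduceOp.SUM
assert isinstance(op, dist.ReduceOp)   # satisfied via the custom __instancecheck__
op_copy = copy.deepcopy(op)            # copy/deepcopy now supported
op_bytes = pickle.dumps(op)            # pickling now supported
```
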

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88275
Approved by: https://github.com/wanchaol
2022-11-15 13:21:41 +00:00
2452e3f99a Update xnnpack graph schema to use xnode and xvalue (#89036)
There are different node definitions, like [Node in autograd](https://www.internalfb.com/code/fbsource/fbcode/caffe2/torch/csrc/autograd/function.h?lines=108-609&reveal=108-609), ONNX nodes, etc. A namespace can be used where nodes from different definitions are used together; however, it's still better to slightly differentiate the names.

Differential Revision: [D41002324](https://our.internmc.facebook.com/intern/diff/D41002324/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89036
Approved by: https://github.com/mcr229
2022-11-15 10:34:45 +00:00
8c46a5de3a Add debug handle to xnnpack schema (#89033)
As title, add three things to the schema
1. debug handle for each node
2. file identifier, so we can sanity-check that we are getting the xnnpack schema flatbuffers file instead of some other random binary
3. extension, so the dumped binary will end up with its own extension like `myschema.xnnpack` (maybe it can have a better name) instead of the default extension `.bin`

Differential Revision: [D40906970](https://our.internmc.facebook.com/intern/diff/D40906970/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89033
Approved by: https://github.com/mcr229
2022-11-15 09:49:54 +00:00
50c18217a3 Revert "Add mem efficient backward (#88856)"
This reverts commit 35e668b5ced25e735b6e523d557ed7fd60267914.

Reverted https://github.com/pytorch/pytorch/pull/88856 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2022-11-15 09:37:09 +00:00
5314af5383 Set correct size of attr::output_layouts when the graph has multiple outputs in JIT oneDNN fuser (#88496)
Bug:
Previously, `initOutputLayouts()` was called after creating a graph and before merging other nodes, producing a vector with one element. So when a graph contains multiple outputs (e.g. when using AOTAutograd compile, in my case), the layout_propagation pass tries to access out-of-range elements in the vector. This exposes a second bug in `useOpaqueLayout()`: the out-of-range check compares the index against the updated output size instead of the size of the vector, and then uses `[]` to access the element, which is out of range.

Fixes the above two issues:

1. Check that the offset is within range using the size of the `attr::output_layouts` vector instead of another variable. This check now catches the error.
2. Initialize `attr::output_layouts` after node merging. The graph may change during node merging, so we moved the initialization into layout_propagation, where the complete graph is available.

Added test time:
`Ran 1 test in 0.383s`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88496
Approved by: https://github.com/jgong5, https://github.com/sanchitintel
2022-11-15 07:29:55 +00:00
60e59c0755 Fix get_default_qat_qconfig for PT 1.13 (#88876)
See https://github.com/pytorch/pytorch/pull/84329/files#r1019916766 for more context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88876
Approved by: https://github.com/jgong5, https://github.com/vkuzo
2022-11-15 06:36:24 +00:00
5ed90c40f8 enable index_put test (#89019)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89019
Approved by: https://github.com/desertfire
2022-11-15 06:16:15 +00:00
68fd8f3706 [BE] [c10d][send] Improve error message on dist.send() with destination rank as itself (#89004)
This improves the error message on dist.send() and adds a corresponding test in test_c10d_common.py (https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_common.py).
Context in issue#83912: https://github.com/pytorch/pytorch/issues/83912
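
A minimal sketch of the misuse the improved message now explains (assumes a process group has already been initialized):

```python
import torch
import torch.distributed as dist

# assumes dist.init_process_group(...) has already run on each rank
rank = dist.get_rank()
t = torch.ones(4)
dist.send(t, dst=rank)   # sending to one's own rank now fails with a descriptive error
```
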

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89004
Approved by: https://github.com/H-Huang
2022-11-15 06:13:17 +00:00
21dd311077 Add a mode to rerun all disabled tests (without running anything else) (#88646)
Rerun all disabled tests to gather their latest results so that we can close disabled tickets automatically. When running under this mode (RERUN_DISABLED_TESTS=true), only disabled tests are run while the rest are skipped: `<skipped message="Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run" type="skip"/>`

The logic is roughly as follows, the test runs multiple times (n=50)

* If the disabled test passes, and it's flaky, do nothing because it's still flaky.  In the test report, we'll see the test passes with the following skipped message:
```
<testcase classname="TestMultiprocessing" file="test_multiprocessing.py" line="357" name="test_fs" time="0.000" timestamp="0001-01-01T00:00:00">
    <skipped message="{&quot;flaky&quot;: True, &quot;num_red&quot;: 4, &quot;num_green&quot;: 0, &quot;max_num_retries&quot;: 3, &quot;rerun_disabled_test&quot;: true}" type="skip"/>
</testcase>
```

* If the disabled test passes every single time, and it is not flaky anymore, mark it so that it can be closed later.  We will see the test runs and passes, i.e.
```
<testcase classname="TestCommonCUDA" name="test_out_warning_linalg_lu_factor_cuda" time="0.170" file="test_ops.py" />
```

* If the disabled test fails after all retries, this is also expected. So only report this but don't fail the job (because we don't care about red signals here), we'll see the test is skipped (without the `flaky` field), i.e.
```
<testcase classname="TestMultiprocessing" file="test_multiprocessing.py" line="357" name="test_fs" time="0.000" timestamp="0001-01-01T00:00:00">
    <skipped message="{&quot;num_red&quot;: 4, &quot;num_green&quot;: 0, &quot;max_num_retries&quot;: 3, &quot;rerun_disabled_test&quot;: true}" type="skip"/>
</testcase>
```

This runs at the same schedule as `mem_leak_check` (daily).  The change to update test stats, and (potentially) grouping on HUD will come in separated PRs.

### Testing

* pull https://github.com/pytorch/pytorch/actions/runs/3447434434
* trunk https://github.com/pytorch/pytorch/actions/runs/3447434928
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88646
Approved by: https://github.com/clee2000
2022-11-15 05:08:26 +00:00
73d71ae3d6 [WIP] Unwrap View in Reinterpret View (#89016)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89016
Approved by: https://github.com/ngimel
2022-11-15 04:40:13 +00:00
dd6beca854 Changing the use from ASSERT_EQ to ASSERT_FLOAT_EQ on nn_utils test. (#83693)
Change the use of ASSERT_EQ to ASSERT_FLOAT_EQ in nn_utils.cpp:ClipGradNorm, as this is the proper way to compare floating-point values for equality. This avoids the `test_api` ClipGradNorm test failing for WoA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83693
Approved by: https://github.com/ngimel, https://github.com/kit1980
2022-11-15 04:10:52 +00:00
ce8a45c282 [vision hash update] update the pinned vision hash (#89026)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89026
Approved by: https://github.com/pytorchbot
2022-11-15 03:32:03 +00:00
55b88cde0a [Inductor] Build Shape Padding in Inductor (#88709)
Summary: Build shape padding for matmul/bmm/addmm in Inductor

Differential Revision: D41071282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88709
Approved by: https://github.com/bertmaher, https://github.com/Chillee
2022-11-15 03:10:36 +00:00
cbdb683dc8 Add test that bias gradient is properly tested in same_two_models (#88995)
See
https://github.com/pytorch/pytorch/pull/88629#issuecomment-1313850324
for why this got broken.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88995
Approved by: https://github.com/albanD
2022-11-15 02:55:43 +00:00
45d2daaf85 Fix lookup file update in dashboard (#89024)
Lookup file should be updated before graphs are generated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89024
Approved by: https://github.com/mlazos, https://github.com/anijain2305
2022-11-15 02:32:55 +00:00
1f88b208ac Fix cuda/cpu check on NoneType (Unit test) (#88970)
Summary: Fix cuda/cpu check on NoneType (unit test)

Test Plan: Sandcastle / GitHub CI/CD

Differential Revision: D41208798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88970
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2022-11-15 01:25:19 +00:00
35e668b5ce Add mem efficient backward (#88856)
# Registers the derivative for mem efficient backward

- Use gradcheck to test correctness. The kernel is not implemented for fp64 so run checks with bumped tolerances in fp32
- I also made updates based off of Xformer main branch and flash-attention cutlass branch.
- This will enable the fused backward to be called for scaled dot product attention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88856
Approved by: https://github.com/cpuhrsch
2022-11-15 01:10:35 +00:00
f3462833bd Use same retry logic as macos binary builds (#89014)
Occasionally the command to download sccache via curl fails with network errors (example below). The default curl retry option only retries errors that are considered "transient", but the set of actually transient errors is greater than what curl considers to be transient.

This PR modifies the retry logic for downloading sccache to match what's in https://github.com/pytorch/pytorch/blob/master/.github/templates/macos_binary_build_workflow.yml.j2#L79-L89, using the retry action to ensure we retry all transient errors and adding a longer retry delay to give the transient issue time to resolve itself.

Example failure from [this run](https://github.com/pytorch/pytorch/actions/runs/3422664884/jobs/5700595220):
```
Run sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:04 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:05 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:06 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:07 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:08 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:11 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:12 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:13 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:14 --:--:--     0
curl: (35) OpenSSL SSL_connect: Connection reset by peer in connection to s3.amazonaws.com:443
Error: Process completed with exit code 35.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89014
Approved by: https://github.com/huydhn
2022-11-15 01:01:40 +00:00
7a37bbed15 Take input striding for conv fusion op based on eager output (#88864)
As in https://github.com/pytorch/pytorch/pull/88706, we also change the input stride check to use the eager output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88864
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-15 00:55:07 +00:00
0544a32ba3 [inductor] fix could not find as_strided with config.triton.mm=triton (#88946)
Summary: ReinterpretView doesn't seem to be handled properly with matrix multiply Triton kernels

Reviewed By: bertmaher

Differential Revision: D40836677

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88946
Approved by: https://github.com/jansel
2022-11-15 00:48:49 +00:00
92c78f37af improving torch.linalg.lstsq documentation formatting (#89013)
Fixes #80441

The highlighting in the documentation for torch.linalg.lstsq was incorrect due to a newline that sphinx doesn't parse correctly.  Instead of writing the tensors directly, I used randn to generate the tensors.  This seems to be more consistent with how other documentation is written.
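
A short example in the spirit of the updated docs, using randn-generated operands:

```python
import torch

A = torch.randn(3, 2)
B = torch.randn(3, 4)
X = torch.linalg.lstsq(A, B).solution   # least-squares solution of A @ X = B
print(X.shape)                          # torch.Size([2, 4])
```
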

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89013
Approved by: https://github.com/lezcano
2022-11-14 23:58:46 +00:00
8df64abc6d Fix some naughty uses of reshape/flatten (#88999)
Mutating after reshape/flatten is bad! And it turns out
the corresponding view operations are guaranteed to work
too.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88999
Approved by: https://github.com/albanD
2022-11-14 23:38:35 +00:00
c53a5ac6cc Revert "support running test_mobile_profiler with buck1/buck2 and OSS (#89001)"
This reverts commit 3b33a2794e07b5216aa473da67755af3aa6e6433.

Reverted https://github.com/pytorch/pytorch/pull/89001 on behalf of https://github.com/kit1980 due to Broke trunk / macos-12-py3-x86-64-lite-interpreter / build
2022-11-14 23:36:17 +00:00
3c3bd55bea [testing] fix a key in parse_namespace() (#88969)
This PR fixes an incorrect key name in the `mappings` dict in `parse_namespace()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88969
Approved by: https://github.com/kit1980
2022-11-14 23:24:34 +00:00
911a1349dd [Dynamo] Fix torch.is_tensor and torch.overrides.is_tensor_like (#88704)
Fixes error from 7k github models: https://github.com/jansel/pytorch-jit-paritybench/blob/master/generated/test_arashwan_matrixnet.py

Error:
```
AssertionError: torch.* op returned non-Tensor bool call_function <function is_tensor at 0x7fca94d0faf0>

from user code:
   File "/scratch/ybliang/work/repos/pytorch-jit-paritybench/generated/test_arashwan_matrixnet.py", line 749, in scatter
      return scatter_map(inputs)
   File "/scratch/ybliang/work/repos/pytorch-jit-paritybench/generated/test_arashwan_matrixnet.py", line 741, in scatter_map
      assert not torch.is_tensor(obj), 'Tensors not supported in scatter.'
```
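
A minimal hedged repro in the spirit of the failing pattern quoted above (the function name is illustrative):

```python
import torch
import torch._dynamo as dynamo

@dynamo.optimize("eager")
def scatter_check(obj):
    assert not torch.is_tensor(obj), "Tensors not supported in scatter."
    return len(obj)

print(scatter_check([torch.randn(2), torch.randn(2)]))  # non-Tensor input, no tracing error
```
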

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88704
Approved by: https://github.com/jansel
2022-11-14 22:45:50 +00:00
3b33a2794e support running test_mobile_profiler with buck1/buck2 and OSS (#89001)
Summary:
Internally we are switching to a new version of buck, but we also must
keep this working in OSS.

Test Plan: Rely on CI.

Differential Revision: D41270673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89001
Approved by: https://github.com/r-barnes, https://github.com/osalpekar, https://github.com/malfet
2022-11-14 22:11:29 +00:00
074278f393 [CI] Push latest and hash+CUDAver tags (#88971)
For nightly docker build to simulate the behavior of `push_nightly_docker_ghcr.yml`

Tested in https://github.com/pytorch/pytorch/actions/runs/3465221336/jobs/5787694933

Fixes https://github.com/pytorch/pytorch/issues/88833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88971
Approved by: https://github.com/seemethere
2022-11-14 21:54:46 +00:00
b2082833c6 Revert "woof (#89010)"
This reverts commit 4570bd6030c97577d2fa994857d0a022ef7563a4.

Reverted https://github.com/pytorch/pytorch/pull/89010 on behalf of https://github.com/ezyang due to whoops this actually landed
2022-11-14 21:21:09 +00:00
4570bd6030 woof (#89010)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D41276175](https://our.internmc.facebook.com/intern/diff/D41276175)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89010
Approved by: https://github.com/bigfootjon
2022-11-14 20:58:27 +00:00
f80992217d Remove skip (#88979)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88979
Approved by: https://github.com/voznesenskym
2022-11-14 20:56:17 +00:00
540b42a1a8 [quant][executorch] Support quant fusion for cat in quant in executorch stack (#88960)
Summary:
* added cat in executorch backend config
* added quant fusion for "dq - cat - q" pattern

Test Plan: buck run executorch/exir/tests:quant_fusion_pass -- "executorch.exir.tests.test_quant_fusion_pass.TestQuantFusionPass.test_cat"

Reviewed By: qihqi

Differential Revision: D41111054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88960
Approved by: https://github.com/JacobSzwejbka
2022-11-14 19:27:46 +00:00
e0c194f10b Fix typos in messages under torch (#88961)
This PR fixes typos in messages and parameters in C++ source and header files under the `torch` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88961
Approved by: https://github.com/albanD
2022-11-14 19:06:41 +00:00
3d79ced8cf wrap_pybind_function: support member function pointers (#88932)
This updates `wrap_pybind_function` to use `invoke` and adds the
`invoke_traits` object, which is analogous to `function_traits` except that
for member functions it includes the class as an explicit argument.

To test this is working properly, I've also applied it to the
`CUDAGraph` binding code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88932
Approved by: https://github.com/albanD
2022-11-14 18:47:34 +00:00
36d87465fb Fix long comment error on dashboard (#89002)
Fix dashboard comment failure due to the following trace:
```
Traceback (most recent call last):
  File "/scratch/anijain/dashboard/work/pytorch/benchmarks/dynamo/runner.py", line 1180, in <module>
    DashboardUpdater(args).update()
  File "/scratch/anijain/dashboard/work/pytorch/benchmarks/dynamo/runner.py", line 1119, in update
    self.comment_on_gh(comment)
  File "/scratch/anijain/dashboard/work/pytorch/benchmarks/dynamo/runner.py", line 1096, in comment_on_gh
    subprocess.check_call(
  File "/scratch/anijain/dashboard/env/lib/python3.9/subprocess.py", line 368, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/scratch/anijain/dashboard/env/lib/python3.9/subprocess.py", line 349, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/scratch/anijain/dashboard/env/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/scratch/anijain/dashboard/env/lib/python3.9/subprocess.py", line 1821, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: '/data/home/anijain/miniconda/bin/gh'
srun: error: a100-st-p4d24xlarge-27: task 0: Exited with exit code 1
```
That is, we were trying to execute a gh command in the OS that was too long.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89002
Approved by: https://github.com/davidberard98
2022-11-14 18:43:50 +00:00
cdb798faef _get_nested_attr should return a value in the general case (#88822)
Fixes https://github.com/pytorch/functorch/issues/1053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88822
Approved by: https://github.com/zou3519
2022-11-14 18:39:45 +00:00
f1a5044de0 [primTorch] _refs & opinfo alpha_dropout (#87989)
Add _refs and OpInfo for `nn.functional.alpha_dropout`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87989
Approved by: https://github.com/mruberry
2022-11-14 18:18:45 +00:00
b0c86caa1d Remove cpu path from lobpcg's basis helper (#88984)
Fixes https://github.com/pytorch/pytorch/issues/88650
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88984
Approved by: https://github.com/lezcano
2022-11-14 17:49:30 +00:00
06f1b52705 don't use prims.unsqueeze in group_norm (#88927)
Inductor doesn't have a prims.squeeze lowering, so this breaks it. Longer term, `squeeze` with multiple dimensions is not a prim: nvfuser implements it with a loop, and Inductor uses the `_squeeze_multiple` helper, which turns it into a loop. The prim should accept only a single dimension.
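
A hedged sketch of the "squeeze several dims as a loop of single-dim squeezes" idea mentioned above (the name and exact semantics of Inductor's internal helper may differ):

```python
import torch

def squeeze_multiple(x, dims):
    # squeeze from the highest dim down so earlier squeezes don't shift later indices
    for d in sorted(dims, reverse=True):
        if x.shape[d] == 1:
            x = x.squeeze(d)
    return x

print(squeeze_multiple(torch.randn(1, 3, 1, 4), dims=(0, 2)).shape)  # torch.Size([3, 4])
```
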

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88927
Approved by: https://github.com/eellison
2022-11-14 17:37:24 +00:00
c8f3d1c134 Run test_torchinductor_opinfo CPU tests if triton not installed (#88934)
These tests are not currently run because normal CI workers don't have
triton installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88934
Approved by: https://github.com/ngimel
2022-11-14 15:49:34 +00:00
ec4eadac5b reland "Do not use unsafe restriding for subclasses (#87610)" (#88343)
This reverts commit 5b75b19f51837e162cc0e5e5757dfd9bef437c67.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88343
Approved by: https://github.com/ezyang
2022-11-14 13:42:51 +00:00
9943d46aab TorchDynamo: skip convolution fusion when convolution's padding is string (#88794)
Currently, convolution fusion doesn't support the case where padding is a string; we will support it in a follow-up step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88794
Approved by: https://github.com/jansel, https://github.com/jgong5
2022-11-14 12:39:47 +00:00
15ef0660c5 Fake Tensor For (ConvFusion) Propagation (#88414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88414
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-14 12:35:09 +00:00
5e6cefd258 Revert "Run test_torchinductor_opinfo CPU tests if triton not installed (#88934)"
This reverts commit 8371bb8a3dddbead709bc1e9d26715818a34fa8a.

Reverted https://github.com/pytorch/pytorch/pull/88934 on behalf of https://github.com/peterbell10 due to Inductor tests failing on master
2022-11-14 12:02:43 +00:00
8371bb8a3d Run test_torchinductor_opinfo CPU tests if triton not installed (#88934)
These tests are not currently run because normal CI workers don't have
triton installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88934
Approved by: https://github.com/ngimel
2022-11-14 10:51:12 +00:00
072920c281 TorchDynamo: Add convolution binary+unary fusion for cpu in inference mode (#88412)
This PR is about enabling the fusion of **conv+binary+relu**, which will improve the performance of vision models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88412
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-14 10:35:41 +00:00
cb4842c949 [xla hash update] update the pinned xla hash (#88982)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88982
Approved by: https://github.com/pytorchbot
2022-11-14 10:29:26 +00:00
03296844aa Fix typos in messages under aten (#88964)
This PR fixes typos in messages and parameters in C++ source files under the `aten` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88964
Approved by: https://github.com/lezcano
2022-11-14 09:50:50 +00:00
4ad7b17fab TorchDynamo: Add convolution binary(inplace) fusion for cpu in inference mode (#88403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88403
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-14 08:42:40 +00:00
06486cd008 fix typo: AT_MKLDNN_EBABLED => AT_MKLDNN_ENABLED (#88952)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88952
Approved by: https://github.com/XiaobingSuper
2022-11-14 03:39:46 +00:00
eea506aee1 Revert "Symintify decomps for split and upsample_bilinear; Fix decomp for _softmax_backward_data and native_dropout_backward (#88761)"
This reverts commit 9eabcc370f4c3a04be85cb1f878038f10716bdc3.

Reverted https://github.com/pytorch/pytorch/pull/88761 on behalf of https://github.com/suo due to much broken 9eabcc370f
2022-11-14 01:58:47 +00:00
48dc24ddce Fix: [ATen] Add some missing moves (#88514)
Related to #88512, but for ATen. This should reduce the number of copies and inefficient atomic smart-pointer increments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88514
Approved by: https://github.com/jgong5, https://github.com/ezyang
2022-11-13 22:05:41 +00:00
9eabcc370f Symintify decomps for split and upsample_bilinear; Fix decomp for _softmax_backward_data and native_dropout_backward (#88761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88761
Approved by: https://github.com/ezyang
2022-11-13 21:30:53 +00:00
76af71444a [primTorch] Add ref for complex (#88562)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88562
Approved by: https://github.com/ezyang
2022-11-13 20:31:16 +00:00
8f7e519f12 Skip dynamo benchmark tests under TSAN (#88895)
Summary: Fixes T137546804

Test Plan:
```
buck2 test mode/opt-tsan //caffe2/benchmarks/dynamo:test
buck2 test mode/opt //caffe2/benchmarks/dynamo:test
```

Differential Revision: D41226384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88895
Approved by: https://github.com/anijain2305
2022-11-13 19:42:42 +00:00
52be0c42ab meta function for max_pool2d_with_indices_backward (#88743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88743
Approved by: https://github.com/lezcano, https://github.com/ezyang
2022-11-13 18:31:56 +00:00
98bcb4acb6 Revert "[reland][dynamo] Better support for nn.Module (#88959)"
This reverts commit e950afc3958c9bae5d61cbc99bc088309141df6d.

Reverted https://github.com/pytorch/pytorch/pull/88959 on behalf of https://github.com/malfet due to Broke `test_accuracy_issue1`
2022-11-13 16:21:14 +00:00
897d029a73 [reland][dynamo] fixes dict changed during runtime error (#88877)
Reland https://github.com/pytorch/pytorch/pull/87526

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88877
Approved by: https://github.com/ezyang
2022-11-13 16:20:45 +00:00
4284862db6 [Dynamo][FSDP] Migrate to ModuleWrapPolicy (#88453)
Hello @wconstab! As you saw, `transformer_auto_wrap_policy()` is a misnomer and actually works for any module classes. The PR before this one tries to add a class `ModuleWrapPolicy` that takes in the `module_classes` in its constructor and works just like `transformer_auto_wrap_policy()` without requiring the `functools.partial()`. I hope you do not mind if we update the dynamo benchmarks util file with this migration.

The PR before this one might require some back and forth within FSDP devs, so I apologize for any consequent updates to this PR, which in itself is an easy change. I will request review once we know the previous PR is good for land.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88453
Approved by: https://github.com/wconstab
2022-11-13 14:56:30 +00:00
bca75fd2d3 Move xnnpack taget to fb code base (#88909)
1. Move the source file list to the `build_variables.bzl`, as it's the source of truth for both internal buck build and oss build
2. Move target definitions to `fb` internal folder
3. Some changes are triggered by auto-formatting.

Differential Revision: [D40906961](https://our.internmc.facebook.com/intern/diff/D40906961/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40906961/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88909
Approved by: https://github.com/mcr229
2022-11-13 12:04:35 +00:00
2b12bfce88 [dynamo] Skip frame when graph break in a loop (#88857)
This fixes an excessive recompilation issue in tacotron2 but has a few caveats - https://github.com/pytorch/torchdynamo/issues/330

For tacotron2, the repro is something like this

~~~
        def inner(x):
            return torch.sin(x)

        def fn(x):
            for _ in range(100):
                inner(x)
                torch._dynamo.graph_break()
            return x
~~~

The problem here is that Dynamo has guards on the TUPLE_ITERATOR_LEN whenever a graph break happens. Therefore, we keep on recompiling.

This PR checks if there is a backedge (which helps with while loops) in the presence of a graph break. If there is, Dynamo skips processing this frame. Therefore, Dynamo gets called when `inner` is called, and we compile only once.
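
As a rough, hedged sketch of the idea (not the actual Dynamo implementation; `has_backedge` is a hypothetical helper), a backedge can be detected by scanning the frame's bytecode for jump instructions whose resolved target precedes the instruction itself:

```python
import dis

def has_backedge(code) -> bool:
    # A loop shows up as a jump instruction whose resolved target offset is
    # smaller than the instruction's own offset (i.e. a backward jump).
    for inst in dis.get_instructions(code):
        if "JUMP" in inst.opname and isinstance(inst.argval, int) and inst.argval < inst.offset:
            return True
    return False

def fn(x):
    for _ in range(100):
        x = x + 1
    return x

print(has_backedge(fn.__code__))  # True: the for-loop compiles to a backward jump
```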

Note that, if there was no graph break, we will unroll the original loop, and see one graph with 100 sin operations (just as before, so no changes there).

The caveat is that we are skipping the frame, so if we have something like this

~~~
        def fn(x):
            for _ in range(100):
                # 1000s of lines of PyTorch code
                torch._dynamo.graph_break()
            return x
~~~

Dynamo will skip processing this frame, and might miss out on the optimization.

Completely open for suggestions. Happy to re-implement if there is a better way to handle this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88857
Approved by: https://github.com/jansel, https://github.com/yanboliang
2022-11-13 09:53:38 +00:00
e950afc395 [reland][dynamo] Better support for nn.Module (#88959)
Relanding https://github.com/pytorch/pytorch/pull/88629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88959
Approved by: https://github.com/msaroufim
2022-11-13 08:19:45 +00:00
06ce1338bc [dynamo] Port all pytorch/dynamo and test/dynamo pieces over from symbolic-shapes branch (#88768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88768
Approved by: https://github.com/jansel, https://github.com/ezyang
2022-11-13 04:50:21 +00:00
4f2639e56a [FSDP] Fix FSDP.clip_grad_norm_() for NO_SHARD (#88955)
This PR fixes `FSDP.clip_grad_norm_()` for `NO_SHARD`, which previously "double-counted" each gradient `world_size`-many times.
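
As a hedged illustration of the double-counting (assumed math on made-up tensors, not FSDP's actual code): under `NO_SHARD` every rank holds the full, replicated gradients, so reducing per-rank norms as if they were disjoint shards inflates the result by `world_size ** (1/p)`:

```python
import torch

world_size, p = 4, 2.0
grads = [torch.randn(10), torch.randn(5)]  # identical (replicated) grads on every rank
local_norm = torch.linalg.vector_norm(torch.cat([g.flatten() for g in grads]), p)

# Sharded-style reduction: sum local_norm**p across ranks, then take the p-th root.
double_counted = (world_size * local_norm ** p) ** (1.0 / p)
correct = local_norm  # grads are replicated, not sharded, so the local norm is already global

print((double_counted / correct).item())  # ~ world_size ** (1/p) = 2.0
```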

This does not address any discrepancies between `FULL_SHARD` and DDP. (Note that the unit tests do show parity between `FULL_SHARD` and DDP when using `FSDP.clip_grad_norm_()` and `nn.utils.clip_grad_norm_()` respectively on one iteration.)

The added unit test code path tests mixing nested FSDP instances with both `FULL_SHARD` and `NO_SHARD` to ensure that the `local_sharded_norm` and `local_nonsharded_norm` computations are interoperating correctly. I want to test a non-FSDP root instance in the future, but this is BC-breaking since we would need to make `clip_grad_norm_()` a static method, which would require a different method call syntax (`FSDP.clip_grad_norm_(root_module, ...)` vs. `root_module.clip_grad_norm_(...)`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88955
Approved by: https://github.com/zhaojuanmao
2022-11-13 02:38:38 +00:00
46796fe5e9 Fix XLA symbolic shapes binding (#88928)
Obsoletes https://github.com/pytorch/pytorch/pull/88772

Mostly revolves around NOT assuming that the inner object is a SymNode,
but instead treating it as something duck-typed to behave like a SymNode.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88928
Approved by: https://github.com/SherlockNoMad
2022-11-13 00:31:27 +00:00
2aca97cc9a Vectorized CPU code implementing left shift operator. (#88607)
This PR adds a vectorized implementation for the CPU version of the left shift operator.

All of the tests run by `pytest test/test_ops.py -vk left_shift` pass.

Here are some additional details:

<details>
<summary>
Benchmarking script (written by Philip, with small tweaks by Mario) comparing left shifts with multiplications - on par now
</summary>

```python
import torch
from torch import Tensor
from torch.utils.benchmark import Timer, Compare
from itertools import product
from functools import partial

# These functions exist, because torch.jit.script does not support `torch.iinfo`
def _num_value_bits(dtype):
    if dtype == torch.uint8:
        return 8
    else:  # torch.int32
        return 31

def _max_value(dtype):
    if dtype == torch.uint8:
        return 255
    else:  # torch.int32
        return 2147483647

def bitshift(image, dtype):
    num_value_bits_input = _num_value_bits(image.dtype)
    num_value_bits_output = _num_value_bits(dtype)

    return image.to(dtype).bitwise_left_shift_(num_value_bits_output - num_value_bits_input)

def mul(image, dtype):
    input_max = float(_max_value(image.dtype))
    output_max = float(_max_value(dtype))

    factor = int((output_max + 1) // (input_max + 1))
    image = image.to(dtype)
    return image * factor

size = 256
image = torch.randint(0, 256, (3, size, size), dtype=torch.uint8)
dtype = torch.int32

def gen_inputs():
    devices = ("cpu",)
    fns = (mul, bitshift)
    threads = (1,)
    for device, fn, threads in product(devices, fns, threads):
        yield f"Bitshift {device} {image.dtype}", str(tuple(image.shape)), threads, fn, image, dtype

def benchmark(label, sub_label, threads, f, *args, **kwargs):
    return Timer("f(*args, **kwargs)",
                 globals=locals(),
                 label=label,
                 description=f.__name__,
                 sub_label=sub_label,
                 num_threads=threads).blocked_autorange()

results = []
for args in gen_inputs():
    results.append(benchmark(*args))

compare = Compare(results)
compare.trim_significant_figures()
compare.print()
```
</details>

<details>
<summary>
Test script exercising a large number of combinations of left shift operands that I've used for further testing (validates results by comparing with results generated by NumPy)
</summary>

```python
import numpy as np
import torch

# Testing shifting of non-negative numbers only, but will test all
# possible RHS shift values for given type.  For int8 and int16, we'll
# test shifting all non-negative values representable by the type.  For
# the rest of data types, we'll test shifting some random numbers in
# the corresponding range.
def _create_inputs(dtype):
    info = torch.iinfo(dtype)
    if dtype == torch.int8 or dtype == torch.int16:
        ntests = info.max + 1
        x = torch.arange(info.max + 1, dtype=dtype, device="cpu", requires_grad=False)
    else:
        ntests = 100000
        x = torch.randint(info.max + 1 if dtype != torch.int64 else info.max, (ntests,), dtype=dtype, device="cpu", requires_grad=False)
    y = torch.tensor(range(info.bits), dtype=dtype, device="cpu", requires_grad=False)
    xy = torch.cartesian_prod(x, y)
    return (xy[:, 0], xy[:, 1])

torch.manual_seed(0)

# Perform testing for each datatype supported, and compare results
# with ones generated by numpy.
for dtype in (torch.int8, torch.int16, torch.int32, torch.int64):
    (x, y) = _create_inputs(dtype)
    z = x << y
    xnp = x.numpy()
    ynp = y.numpy()
    znp = z.numpy()
    assert((znp == (xnp << ynp)).all())
```
</details>

<details>
<summary>
Benchmarking script running the left shift operator on tensors of different length (and varying number of bits to shift)
</summary>

```python
import torch
import pickle
import itertools
from torch.utils.benchmark import Timer, Compare

torch.manual_seed(0)

# Edit this part if needed.
lengths = [1024, 4096, 16384, 65536]
rhss = [1, 2, 7, 8, 15, 16, 31, 32, 63, 64]

benchmark_name = "lshift"
label = ""
dtypes = [torch.int8, torch.int16, torch.int32, torch.int64]
results = []

# Create an argument pair for testing.  Arguments are tensors of given
# datatype and length; the LHS for each shift operation is a random
# number, and the RHS is a given value that is the same for all of them.
def _make_args(dtype, length, rhs):
    info = torch.iinfo(dtype)
    imax = info.max
    return (torch.randint(info.max, (length,), dtype=dtype, device="cpu", requires_grad=False),
            rhs * torch.ones((length,), dtype=dtype, device="cpu", requires_grad=False))

# Run the shift operation for vectors of given lengths and for a given
# number of bits to be shifted, and remember timings.
for dtype, length, rhs in itertools.product(dtypes, lengths, rhss):
    x, y = _make_args(dtype, length, rhs)
    timer = Timer("x << y",
                  globals=globals(),
                  label=benchmark_name,
                  description=label,
                  sub_label=f"dtype={dtype},length={length}",
                  num_threads=1)
    results.append(timer.blocked_autorange())

# Gather results.
compare = Compare(results)
compare.trim_significant_figures()
compare.print()

# Print results.
with open("{}.pickle".format(label), "wb") as f:
    pickle.dump(results, f)
```
</details>

<details>
<summary>
Results of running the above benchmarking script - results manually merged for runs of viable/strict (labeled "master" in the table below) and my branch (labeled "mybranch" in the table below)
</summary>

```
[------------------- lshift -------------------------------]
                                      |  master	|  mybranch
1 threads: ------------------------------------------------
      dtype=torch.int8,length=1024    |     3  	|      3
      dtype=torch.int8,length=4096    |     5  	|      3
      dtype=torch.int8,length=16384   |    14  	|      5
      dtype=torch.int8,length=65536   |    51  	|     15
      dtype=torch.int16,length=1024   |     3  	|      3
      dtype=torch.int16,length=4096   |     4  	|      3
      dtype=torch.int16,length=16384  |    11  	|      5
      dtype=torch.int16,length=65536  |    39  	|     13
      dtype=torch.int32,length=1024   |     3  	|      2
      dtype=torch.int32,length=4096   |     4  	|      3
      dtype=torch.int32,length=16384  |    10  	|      4
      dtype=torch.int32,length=65536  |    35  	|     12
      dtype=torch.int64,length=1024   |     3  	|      3
      dtype=torch.int64,length=4096   |     4  	|      3
      dtype=torch.int64,length=16384  |    11  	|      6
      dtype=torch.int64,length=65536  |    36  	|     20

Times are in microseconds (us).
```
</details>

All of the testing/benchmarking was conducted on qpu3, which supports AVX2 only.  For basic validation of the AVX-512 update of the left shift implementation for 8-bit operands (the only one that is non-trivial in the AVX-512 case), [Compiler Explorer](https://godbolt.org/) was used, with GCC trunk and `-mavx512f -mavx512bw` flags added.  Here are further details:

<details>
<summary>
C program used for basic validation of AVX-512 vectorized version for 8-bit operands
</summary>

```
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#include <immintrin.h>

static void print_m512i_int8(const __m512i* x)
{
    int8_t val[64];
    memcpy(val, x, sizeof(val));
    for (int i = 0; i < 64; ++i) {
        if (i > 0)
            printf(", ");
        printf("%d", (int)val[i]);
    }
    printf("\n");
}

int main()
{
    __m512i a = _mm512_set_epi8(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                                1);
    __m512i b = _mm512_set_epi8(7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6,
                                5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
                                3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2,
                                1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
                                0);

  // ------- Copied code from vec512_int.h

  // Mask used to set upper 8 bits of each 16-bit value to 0, and keep
  // lower 8 bits.
  __m512i mask = _mm512_set_epi16(0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff);

  // Convert 8-bit operands from lower lanes to 16-bit values, and
  // perform vectorized shift.  Make sure that upper 8 bits of 16-bit
  // results are all 0.
  __m256i a_lo_8 = _mm512_extracti64x4_epi64(a, 0);
  __m256i b_lo_8 = _mm512_extracti64x4_epi64(b, 0);
  __m512i a_lo_16 = _mm512_cvtepi8_epi16(a_lo_8);
  __m512i b_lo_16 = _mm512_cvtepi8_epi16(b_lo_8);
  __m512i c_lo_16 = _mm512_and_si512(_mm512_sllv_epi16(a_lo_16, b_lo_16), mask);

  // Convert 8-bit operands from upper lanes to 16-bit values, and
  // perform vectorized shift.  Make sure that upper 8 bits of 16-bit
  // results are all 0.
  __m256i a_hi_8 = _mm512_extracti64x4_epi64(a, 1);
  __m256i b_hi_8 = _mm512_extracti64x4_epi64(b, 1);
  __m512i a_hi_16 = _mm512_cvtepi8_epi16(a_hi_8);
  __m512i b_hi_16 = _mm512_cvtepi8_epi16(b_hi_8);
  __m512i c_hi_16 = _mm512_and_si512(_mm512_sllv_epi16(a_hi_16, b_hi_16), mask);

  // Cast 16-bit results back into 8-bit values and merge them
  // together (using unsigned saturation with higher 8 bits set to 0
  // above ensures that results are correct).  Values are merged per
  // lanes, so this is not yet the final result.
  __m512i c_perm = _mm512_packus_epi16(c_lo_16, c_hi_16);

  // Permute values so that final result is produced.
  __m512i idx = _mm512_set_epi64(7, 5, 3, 1, 6, 4, 2, 0);
  __m512i c = _mm512_permutexvar_epi64(idx, c_perm);

  // ------- End copied

    print_m512i_int8(&c);
    // Expected output: 1(x8), 2(x8), 4(x8), 8(x8), 16(x8), 32(x8), 64(x8), -128(x8) (128 interpreted as int8)

    return 0;
}
```
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88607
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/peterbell10
2022-11-13 00:31:11 +00:00
df1df9d10a [16/N] Add _allgather_base custom op with CPU/CUDA implementation (#88889)
Differential Revision: [D41227739](https://our.internmc.facebook.com/intern/diff/D41227739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88889
Approved by: https://github.com/kwen2501
2022-11-12 22:31:07 +00:00
3765621356 torchdynamo support self.modules() for nn_module (#88695)
This PR allows models to call self.modules() during dynamo tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88695
Approved by: https://github.com/voznesenskym
2022-11-12 20:00:51 +00:00
27dc03e09b Turn internal assert when saved tensor is detached inplace into torch check (#88860)
Fixes https://github.com/pytorch/pytorch/issues/88809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88860
Approved by: https://github.com/albanD
2022-11-12 18:33:18 +00:00
4270bb37da [primTorch] Improve narrow and narrow_copy: refs, tests, docs (#87045)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87045
Approved by: https://github.com/mruberry
2022-11-12 15:03:50 +00:00
6e5f736d86 [15/N] Add allreduce_coalesced custom op with CPU/CUDA implementations (#88846)
Differential Revision: [D41227740](https://our.internmc.facebook.com/intern/diff/D41227740)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88846
Approved by: https://github.com/kwen2501
2022-11-12 14:23:45 +00:00
ae2c668cc0 Revert "[dynamo][api] Better support of torch.nn.Module (#88629)"
This reverts commit c83348597b195f2da1cca0e8318c878b104bce5d.

Reverted https://github.com/pytorch/pytorch/pull/88629 on behalf of https://github.com/anijain2305 due to job failing on master https://github.com/pytorch/pytorch/actions/runs/3449914495/jobs/5758267231
2022-11-12 07:52:56 +00:00
6b775c42dd [quant][executorch] Support quant fusion for reshape in quant in executorch stack (#88858)
Summary: This diff added support for fusing a "dq - reshape - q" pattern into a single reshape op; the op is needed in the wakeword model
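
As a small numerical check of why this fusion is valid (a generic sketch with made-up values, assuming the dq/q pair share the same qparams; this is not the executorch pass itself):

```python
import torch

x = torch.randn(2, 8)
scale, zero_point = 0.05, 0
q = torch.quantize_per_tensor(x, scale, zero_point, torch.qint8)

# Original pattern: dequantize -> reshape -> quantize (with matching qparams).
ref = torch.quantize_per_tensor(q.dequantize().reshape(4, 4), scale, zero_point, torch.qint8)
# Fused form: reshape directly on the quantized tensor.
fused = q.reshape(4, 4)

print(torch.equal(fused.int_repr(), ref.int_repr()))  # True
```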

Test Plan: buck test executorch/exir/tests:quant_fusion_pass

Reviewed By: qihqi, JacobSzwejbka

Differential Revision: D41111069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88858
Approved by: https://github.com/JacobSzwejbka
2022-11-12 07:52:44 +00:00
34641c4384 Revert "Add comprehensive minifier tests (#88022)"
This reverts commit 5ff600aa6e40c6b4d426594bbb1f446f005b7fb3.

Reverted https://github.com/pytorch/pytorch/pull/88022 on behalf of https://github.com/wconstab due to Seems to be causing CI failures relating to minifier test and some /tmp/ path not existing
2022-11-12 05:16:41 +00:00
c83348597b [dynamo][api] Better support of torch.nn.Module (#88629)
This is an API change, so please review carefully.

With this PR, torchdynamo returns an `OptimizedModule` class object, a subclass of `torch.nn.Module`, when asked to optimize an `nn.Module` object. Most of the methods are redirected to the original `nn.Module`, which is installed as `_mod` in the `OptimizedModule`.

This is helpful for many cases

```
mod = MockModule()

opt_mod = torch._dynamo.optimize()(mod)

print(opt_mod) # Works

opt_mod = opt_mod.to(device="cuda")
print(opt_mod) # Works
opt_mod(input) # Triggers recompile if necessary, earlier we were shedding the TorchDynamo wrapper

opt_mod.parameters() # Refers to the original module

```

Topics unclear to me
* I have overridden many methods to raise NotImplementedError. A careful review of those will be good.
* hooks
* For the optimized forward, should we call torchdynamo optimization on `__call__` or `forward`
* What else to test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88629
Approved by: https://github.com/Chillee, https://github.com/jansel, https://github.com/msaroufim
2022-11-12 04:45:17 +00:00
d01bf1d1f1 [FSDP] Introduce ModuleWrapPolicy for simplicity (#88450)
**BC Breaking Change**
This renames `unwrapped_params` to `nonwrapped_numel`. I prefer `nonwrapped` over `unwrapped` because "unwrap" suggests that some wrapping has been undone. I prefer `numel` over `params` because that is the unit of measurement; I think we should keep "params" to refer to `nn.Parameter`s themselves.

This only breaks anything that passes `unwrapped_params` as a keyword argument, but I did not see anything that did that (except the one internal benchmark file but that does not actually depend on our `pytorch` code).

In a follow-up, I want to rename `min_num_params` to `min_nonwrapped_numel` in `size_based_auto_wrap_policy`, which is also BC breaking. Again, this is to differentiate between "params" being `nn.Parameter`s and "numel" being the unit for `param.numel()`.

**Overview**
This PR introduces `ModuleWrapPolicy` as a lightweight layer over the existing `transformer_auto_wrap_policy`. The most common auto wrapping paradigm is:
```
module_classes: Set[Type[nn.Module]] = ...
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls=module_classes,
)
fsdp_model = FSDP(model, auto_wrap_policy=auto_wrap_policy, ...)
```
Now, users can instead write:
```
auto_wrap_policy = ModuleWrapPolicy(module_classes)
fsdp_model = FSDP(model, auto_wrap_policy=auto_wrap_policy, ...)
```
This hides the unused arguments expected from the callable (`recurse` and `unwrapped_params`/`nonwrapped_numel`).

`ModuleWrapPolicy` inherits from an abstract base class `FSDPPolicy` that expects a `policy` property. This decouples the construction of such `FSDPPolicy` classes from their actual `policy`, which must abide by the `_recursive_wrap` interface. Any existing auto wrap policy can be rewritten as a class that inherits from `FSDPPolicy`, so this approach is fully backward compatible from a functionality perspective.

I call this base class `FSDPPolicy` to generalize over the cases where we may not want to actually perform any nested wrapping. In reality, the policy is meant for constructing `FlatParameter`s, which just happened to be induced by a nested wrapping before. Given this, I am changing the constructor argument in `fully_shard()` to simply `policy` instead of `auto_wrap_policy`.

This PR migrates usages of `transformer_auto_wrap_policy` within our unit test suite to `ModuleWrapPolicy` as much as possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88450
Approved by: https://github.com/zhaojuanmao
2022-11-12 04:14:32 +00:00
b2b0a0d3ba [vision hash update] update the pinned vision hash (#88920)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88920
Approved by: https://github.com/pytorchbot
2022-11-12 03:21:08 +00:00
ae4074669e [FSDP][state_dict][6/N] Remove most FSDP module dependency from _optim_utils (#88638)
**What**
This PR removes most `FullyShardedDataParallel` dependencies from `optim_utils`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88638
Approved by: https://github.com/awgu
2022-11-12 03:16:37 +00:00
4108367123 Exclude poolformer_m36 from the inductor model test (#88908)
Summary: The root cause is still to be investigated. Issue tracked at
https://github.com/pytorch/torchdynamo/issues/1856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88908
Approved by: https://github.com/malfet
2022-11-12 03:10:25 +00:00
1e2327baf7 fix fx tests (#88886)
Summary:
Some source files are missing and TPX couldn't handle the default test
names.

Test Plan: Rely on CI.

Differential Revision: D41218564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88886
Approved by: https://github.com/zou3519
2022-11-12 02:23:48 +00:00
66736ff425 Fix bug in OptionalTensorList (#88887)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88887
Approved by: https://github.com/anjali411
2022-11-12 02:19:46 +00:00
2b166532f7 Remove incorrect assert about hermetic state. (#88885)
I'm not sure why I thought this assert was valid in the first
place, and there's no comment about it.

The assert is tantamount to saying, "no tensor objects should
become dead via SafePyObject when hermetic mode is on."  But
suppose we run a Python GC while we're inside hermetic mode.
This could result in us disposing non-hermetic tensors, which
would hit decref.  So the assert seems invalid.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88885
Approved by: https://github.com/anjali411, https://github.com/malfet
2022-11-12 02:19:45 +00:00
2cd05a2818 Support torch.qint32 in Convert (#88871)
Enable `torch.qint32` when creating the `quantize_per_tensor` function call in `convert_fx`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88871
Approved by: https://github.com/jerryzh168
2022-11-12 01:20:52 +00:00
a3f3ec8fac [FSDP+dynamo]: forward treats parameter-views as params (#88781)
Dynamo+AotAutograd needs a way to wrap all tensors (whether
inputs or params/buffers) in FakeTensor wrappers, and
FSDP's mangling of parameters hides them from this wrapping.

This PR unblocks running hf_bert and hf_T5 with FSDP under dynamo, whether using recursive wrapping around transformer layers or only applying FSDP around the whole model.  Perf/memory validation and possibly optimization is the next step.
`python benchmarks/dynamo/distributed.py --torchbench_model hf_Bert --fsdp --dynamo aot_eager`
`python benchmarks/dynamo/distributed.py --torchbench_model hf_Bert --fsdp --dynamo aot_eager --fsdp_wrap`
`python benchmarks/dynamo/distributed.py --torchbench_model hf_T5 --fsdp --dynamo aot_eager`
`python benchmarks/dynamo/distributed.py --torchbench_model hf_T5 --fsdp --dynamo aot_eager --fsdp_wrap`

The problem:
Dynamo (actually aot_autograd) trips up with FSDP because it must
wrap all input tensors in FakeTensor wrappers, and it only knows
to wrap graph inputs or named_(parameters, buffers).  FSDP's
pre_forward hook sets views (which are not nn.param) into the flatparam
as attrs on the module with the same name as the original param, but
they will not show up in named_parameters.

- in use_orig_params mode, FSDP still de-registers
  params during pre-forward hook, then re-registers them
  post-forward
- during forward (between the hooks), the params are setattr'd
  on the module as regular view tensors, not nn.Parameters
- note: use_orig_params is the recommended way to use FSDP,
  and use_orig_params=False is being deprecated.  So i only consider
  use_orig_params=True for this enablement

The solution:
- adding them to named_buffers is not possible because it interferes
  with how FSDP's `_apply` works
- since they are not actual nn.parameters, register_parameter will
  complain about registering them
- simply setting `module._parameters[name] = view` seems to be a viable
  workaround, despite being hacky, and FSDP code does modify _parameters
  directly already (see the sketch after this list).
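
A hypothetical sketch of that workaround (illustrative names and shapes, not FSDP's actual hook code): plain-tensor views are written straight into `module._parameters` so they show up like parameters during tracing:

```python
from typing import Dict

import torch
import torch.nn as nn

def register_views_as_params(module: nn.Module, views: Dict[str, torch.Tensor]) -> None:
    for name, view in views.items():
        # The views are plain tensors, so register_parameter() would reject them;
        # writing into _parameters directly is the (hacky) workaround.
        module._parameters[name] = view  # type: ignore[assignment]

lin = nn.Linear(4, 4)
flat = torch.cat([p.detach().flatten() for p in lin.parameters()])  # stand-in for a flat param
register_views_as_params(lin, {"weight": flat[:16].view(4, 4), "bias": flat[16:].view(4)})
print([type(p) for p in lin.parameters()])  # plain Tensors, yet still visible via parameters()
```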

Note: Manual checkpointing still isn't working with FSDP+dynamo,
so that will have to be addressed in a follow up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88781
Approved by: https://github.com/ezyang, https://github.com/awgu
2022-11-12 01:17:23 +00:00
5ff600aa6e Add comprehensive minifier tests (#88022)
Adds tests for https://github.com/pytorch/torchdynamo/issues/1241.

To run: `pytest test/dynamo/test_minifier.py`.

Actually runs minifier launcher script and repro scripts, rather than just checking for existence of the minifier launcher script.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88022
Approved by: https://github.com/mlazos, https://github.com/anijain2305
2022-11-12 00:22:25 +00:00
37c5b42fa6 Fix matmul decomp to use reshape instead of contiguous().view() (#88832)
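
A small, generic illustration of the difference (my own example, not the decomposition code): `reshape()` returns a view whenever the shape change is expressible without a copy, while `contiguous().view()` copies any non-contiguous input first:

```python
import torch

x = torch.randn(4, 6)[:, :4]       # non-contiguous slice, shape (4, 4)

r = x.reshape(2, 2, 4)             # still expressible as a view: no copy
c = x.contiguous().view(2, 2, 4)   # contiguous() copies first, then views the copy

print(r.data_ptr() == x.data_ptr())  # True  -- reshape avoided the copy
print(c.data_ptr() == x.data_ptr())  # False -- contiguous().view() forced one
```
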
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88832
Approved by: https://github.com/bertmaher, https://github.com/ngimel
2022-11-12 00:15:42 +00:00
7c3adddd6c [functorch] delete some unused files (#88763)
Some post-merge cleanup.
- packaging/ was for building standalone Windows binaries
- our flake8 config got superseded by PyTorch's.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88763
Approved by: https://github.com/samdow
2022-11-11 23:58:51 +00:00
a7fa423f48 copy_: Short-circuit when self and src view the same data (#88884)
This comes up if you use inplace operators on a slice, e.g.
```python
import torch
a = torch.rand(1000000, device="cuda")
a[::2] *= 2
```

The last line looks as if it should be fully inplace, but is actually
equivalent to:

```python
tmp = a[::2]
tmp *= 2
a[::2] = tmp
```

Which results in `mul_` and `copy_` being called. With this PR, the
redundant copy becomes a no-op and the above example is 2x faster.
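
A hedged Python-level approximation of the short-circuit condition (the real check lives in the C++ `copy_` implementation; `is_redundant_copy` is an illustrative name):

```python
import torch

def is_redundant_copy(dst: torch.Tensor, src: torch.Tensor) -> bool:
    # If destination and source alias the same memory with identical layout,
    # copying src into dst would write every element onto itself.
    return (
        dst.data_ptr() == src.data_ptr()
        and dst.stride() == src.stride()
        and dst.size() == src.size()
        and dst.dtype == src.dtype
    )

a = torch.rand(1000000)
tmp = a[::2]
tmp *= 2
print(is_redundant_copy(a[::2], tmp))  # True: the copy back into a[::2] is redundant
```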
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88884
Approved by: https://github.com/ngimel
2022-11-11 23:31:15 +00:00
6fe47b682f [Dynamo] Fix str(Guard.obj_weakref) bug to re-enable support for overriding __getattr__ (#88564)
See my inline comments!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88564
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2022-11-11 22:31:32 +00:00
be8d88f8d0 [DataLoader] Removing DataLoader2 related code (#88848)
Removing these lines of code as `DataLoader2` has been added to [TorchData](https://github.com/pytorch/data). I'm importing this to confirm it will not impact internal code.

Differential Revision: [D41201578](https://our.internmc.facebook.com/intern/diff/D41201578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88848
Approved by: https://github.com/ejguan
2022-11-11 22:27:01 +00:00
f39cad50b7 Make InductorCPU usable internally (#88870)
Test Plan: `buck2 test mode/opt //caffe2/test:test_inductor -- --exact 'caffe2/test:test_inductor - test_dtype_mismatch_issue_cuda (caffe2.test.inductor.test_torchinductor.CudaTests)'`

Differential Revision: D41206109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88870
Approved by: https://github.com/izaitsevfb
2022-11-11 22:07:34 +00:00
fbc1878265 [ONNX] Pretty print diagnostic logging (#88261)
Adds pretty print diagnostic logging. For example
```python
import io
import torch
from torch.onnx._internal import diagnostics

class CustomAdd(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        return x + y

    @staticmethod
    def symbolic(g, x, y):
        return g.op("custom::CustomAdd", x, y)

class M(torch.nn.Module):
    def forward(self, x):
        return CustomAdd.apply(x, x)

# trigger warning for missing shape inference.
# rule = diagnostics.rules.node_missing_onnx_shape_inference
torch.onnx.export(M(), torch.randn(3, 4), io.BytesIO())
```

By default, a minimal summary of the diagnostics is shown:
```
========= Diagnostic Run torch.onnx.export version 1.14.0a0+git90a69c5 =========
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 3 WARNING 0 ERROR ========================
3 WARNING were not printed due to the log level.
```

Adjusting the `verbose` and `level` arguments:
```python
diagnostics.engine.pretty_print(verbose=True, level=diagnostics.levels.WARNING)
```

Prints full log.
```
=============================== 1 Diagnostic Run ===============================
========= Diagnostic Run torch.onnx.export version 1.14.0a0+git90a69c5 =========
verbose: True, log level: Level.WARNING
======================= 0 NONE 0 NOTE 3 WARNING 0 ERROR ========================
WARNING: node-missing-onnx-shape-inference
==========================================
The shape inference of custom::CustomAdd type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
--------------------------- Stack: Python call stack ---------------------------
frame: diagnostic = ExportDiagnostic(rule, level, message, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/diagnostics/_diagnostic.py:151
frame: n, utils._params_dict, GLOBALS.export_onnx_opset_version /home/bowbao/pytorch_dev/torch/onnx/_patch_torch.py:82
frame: <@beartype(torch.onnx._patch_torch._graph_op) at 0x7f62184b6710>:78
frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81
frame: return function(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_deprecation.py:30
frame: return g.op("custom::CustomAdd", x, y) test_pretty_print.py:14
frame: return symbolic_fn(g, *args) /home/bowbao/pytorch_dev/torch/onnx/utils.py:1716
frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81
frame: graph = _C._jit_pass_onnx(graph, operator_export_type) /home/bowbao/pytorch_dev/torch/onnx/utils.py:663
frame: <@beartype(torch.onnx.utils._optimize_graph) at 0x7f62180e05f0>:85
frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81
frame: module=module, /home/bowbao/pytorch_dev/torch/onnx/utils.py:1123
frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81
frame: dynamic_axes=dynamic_axes, /home/bowbao/pytorch_dev/torch/onnx/utils.py:1539
frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81
frame: export_modules_as_functions=export_modules_as_functions, /home/bowbao/pytorch_dev/torch/onnx/utils.py:519
frame: <@beartype(torch.onnx.utils.export) at 0x7f62180e0170>:347
frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81
frame: torch.onnx.export(M(), torch.randn(3, 4), io.BytesIO()) test_pretty_print.py:22
---------------------------- Stack: C++ call stack -----------------------------
frame: (<unknown frame>)
frame: (<unknown function> + 0x88411b (0x7f625b36011b in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (torch::jit::UpdateReliable(torch::jit::Value*, std::pair<bool, bool> const&) + 0x7d3 (0x7f625b351743 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (torch::jit::UpdateReliable(torch::jit::Node*) + 0x4f (0x7f625b35198f in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (torch::jit::ONNXShapeTypeInference(torch::jit::Node*, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&, int) + 0xac9 (0x7f625b357179 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (<unknown function> + 0xabd026 (0x7f625b599026 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (<unknown function> + 0x3c0fda (0x7f625ae9cfda in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (<unknown frame>)

WARNING: node-missing-onnx-shape-inference
==========================================
The shape inference of custom::CustomAdd type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
--------------------------- Stack: Python call stack ---------------------------
frame: diagnostic = ExportDiagnostic(rule, level, message, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/diagnostics/_diagnostic.py:151
frame: graph, params_dict, GLOBALS.export_onnx_opset_version /home/bowbao/pytorch_dev/torch/onnx/utils.py:688
frame: <@beartype(torch.onnx.utils._optimize_graph) at 0x7f62180e05f0>:85
frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81
frame: module=module, /home/bowbao/pytorch_dev/torch/onnx/utils.py:1123
frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81
frame: dynamic_axes=dynamic_axes, /home/bowbao/pytorch_dev/torch/onnx/utils.py:1539
frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81
frame: export_modules_as_functions=export_modules_as_functions, /home/bowbao/pytorch_dev/torch/onnx/utils.py:519
frame: <@beartype(torch.onnx.utils.export) at 0x7f62180e0170>:347
frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81
frame: torch.onnx.export(M(), torch.randn(3, 4), io.BytesIO()) test_pretty_print.py:22
---------------------------- Stack: C++ call stack -----------------------------
frame: (<unknown frame>)
frame: (<unknown function> + 0x88411b (0x7f625b36011b in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (torch::jit::UpdateReliable(torch::jit::Value*, std::pair<bool, bool> const&) + 0x7d3 (0x7f625b351743 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (torch::jit::UpdateReliable(torch::jit::Node*) + 0x4f (0x7f625b35198f in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (torch::jit::ONNXShapeTypeInference(torch::jit::Node*, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&, int) + 0xac9 (0x7f625b357179 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (<unknown function> + 0x87d6d1 (0x7f625b3596d1 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (torch::jit::ONNXShapeTypeInference(std::shared_ptr<torch::jit::Graph>&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&, int) + 0x33 (0x7f625b359cf3 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (<unknown function> + 0xabdbae (0x7f625b599bae in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (<unknown function> + 0x3c0fda (0x7f625ae9cfda in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (<unknown frame>)

WARNING: node-missing-onnx-shape-inference
==========================================
The shape inference of custom::CustomAdd type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.
--------------------------- Stack: Python call stack ---------------------------
frame: diagnostic = ExportDiagnostic(rule, level, message, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/diagnostics/_diagnostic.py:151
frame: graph, params_dict, GLOBALS.export_onnx_opset_version /home/bowbao/pytorch_dev/torch/onnx/utils.py:1179
frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81
frame: dynamic_axes=dynamic_axes, /home/bowbao/pytorch_dev/torch/onnx/utils.py:1539
frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81
frame: export_modules_as_functions=export_modules_as_functions, /home/bowbao/pytorch_dev/torch/onnx/utils.py:519
frame: <@beartype(torch.onnx.utils.export) at 0x7f62180e0170>:347
frame: return beartyped(*args, **kwargs) /home/bowbao/pytorch_dev/torch/onnx/_internal/_beartype.py:81
frame: torch.onnx.export(M(), torch.randn(3, 4), io.BytesIO()) test_pretty_print.py:22
---------------------------- Stack: C++ call stack -----------------------------
frame: (<unknown frame>)
frame: (<unknown function> + 0x88411b (0x7f625b36011b in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (torch::jit::UpdateReliable(torch::jit::Value*, std::pair<bool, bool> const&) + 0x7d3 (0x7f625b351743 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (torch::jit::UpdateReliable(torch::jit::Node*) + 0x4f (0x7f625b35198f in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (torch::jit::ONNXShapeTypeInference(torch::jit::Node*, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&, int) + 0xac9 (0x7f625b357179 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (<unknown function> + 0x87d6d1 (0x7f625b3596d1 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (torch::jit::ONNXShapeTypeInference(std::shared_ptr<torch::jit::Graph>&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10::IValue, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, c10::IValue> > > const&, int) + 0x33 (0x7f625b359cf3 in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (<unknown function> + 0xabdbae (0x7f625b599bae in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (<unknown function> + 0x3c0fda (0x7f625ae9cfda in /home/bowbao/pytorch_dev/torch/lib/libtorch_python.so))
frame: (<unknown frame>)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88261
Approved by: https://github.com/abock, https://github.com/justinchuby
2022-11-11 21:59:16 +00:00
ea0ec9d71c [torch] BatchBoxCox - fix numerical issue in vectorized code (#88875)
Summary:
Usage of fast math in the BatchBoxCox kernel produced different numerical results between the dev and optimized versions, which caused a few internal tests to fail.
For now, disable the compiler-optimized version and rely on ATen vectors.

Differential Revision: D41211784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88875
Approved by: https://github.com/hyuen
2022-11-11 21:58:23 +00:00
dfb4b73e45 Fix unused variable 'options' warning in RNN.cpp (#88753)
Fixes
```
/home/rbarnes/pytorch/aten/src/ATen/native/cudnn/RNN.cpp:73:17: warning: unused variable 'options' [-Wunused-variable]
  TensorOptions options = TensorOptions().dtype(dtype).layout(layout).device(device).pinned_memory(pin_memory);
                ^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88753
Approved by: https://github.com/soumith
2022-11-11 21:51:13 +00:00
7aa144ac54 [FSDP][state_dict][5/N] Remove the FSDP module dependency from _state_dict_utils (#88637)
**What**
This PR completely removes the `FullyShardedDataParallel` dependency from `_state_dict_utils` -- `_state_dict_utils` now depends only on `_FSDPState` and all the utils modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88637
Approved by: https://github.com/awgu
2022-11-11 21:22:13 +00:00
575e02df53 Fix CUDNN_PATH handling on Windows (#88898)
Fixes https://github.com/pytorch/pytorch/issues/88873
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88898
Approved by: https://github.com/kit1980
2022-11-11 21:19:26 +00:00
f74946324e [fix] allow saving python attr on Tensor and Parameter via torch.save (#81616)
Fixes: https://github.com/pytorch/pytorch/issues/72129

TODO:
* [x] Fix for Parameter

Benchmark
(Measurable diff for small tensors)
```
[-------------- Save and Load --------------]
                    |  After PR  |  Before PR
1 threads: ----------------------------------
      ()            |    111.7   |     106.9
      (4, 4)        |    114.4   |     109.2
      (128, 128)    |    135.2   |     128.3
      (1024, 1024)  |   1431.9   |    1431.3

Times are in microseconds (us).
```

<details>

<summary> Benchmark Script </summary>

```python
import torch
from torch.testing._internal.common_utils import BytesIOContext
from torch.utils import benchmark
import pickle

shapes = ((), (4, 4), (128, 128), (1024, 1024))

sizes = [1, 64, 1024, 10000]
results = []

def save_load_fn(t):
    with BytesIOContext() as f:
        torch.save(t, f)
        f.seek(0)
        torch.load(f)

for shape in shapes:
    t = torch.randn(shape)
    label = 'Save and Load'
    sub_label = f'{shape}'
    results.append(benchmark.Timer(
        stmt='save_load_fn(t)',
        globals={'t': t, 'save_load_fn':save_load_fn},
        label=label,
        sub_label=sub_label,
        description='Before PR',
    ).blocked_autorange(min_run_time=2))

compare = benchmark.Compare(results)
compare.print()

with open('before_pr.pkl', 'wb') as f:
    pickle.dump(results, f)

# with open('after_pr.pkl', 'rb') as f:
#     after_pr = pickle.load(f)

# with open('before_pr.pkl', 'rb') as f:
#     before_pr = pickle.load(f)

# compare = benchmark.Compare(after_pr + before_pr)
# compare.print()
```

</details>

NOTE: **BC-Breaking**: After this PR, all tensors (including regular tensors) will be serialized using `_rebuild_from_type_v2`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81616
Approved by: https://github.com/albanD, https://github.com/kurtamohler
2022-11-11 21:11:12 +00:00
ba4d5aae06 Revert "rename DisableTorchFunction to DisableTorchFunctionSubclass (#88218)"
This reverts commit 7f28be10e5e71efda37800384fa897785499bed1.

Reverted https://github.com/pytorch/pytorch/pull/88218 on behalf of https://github.com/izaitsevfb due to BC-breaking change, D41211901
2022-11-11 19:13:05 +00:00
4e5d7afe84 Revert "add DisableTorchFunction that matches DisableTorchDispatch (#88219)"
This reverts commit c0ecce15b5a54ff0185f9976e6bfb6f3a7de698d.

Reverted https://github.com/pytorch/pytorch/pull/88219 on behalf of https://github.com/izaitsevfb due to BC-breaking change, D41211901
2022-11-11 19:08:30 +00:00
9d7d21f569 [ONNX] Add stack info to diagnostics (#87258)
~~Investigating strange bug releasing 'graph' right when returning from `_C._jit_pass_onnx`.~~
~~Can be repro-ed locally via `test_cpp_diagnose`, with changes in this PR.~~
Resolved by https://github.com/pytorch/pytorch/pull/87829.
This PR adds methods to record stack backtrace information to diagnostics.

* #87830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87258
Approved by: https://github.com/abock
2022-11-11 18:58:15 +00:00
3d1c5c89ed [FSDP][state_dict][4/N] Move the core logic of summon full parameters to _unshard_params_utils.py (#88636)
**What**
`_summon_full_parameters` is required for state_dict. To enable composable FSDP state_dict, `_summon_full_params` must be accessible without `FullyShardedDataParallel`. This PR moves the core logic of `_summon_full_params` to `_unshard_params_utils`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88636
Approved by: https://github.com/awgu
2022-11-11 18:30:57 +00:00
5f0783bd6d Fix ATen Fallback for BUILD_CAFFE2=0 for ONNX-only ops (#88504)
Follow-up for #87735

Once again, because BUILD_CAFFE2=0 is not tested for the ONNX exporter, one scenario slipped through: a use case where the model can be exported without ATen fallback when operator_export_type=ONNX_ATEN_FALLBACK and BUILD_CAFFE2=0.

A new unit test has been added, but it won't prevent regressions if BUILD_CAFFE2=0 is not executed on CI again

Fixes #87313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88504
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2022-11-11 17:43:46 +00:00
8ff2e34ca6 Take input striding for conv forward based on eager output (#88706)
From discussion with @Chillee and @ngimel, we'll likely need further fixes to ensure that we hit channels-last kernels, but this is still worth landing in its own right.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88706
Approved by: https://github.com/ngimel
2022-11-11 17:29:15 +00:00
adfbd831cf Revert "[Autograd] Use in-place input accumulation fast path for dense Tensors. (#88339)"
This reverts commit 8f66ae413f8c9d7f2418d7f0b9f69d409c455b46.

Reverted https://github.com/pytorch/pytorch/pull/88339 on behalf of https://github.com/mehtanirav due to Internal test failures
2022-11-11 17:03:25 +00:00
89a326ff7e Explicitly check filelike arg of torch.save (#88867)
Fixes #88793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88867
Approved by: https://github.com/ezyang
2022-11-11 16:57:08 +00:00
a6832b08a3 Regularize bernoulli_ with bernoulli decomp (#88349)
Fix for https://github.com/pytorch/torchdynamo/issues/1796. Just like the other [bernoulli decomp](https://github.com/pytorch/pytorch/blob/master/torch/_inductor/decomposition.py#L302) we need to pass `dtype=float32` to avoid `"check_uniform_bounds" not implemented` errors.
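
A hedged sketch of the workaround (my own minimal version, not the actual inductor decomposition): sample the uniform noise in float32 and cast the resulting mask back to the input dtype:

```python
import torch

def bernoulli_decomp(x: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    # Sampling in float32 sidesteps the missing "check_uniform_bounds"
    # support for low-precision dtypes.
    noise = torch.rand_like(x, dtype=torch.float32)
    return (noise < p).to(x.dtype)

x = torch.empty(8, dtype=torch.float16)
print(bernoulli_decomp(x, p=0.3))
```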

Are we planning on enabling `TEST_WITH_TORCHINDUCTOR`? Do I need to change anything with the tests?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88349
Approved by: https://github.com/desertfire
2022-11-11 16:53:02 +00:00
1e8f95ace1 Symintify broadcast_to (#88776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88776
Approved by: https://github.com/ezyang
2022-11-11 15:49:43 +00:00
d615d12289 Add meta impl for topk (#88694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88694
Approved by: https://github.com/ezyang
2022-11-11 15:28:41 +00:00
3c7f96665e [FSDP][state_dict][3/N] Change how state_dict utils access attributes in _FSDPState (#88635)
**What This PR Does**
`_state_dict_utils` currently accesses the FSDP states through `module`. To enable composable FSDP state_dict, these accesses need to go through `_FSDPState`. `module` is still required for most APIs, as state_dict has to access per-module information.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88635
Approved by: https://github.com/awgu
2022-11-11 15:20:36 +00:00
b92acee8f8 Add context manager to allow mutation on saved tensors (#79056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79056
Approved by: https://github.com/albanD
2022-11-11 15:18:28 +00:00
91b71cdbe4 [dynamo] Add torch.device to is_safe_constant (#88766)
Test Plan:
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_torch.py -k  test_advancedindex_mixed_cpu_devices_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88766
Approved by: https://github.com/jansel
2022-11-11 15:06:17 +00:00
324ac93a43 [FSDP][state_dict][2/N] Move state_dict related enums/dataclasses/states to state_dict_utils.py, api.py and init_state_dict() (#88481)
**Motivation**:
Several Enums, Dataclasses and states defined in fully_sharded_data_parallel.py should be moved to a place where the composable FSDP can access them. This PR does the move.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88481
Approved by: https://github.com/rohan-varma, https://github.com/awgu
2022-11-11 12:28:37 +00:00
ee91c328da Fix cuda/cpu check on NoneType (#88854)
Summary: Fix cuda/cpu check on NoneType

Test Plan: sandcastle / GitHub CI/CD

Differential Revision: D41203955

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88854
Approved by: https://github.com/drisspg, https://github.com/ngimel
2022-11-11 12:19:31 +00:00
d15a6b0c97 Error on ZeroTensor serialization (#88803)
Follow-up : https://github.com/pytorch/pytorch/pull/88182#issuecomment-1308628415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88803
Approved by: https://github.com/anjali411
2022-11-11 08:51:29 +00:00
b843f4db0a [ONNX] Add test case for onnx::Max scalar type (#88751)
Modeled on the existing minimum (onnx::Min) test cases
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88751
Approved by: https://github.com/wschin, https://github.com/BowenBao
2022-11-11 07:08:56 +00:00
396c3b1d88 Use atomicAdd for bfloat16 in Ampere and above (#84981)
WIP to fix extremely slow `scatter_add` issue vs. fp16. The current changes seem to improve performance, but it still appears to lag behind the fp16 equivalent.

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84981
Approved by: https://github.com/ngimel
2022-11-11 05:23:48 +00:00
a6d72f44a4 [ONNX] Add onnx::Max into standard Op for scalar type alignment (#88750)
Easy fix for onnx::Max ScalarType
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88750
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2022-11-11 04:22:04 +00:00
0de8f047c1 Revert "[dynamo] fixes dict changed during runtime error (#87526)"
This reverts commit cf04b36ce8f531730210b03eaa347977a1c2d75c.

Reverted https://github.com/pytorch/pytorch/pull/87526 on behalf of https://github.com/anijain2305 due to error reported
2022-11-11 04:19:08 +00:00
310335de48 Update lr_scheduler.pyi to match lr_scheduler.py (#88818)
Following #88503, we should also update the pyi file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88818
Approved by: https://github.com/soulitzer
2022-11-11 04:02:44 +00:00
86b7aa26f0 Fix FakeTensorProp on Module with Parameters or Buffers (#88700)
In `FakeTensorMode.__torch_dispatch__`, the output is not always computed by meta kernels in
```python
        try:
            with in_kernel_invocation_manager(self):
                r = func(*args, **kwargs)  # <----- "r" can be a real tensor.
        except NotImplementedError as not_implemented_error:
            # no meta kernel registered, fallback to kernel for the device
            if not self.allow_fallback_kernels:
                raise not_implemented_error
            return run_fallback_kernel(self, func, args, kwargs, not_implemented_error)

        return self.wrap_meta_outputs_with_default_device_logic(r, func, args, kwargs)
```
For example, I observed that a CPU tensor is generated when executing `aten.addmm` while running `FakeTensorProp`. Therefore, I'd like to allow `FakeTensorMode` to wrap a real tensor as a `FakeTensor` during the computation. Does this PR look like a good direction to fix this problem? If yes, I can go ahead and add some tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88700
Approved by: https://github.com/eellison, https://github.com/ezyang
2022-11-11 03:49:29 +00:00
c4fc5d372f [FSDP][state_dict][1/N] Moving state_dict logic to pre_state_dict_hook (#87900)
This is one step toward the ultimate goal: remove the overwritten state_dict in FSDP. All the logic should be either in `pre_state_dict_hook` or `post_state_dict_hook`.

Since the current `nn.Module` does not support `pre_state_dict_hook`, this PR mimics `pre_state_dict_hook` by calling the pre hook inside the post hook, effectively ditching all the work done by `nn.Module.state_dict`. Once `pre_state_dict_hook` is supported by `nn.Module`, these pre hook calls can be moved out from the post hooks and be registered to `nn.Module.pre_state_dict_hook`.

The major issue of this temporary solution is that `post_state_dict_hook` is called from the leaf node to the root node. This makes `module._lazy_init()` invalid, as FSDP assumes `_lazy_init()` is called from the root. As a result, `FSDP.state_dict` currently contains only one piece of logic -- calling `module._lazy_init()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87900
Approved by: https://github.com/rohan-varma
2022-11-11 03:41:40 +00:00
9d09968bbe Disable check for dropout in MultiheadAttention fast_path (#88831)
Since we already enforce eval mode for the fast_path, we do not need to also check for a falsy dropout value, as a model trained with dropout will have a non-zero dropout during eval mode, even though it won't be applied.
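
A quick illustration of the reasoning (my own example): the stored dropout probability stays non-zero after `.eval()`, yet no dropout is actually applied, so gating the fast path on a falsy dropout value is unnecessarily strict:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, dropout=0.1, batch_first=True)
mha.eval()
print(mha.dropout)  # 0.1 -- still non-zero in eval mode

x = torch.randn(1, 4, 8)
with torch.no_grad():
    out1, _ = mha(x, x, x)
    out2, _ = mha(x, x, x)
print(torch.allclose(out1, out2))  # True: dropout is not applied while evaluating
```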

Fixes #88806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88831
Approved by: https://github.com/drisspg
2022-11-11 03:34:57 +00:00
3082378701 [vision hash update] update the pinned vision hash (#88853)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88853
Approved by: https://github.com/pytorchbot
2022-11-11 03:33:58 +00:00
495e7b1c72 Ref for aten.full; symint changes in prim (#88762)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88762
Approved by: https://github.com/ezyang
2022-11-11 02:32:09 +00:00
3fbf748f21 Assert we have triton before scheduling on triton (#88849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88849
Approved by: https://github.com/wconstab, https://github.com/ngimel, https://github.com/jansel
2022-11-11 02:30:29 +00:00
fc9e36dd42 Add meta support for scalar_tensor and argmax (#88590)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88590
Approved by: https://github.com/albanD
2022-11-11 01:31:00 +00:00
c961e45ee5 handle zero dims in reductions (#88280)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88280
Approved by: https://github.com/ngimel
2022-11-11 01:13:57 +00:00
534ae6ae47 [primTorch] Implement group norm reference (#87054)
Add group norm reference
Split from #81191
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87054
Approved by: https://github.com/mruberry
2022-11-11 01:08:20 +00:00
072834d56d [ao] qconfig_mapping.py fixing public v private (#87518)
Summary: made _GLOBAL_DICT_KEY, _OBJECT_TYPE_DICT_KEY,
_MODULE_NAME_REGEX_DICT_KEY, _MODULE_NAME_DICT_KEY,
_MODULE_NAME_OBJECT_TYPE_ORDER_DICT_KEY private

Test Plan: python test/test_public_bindings.py

Differential Revision: [D40709278](https://our.internmc.facebook.com/intern/diff/D40709278)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87518
Approved by: https://github.com/jcaip
2022-11-11 00:32:24 +00:00
f9221bf53b [pytorch] Enable memory map file support for Android, Apple, and CXX (#88545)
Summary: See title.  Left Windows out so it still compiles.

Test Plan:
Add a `#fail` below [this line](https://fburl.com/code/p0mlhlw4) and build for various platforms and confirm it fails which proves the `#ifdef` was hit.

```
buck2 build xplat/langtech/tuna/cli:tuclixAndroid
buck2 build xplat/langtech/tuna/cli:tuclix
```

CI/CD for the rest.

Differential Revision: D41054824

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88545
Approved by: https://github.com/qihqi
2022-11-11 00:19:20 +00:00
8441443132 Revert "Add nondeterministic error for scatter (#88244)"
This reverts commit e940a2f8e2a3aa9d98291e73b3d40fcffb6182c8.

Reverted https://github.com/pytorch/pytorch/pull/88244 on behalf of https://github.com/mehtanirav due to Internal test failures
2022-11-10 23:56:49 +00:00
62ef15e320 [MPS] Fix test_embedding_dense_backward (#88847)
By copying the randomly initialized weight distribution from the MPS `nn.Embedding` to the `cpu` one

Test plan: `python test_mps.py -k test_embedding_dense_backward --repeat 150`

Fixes https://github.com/pytorch/pytorch/issues/88679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88847
Approved by: https://github.com/seemethere
2022-11-10 23:52:27 +00:00
b30222e0c4 [Dynamo] Add complete support for Tensor.is_contiguous (#88407)
Fixes https://github.com/pytorch/torchdynamo/issues/1783

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88407
Approved by: https://github.com/jansel
2022-11-10 23:47:21 +00:00
ae01615d75 Fix cupti search path in CMake (#88657)
Minor fix for when cuda is installed via conda. In this case the libraries are in `lib` and not `lib64`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88657
Approved by: https://github.com/kit1980, https://github.com/malfet
2022-11-10 23:44:52 +00:00
d9ad08ce8a Symbolic shape: sym_floor , sym_sqrt, sym_int (#88760)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88760
Approved by: https://github.com/ezyang
2022-11-10 23:41:33 +00:00
cc04cf50bf [Inductor] Fix lowmem_dropout() missing 1 required positional argument: 'p' (#88716)
Fixes error from 7k github models: https://github.com/jansel/pytorch-jit-paritybench/blob/master/generated/test_GuYuc_WS_DAN_PyTorch.py

Error:
```
TypeError: lowmem_dropout() missing 1 required positional argument: 'p'

While executing %lowmem_dropout : [#users=1] = call_function[target=torch._inductor.overrides.lowmem_dropout](args = (%avg_pool2d_9,), kwargs = {training: False})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88716
Approved by: https://github.com/ngimel, https://github.com/jansel, https://github.com/desertfire
2022-11-10 23:37:29 +00:00
500fd65531 [ONNX] Create common ExportTestCase base class (#88145)
Refactor out a common base class `ExportTestCase`, for common things in `setUp`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88145
Approved by: https://github.com/justinchuby, https://github.com/abock, https://github.com/AllenTiTaiWang
2022-11-10 21:51:59 +00:00
20ae19aa1d [ONNX] Improve diagnostic message formatting (#87830)
* Reflect required arguments in the method signature for each diagnostic rule. The previous design accepted an arbitrarily sized tuple, which was hard to use and error-prone.
     ![image](https://user-images.githubusercontent.com/9376104/200381982-d1e905f0-a159-4ef5-8d2e-070524e8f5bf.png)
* Removed `DiagnosticTool` to keep things compact.
* Removed specifying supported rule set for tool(context) and checking if rule of reported diagnostic falls inside the set, to keep things compact.
* Initial overview markdown file.
* Changed the `full_description` definition. The `text` field should no longer be empty, and its markdown should be stored in the `markdown` field.
* Change `message_default_template` to allow only named fields (excluding numeric fields). `field_name` provides clarity on what argument is expected.
* Added `diagnose` api to `torch.onnx._internal.diagnostics`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87830
Approved by: https://github.com/abock
2022-11-10 21:42:17 +00:00
a6610faa93 [ao] qconfig_mapping_utils.py fixing public v private (#87517)
Summary: made _get_object_type_qconfig, _get_module_name_regex_qconfig,
_get_module_name_qconfig, _maybe_adjust_qconfig_for_module_type_or_name,
_get_flattened_qconfig_dict _update_qconfig_for_qat private

Test Plan: python test/test_public_bindings.py

Differential Revision: [D40709279](https://our.internmc.facebook.com/intern/diff/D40709279)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87517
Approved by: https://github.com/jcaip
2022-11-10 21:40:39 +00:00
c1553880de Have kernel names include fused ops (#88624)
- Propagates origin fx nodes through inlining during lowering
- Concatenates op names into kernel name
- Adds config to cap the number of ops in the kernel name so they don't get too long

Caveats:
- The ordering in the name may not match the order that the ops are executed in the kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624
Approved by: https://github.com/anijain2305, https://github.com/jansel
2022-11-10 21:38:06 +00:00
ad2eba802c [ao] fuser_method_mappings.py fixing public v private (#87516)
Summary: made _get_valid_patterns, _DEFAULT_PATTERN_TO_FUSER_METHOD,
_reverse3, _reverse2, _reverse_sequential_wrapper2,
_DEFAULT_OP_LIST_TO_FUSER_METHOD, _sequential_wrapper2 private

Test Plan: python test/test_public_bindings.py

Differential Revision: [D40709281](https://our.internmc.facebook.com/intern/diff/D40709281)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87516
Approved by: https://github.com/jcaip
2022-11-10 21:37:31 +00:00
37b468ac77 [xnnpack][lite-int][on-device] rebuild serialized modules at runtime (#88780)
This is the on-device runtime work. We replace the compile and execute steps from our earlier hacky solution with what will actually run at runtime.

First we rebuild our graph from the serialized flatbuffer string. We also introduce a runtime wrapper that inherits from CustomClassHolder, which allows us to forward the built xnngraph runtime along to our execute function.

Once the subgraph object has been rebuilt, we pass it to the runtime wrapper, which forwards it along to execute.

At execute time we prep the inputs/outputs and invoke the runtime through our runtime wrapper, then forward the results back to the caller.

Differential Revision: [D39413031](https://our.internmc.facebook.com/intern/diff/D39413031/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39413031/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88780
Approved by: https://github.com/digantdesai
2022-11-10 21:35:28 +00:00
de38c87698 Use run_test in MPS (#88829)
Run MPS through run_test to get the disable-test infra, XML file creation (which can then be used for flakiness detection), and reruns

Also added the workflow steps for uploading the xml files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88829
Approved by: https://github.com/malfet, https://github.com/huydhn
2022-11-10 21:32:41 +00:00
1ae772a663 [inductor] Remove import check for fast_flush (#88812)
https://github.com/pytorch/pytorch/pull/88557/ has a guard to make sure that triton's `do_bench` includes the `fast_flush` argument.  Since we've updated Triton to a sufficiently recent revision, we can remove that guard.

Differential Revision: [D41185280](https://our.internmc.facebook.com/intern/diff/D41185280/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88812
Approved by: https://github.com/soumith
2022-11-10 21:12:20 +00:00
3a4e8736ad [xnnpack][on-device] compiler --> executor object (#88779)
#### XNN Compiler Object
This is purely to abstract away the subgraph rebuild from the flatbuffer object. CompileModel returns an executor object which we can use to set up inputs and run forward with.

#### Executorch Considerations
We include ATen/Utils for TORCH_CHECK; this will be changed when moving to Executorch.

Differential Revision: [D40733163](https://our.internmc.facebook.com/intern/diff/D40733163/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88779
Approved by: https://github.com/digantdesai
2022-11-10 21:09:22 +00:00
394b998de2 sub setup.py install -> develop (#88507)
If someone is building the project from source, they're likely a contributor, for whom `develop` will be much more useful. People who want to try the latest and greatest can use the nightlies.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88507
Approved by: https://github.com/malfet
2022-11-10 21:04:38 +00:00
d5e1e2f0fc [xnnpack][on-device] executor class (#88778)
# Executor Class

The Executor object wraps our xnn_runtime object. The ideal flow of this object looks like this:

```
executor.set_inputs(vector<tensor> inputs, vector<tensor> outputs)
executor.forward()
```

This will likely be returned by our delegate's compile and handed over to execute in order to run inference using the xnn runtime.

##### Executorch Considerations
```
#include <ATen/Functions.h>
#include <ATen/Utils.h>
```
These ATen headers are included so we can use at::Tensor when setting the inputs. This will change for Executorch, because we will switch from at::Tensor to whatever tensor abstraction ET uses. It seems to have the same `.data_ptr<float>()` call, so realistically all the logic here will stay the same.

ATen/Utils is used for TORCH_CHECK. We will switch to ET_CHECK_MESSAGE for executorch.

Differential Revision: [D40733121](https://our.internmc.facebook.com/intern/diff/D40733121/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88778
Approved by: https://github.com/digantdesai
2022-11-10 21:01:46 +00:00
29550e2c1d Revert "[Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#88566)"
This reverts commit 48b58930cbfa725ac25a9303d496c76bf983574d.

Reverted https://github.com/pytorch/pytorch/pull/88566 on behalf of https://github.com/huydhn due to This change breaks trunk 48b58930cb
2022-11-10 20:56:30 +00:00
90cf14ddf6 [DataPipe] Deprecating drop_empty_batches from Filter and other functional APIs (#88693)
- Deprecating based on https://github.com/pytorch/data/issues/163

Corresponding PRs from TorchData: https://github.com/pytorch/data/pull/890
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88693
Approved by: https://github.com/NivekT
2022-11-10 19:54:22 +00:00
98ecd06580 Bring Unfold/Fold param doc order in line with code (#88819)
Now the first parameter (if used as a positional argument) is the first that is listed in the docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88819
Approved by: https://github.com/ngimel
2022-11-10 19:29:29 +00:00
1d54ce9d5d [14/N] Refactor _new_process_group_helper() to remove repeated code (#88351)
Changes:
- refactor parts of `_new_process_group_helper()` to remove repeated code

Differential Revision: [D41188274](https://our.internmc.facebook.com/intern/diff/D41188274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88351
Approved by: https://github.com/kwen2501
2022-11-10 19:27:17 +00:00
4bcf2c53e5 Add warnings & regressions info text (#88837)
Add text about what warnings and accuracy regressions dropdowns mean.

Sample: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1310770285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88837
Approved by: https://github.com/anijain2305
2022-11-10 19:22:09 +00:00
3b8245ab12 [LTC] Make ComputePostOrder accept const T pointers (#88773)
Summary:
Since `c10::ArrayRef` now supports `c10::ArrayRef<const T>`, let's restore `ComputePostOrder` to accept `const Node*` again, which is more suitable for the context of the given helpers.

Test Plan:
CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88773
Approved by: https://github.com/JackCaoG
2022-11-10 18:34:19 +00:00
48b58930cb [Inductor] Build FX Linear + Permute Vertical Fusion in Inductor (#88566)
Summary:
Build fx-based linear/matmul/bmm + permute/transpose vertical fusion in Inductor

For an internal Ads model: 1.15x -> 1.36x speedup

Differential Revision: D41071665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88566
Approved by: https://github.com/jansel, https://github.com/jianyuh
2022-11-10 18:32:25 +00:00
d157fca59c Revert "Symintify broadcast_to (#88776)"
This reverts commit 3a09d9a129406a05ca7e82c1438f9aa83019f48d.

Reverted https://github.com/pytorch/pytorch/pull/88776 on behalf of https://github.com/malfet due to Broke functorch/test_aotdispatch on M1, see 3a09d9a129
2022-11-10 18:19:54 +00:00
6bf2776ac1 [FSDP][Perf] Do not call pad in no-padding case (#88769)
- Calling `F.pad()` issues a pad kernel from the CPU even if no padding is needed, which can incur some non-negligible overhead. This PR removes that unnecessary call for the no-padding case (see the sketch after this list).
- This PR also does not zero the newly allocated sharded gradient tensor before the reduce-scatter if `use_orig_params=True` because there is no need: the reduce-scatter fills the tensor anyway, and we do not care about the values in the padding. For `use_orig_params=False`, the padding is exposed to the user, so we preserve the existing semantics of zeroing it. I left a to-do to follow up since we may optimize that.
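A minimal sketch of the no-padding fast path described in the first bullet (names like `maybe_pad` are illustrative, not the FSDP internals):
```python
import torch
import torch.nn.functional as F

def maybe_pad(tensor: torch.Tensor, numel_to_pad: int) -> torch.Tensor:
    # Skip F.pad entirely when nothing needs padding: even a zero-length pad
    # still launches a kernel from the CPU, which is the overhead avoided here.
    if numel_to_pad > 0:
        return F.pad(tensor, [0, numel_to_pad])
    return tensor
```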
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88769
Approved by: https://github.com/zhaojuanmao
2022-11-10 18:18:55 +00:00
d3178465ee [dynamo] VariableTracker.call_method requires a name (#88311)
Summary: as title

Test Plan: Before: N2743445, After: N2748186.  Note there's a new error, but at least we got past the easy one.

Differential Revision: D40938415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88311
Approved by: https://github.com/brad-mengchi
2022-11-10 18:17:23 +00:00
1e4079a476 [nnc] Disable opaque pointers mode in LLVM backend to allow getPointerElementType (#88798)
As of LLVM 15 typed pointers are going away:
https://llvm.org/docs/OpaquePointers.html.  Thus
`getPointerElementType` is no longer legal, since pointers are all
opaque.  I don't totally remember why we use it so prolifically, or
whether there's an easy change to get rid of it, or whether we'd need
a significant refactor to carry around `Type`s alongside `Value`s.

But in any case, NNC is deprecated (see: TorchInductor) and will
hopefully be gone before LLVM 16 is a thing.  For now, we can apply
the hack of turning off opaque pointer mode on the LLVMContext.

Differential Revision: [D41176215](https://our.internmc.facebook.com/intern/diff/D41176215)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88798
Approved by: https://github.com/desertfire
2022-11-10 18:14:02 +00:00
656d0de6c5 Change TORCH_INTERNAL_ASSERT to TORCH_CHECK and add a nice error message (#88804)
Fixes #87672

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88804
Approved by: https://github.com/ezyang
2022-11-10 18:11:32 +00:00
79b049af5e Switch to setup-nvidia action (#88757)
Use the new [setup-nvidia](https://github.com/pytorch/test-infra/blob/main/.github/actions/setup-nvidia/action.yml) action from test-infra. The new action is created so that it can be shared across different PyTorch repos. For examples:

* [pytorch/pytorch](https://github.com/pytorch/pytorch/blob/master/.github/scripts/install_nvidia_utils_linux.sh) (fixed by this PR)
* [pytorch/tau](https://github.com/pytorch/tau/blob/main/.github/workflows/install_nvidia_utils_linux.sh) (fixed by  https://github.com/pytorch/tau/pull/595)
* [pytorch/torchsnapshot](https://github.com/pytorch/torchsnapshot/blob/main/.github/scripts/install_nvidia_utils_linux.sh) (fixed by https://github.com/pytorch/torchsnapshot/pull/130)
* [torch/multiply](https://github.com/pytorch/multipy/blob/main/.github/scripts/install_nvidia_utils_linux.sh) (fixed by https://github.com/pytorch/multipy/pull/264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88757
Approved by: https://github.com/seemethere, https://github.com/atalman
2022-11-10 17:48:16 +00:00
f98edfcc48 Make TorchElastic timer importable on Windows (#88522)
Also, add `torch.distributed` to test imports, so that we would not
regress in the future

Fixes https://github.com/pytorch/pytorch/issues/85427
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88522
Approved by: https://github.com/d4l3k
2022-11-10 17:42:20 +00:00
4b898a7304 Symintify adaptive_avg_pool3d (#88783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88783
Approved by: https://github.com/ezyang
2022-11-10 15:23:54 +00:00
3a09d9a129 Symintify broadcast_to (#88776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88776
Approved by: https://github.com/ezyang
2022-11-10 15:21:50 +00:00
c0ecce15b5 add DisableTorchFunction that matches DisableTorchDispatch (#88219)
Closes #87990. This implements a new disable guard that matches DisableTorchDispatch (it disables all subclasses and modes).
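A hedged usage sketch, assuming the new guard is exposed as `torch._C.DisableTorchFunction` (the `Logged` subclass below is purely illustrative):
```python
import torch

class Logged(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        print("intercepted:", func)  # observe which op was dispatched
        return super().__torch_function__(func, types, args, kwargs or {})

t = torch.randn(2).as_subclass(Logged)
t + 1                                  # goes through Logged.__torch_function__

with torch._C.DisableTorchFunction():  # disables all subclasses and modes
    t + 1                              # no interception inside the guard
```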
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88219
Approved by: https://github.com/ezyang
2022-11-10 14:51:13 +00:00
7f28be10e5 rename DisableTorchFunction to DisableTorchFunctionSubclass (#88218)
First half of #87990. This doesn't change any of the behavior and is just a rename

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88218
Approved by: https://github.com/ezyang, https://github.com/zou3519
2022-11-10 14:51:13 +00:00
3e43ff2794 torchdynamo: add convolution add(relu) inplace fusion kernel (#88048)
This PR adds a convolution add(relu) in-place fusion kernel which works for **other.add_(conv)**.
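The Python-level pattern the new kernel targets, illustrated in eager mode (the fusion itself only kicks in when the model is compiled through TorchDynamo/Inductor):
```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)
other = torch.randn(1, 8, 32, 32)

other.add_(conv(x))  # the in-place add of a convolution result targeted by the fusion
other.relu_()        # optional trailing in-place relu covered by the add(relu) variant
```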

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88048
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-10 13:54:37 +00:00
e6561291b8 add hack to allow hybrid compressed sparse comparison in assertEqual (#88749)
Hybrid sparse CSR tensors can currently not be compared to strided ones since `.to_dense` does not work:

```py
import torch
from torch.testing._internal.common_utils import TestCase

assertEqual = TestCase().assertEqual

actual = torch.sparse_csr_tensor([0, 2, 4], [0, 1, 0, 1], [[1, 11], [2, 12] ,[3, 13] ,[4, 14]])
expected = torch.stack([actual[0].to_dense(), actual[1].to_dense()])
assertEqual(actual, expected)
```

```
main.py:4: UserWarning: Sparse CSR tensor support is in beta state. If you miss a functionality in the sparse tensor support, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at ../aten/src/ATen/SparseCsrTensorImpl.cpp:54.)
  actual = torch.sparse_csr_tensor([0, 2, 4], [0, 1, 0, 1], [[1, 11], [2, 12] ,[3, 13] ,[4, 14]])
Traceback (most recent call last):
  File "/home/philip/git/pytorch/torch/torch/testing/_comparison.py", line 1098, in assert_equal
    pair.compare()
  File "/home/philip/git/pytorch/torch/torch/testing/_comparison.py", line 619, in compare
    actual, expected = self._equalize_attributes(actual, expected)
  File "/home/philip/git/pytorch/torch/torch/testing/_comparison.py", line 706, in _equalize_attributes
    actual = actual.to_dense() if actual.layout != torch.strided else actual
RuntimeError: sparse_compressed_to_dense: Hybrid tensors are not supported

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "main.py", line 10, in <module>
    assertEqual(actual, expected)
  File "/home/philip/git/pytorch/torch/torch/testing/_internal/common_utils.py", line 2503, in assertEqual
    msg=(lambda generated_msg: f"{generated_msg}\n{msg}") if isinstance(msg, str) and self.longMessage else msg,
  File "/home/philip/git/pytorch/torch/torch/testing/_comparison.py", line 1112, in assert_equal
    ) from error

RuntimeError: Comparing

TensorOrArrayPair(
    id=(),
    actual=tensor(crow_indices=tensor([0, 2, 4]),
       col_indices=tensor([0, 1, 0, 1]),
       values=tensor([[ 1, 11],
                      [ 2, 12],
                      [ 3, 13],
                      [ 4, 14]]), size=(2, 2, 2), nnz=4,
       layout=torch.sparse_csr),
    expected=tensor([[[ 1, 11],
         [ 2, 12]],

        [[ 3, 13],
         [ 4, 14]]]),
    rtol=0.0,
    atol=0.0,
    equal_nan=True,
    check_device=False,
    check_dtype=True,
    check_layout=False,
    check_stride=False,
    check_is_coalesced=False,
)

resulted in the unexpected exception above. If you are a user and see this message during normal operation please file an issue at https://github.com/pytorch/pytorch/issues. If you are a developer and working on the comparison functions, please except the previous error and raise an expressive `ErrorMeta` instead.
```

This adds a temporary hack to `TestCase.assertEqual` to enable this. Basically, we go through the individual CSR subtensors, call `.to_dense()` on them, and stack everything back together. I opted not to do this in the common machinery, so that users are not affected by this (undocumented) hack.
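A rough sketch of that densify-and-stack idea, mirroring how `expected` is built in the snippet above (the helper name is illustrative):
```python
import torch

def hybrid_csr_to_dense(t):
    # Densify each sub-tensor along the leading dimension and stack the results
    # back into a single strided tensor for comparison.
    return torch.stack([t[i].to_dense() for i in range(t.shape[0])])
```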

I also added an xfailed test that will trigger as soon as the behavior is supported natively so we don't forget to remove the hack when it is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88749
Approved by: https://github.com/mruberry, https://github.com/pearu
2022-11-10 13:44:45 +00:00
7c353eb395 [MPS] Fix softplus (#88555)
1. Fixes #87780
2. Fixes mps graph cache issue
3. Adds proper tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88555
Approved by: https://github.com/kulinseth
2022-11-10 09:40:08 +00:00
7ad87f63e2 Support src_mask and src_key_padding_mask for Better Transformer (#88488)
Fixes T135842750 (follow-up for #87377)

## Description

At present, having both `src_key_padding_mask` and `src_mask` at the same time is not supported on the fastpath in Transformer and Multi-Head Attention.

This PR enables using both masks on the fastpath on CPU and GPU: if both masks are passed, we merge them into a 4D mask in Python and change mask type to 2 before passing downstream.
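A rough sketch of that merging step (not the PR's actual helper), assuming boolean masks where `True` means "masked out" and an explicit `num_heads`:
```python
import torch

def merge_masks(src_mask, src_key_padding_mask, num_heads):
    # src_mask: (L, S) bool; src_key_padding_mask: (N, S) bool
    N, S = src_key_padding_mask.shape
    L = src_mask.shape[0]
    attn = src_mask.view(1, 1, L, S).expand(N, num_heads, L, S)
    pad = src_key_padding_mask.view(N, 1, 1, S).expand(N, num_heads, L, S)
    merged = attn | pad  # (N, num_heads, L, S) -> "mask type 2"
    return torch.zeros(merged.shape).masked_fill(merged, float("-inf"))

mask = merge_masks(torch.triu(torch.ones(5, 5), 1).bool(),
                   torch.zeros(2, 5, dtype=torch.bool), num_heads=4)
print(mask.shape)  # torch.Size([2, 4, 5, 5])
```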

Downstream processing in native code is not changed, as it already supports 4D mask. Indeed, it is done depending on the device:
- on CUDA, by `SoftMax.cu::masked_softmax_cuda`. When mask type is 2, it calls either `dispatch_softmax_forward` -> `softmax_warp_forward` or `at::softmax` (depending on the input size). In both cases 4D mask is supported.
- on CPU, by `SoftMax.cpp::masked_softmax_cpp`. It calls `hosted_softmax` which supports 4D mask.

## Tests
- Extended `test_mask_check_fastpath` to check that fast path is indeed taken in Transformer when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path_mock` to check that fast path is taken in MHA when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path` to check that fast and slow paths give the same result when two masks are passed in MHA
- `test_masked_softmax_mask_types` now covers mask type 2
- `test_transformerencoderlayer_fast_path` (CPU smoke test) is expanded to the case of both masks provided simultaneously
- `test_masked_softmax_devices_parity` checks that mask type 2 is accepted by CPU and CUDA paths

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88488
Approved by: https://github.com/mikekgfb
2022-11-10 08:12:56 +00:00
dcefea2706 [caffe2][tourch] Optimize BatchBoxCox (#87585)
Differential Revision: D40215424

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87585
Approved by: https://github.com/hyuen
2022-11-10 06:11:05 +00:00
e87c79ca0c [vision hash update] update the pinned vision hash (#88742)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88742
Approved by: https://github.com/pytorchbot
2022-11-10 03:05:00 +00:00
cf04b36ce8 [dynamo] fixes dict changed during runtime error (#87526)
Fixes https://github.com/pytorch/torchdynamo/issues/1744

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87526
Approved by: https://github.com/ezyang
2022-11-10 01:57:17 +00:00
0b8889c724 Do not flag models in dashboard due to NaN values (#88792)
Title.

Tested by running `python benchmarks/dynamo/runner.py --output-dir ../test-dynamo-runner-logs-4 --training --visualize_logs` on a copy of a recent set of logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88792
Approved by: https://github.com/anijain2305
2022-11-10 01:48:04 +00:00
6e3555edea Add absolute latency to dashboard (#88790)
Add absolute latency to dashboard, as requested by https://github.com/pytorch/torchdynamo/issues/1833#issuecomment-1302742914

Tested by setting `run.sh` to
```
# Setup the output directory
rm -rf ../test-dynamo-runner-logs-7/
mkdir ../test-dynamo-runner-logs-7/

# Commands for torchbench for device=cuda, dtype=float32 for training and for performance testing
python benchmarks/dynamo/torchbench.py --performance --float32 -dcuda --output=../test-dynamo-runner-logs-7//inductor_torchbench_float32_training_cuda_performance.csv --training --inductor   --no-skip --dashboard --only mobilenet_v2 --cold_start_latency

# Commands for torchbench for device=cuda, dtype=float32 for training and for accuracy testing
python benchmarks/dynamo/torchbench.py --accuracy --float32 -dcuda --output=../test-dynamo-runner-logs-7//inductor_torchbench_float32_training_cuda_accuracy.csv --training --inductor   --no-skip --dashboard --only mobilenet_v2
```
and running `python benchmarks/dynamo/runner.py --output-dir ../test-dynamo-runner-logs-7/ --dashboard-archive-path /data/home/williamwen/dynamo-runner-logs-copy --training --run --compilers inductor --flag-compilers inductor --suites torchbench --update-dashboard`  (need to comment out the `generate_commands` line and change the github issue ID from 681 to something else).

Sample comment: https://github.com/pytorch/torchdynamo/issues/1831#issuecomment-1309645562

NOTE: this change breaks processing old logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88790
Approved by: https://github.com/anijain2305
2022-11-10 01:45:52 +00:00
2381548071 add stride constraints to fallbacks (#88534)
Add stride/contiguity constraints to fallbacks so that inputs will be in the right stride permutation for the fallback kernel.

Improves perf of coat_lite_mini from 1.48415536054865 -> 2.010956856330101.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88534
Approved by: https://github.com/ngimel
2022-11-10 01:13:44 +00:00
fb5c6ae61f [cuDNN][cuDNN V8 API] Match V7 API behavior for channels_last stride coercion for cuDNN (#88699)
For ConvNeXt failure in https://github.com/pytorch/torchdynamo/issues/1833

cuDNN V7 has some stride "fixing" code to coerce cuDNN to use channels-last in cases allowed by size-1 strides; this code was omitted in V8, which seems to lead to performance regressions. This PR patches the same fix into V8.

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88699
Approved by: https://github.com/ngimel
2022-11-10 00:49:07 +00:00
59115e6139 disable test that times out in fbcode (#88758)
Test Plan: Rely on CI.

Differential Revision: D41162966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88758
Approved by: https://github.com/zou3519
2022-11-10 00:28:02 +00:00
16bd363863 Fix dynamo dashboard passrate denominator (#88777)
Before the dashboard improvements, the passrate table looked like this:
~~~
+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 98%, 54/55 | 100%, 43/43 | 100%, 61/61 |
|       aot_eager        | 95%, 52/55 | 100%, 43/43 | 97%, 59/61  |
|     aot_cudagraphs     | 75%, 41/55 | 49%, 21/43  | 38%, 23/61  |
|    nvprims_nvfuser     | 71%, 39/55 |  16%, 7/43  | 48%, 29/61  |
|        inductor        | 87%, 48/55 | 93%, 40/43  | 95%, 58/61  |
| inductor_no_cudagraphs | 93%, 51/55 | 93%, 40/43  | 95%, 58/61  |
+------------------------+------------+-------------+-------------+
~~~
After the change, the table looked like:
~~~
+------------------------+------------+-------------+-------------+
|        Compiler        | torchbench | huggingface | timm_models |
+------------------------+------------+-------------+-------------+
|         eager          | 82%, 53/65 | 84%, 43/51  | 82%, 61/74  |
|       aot_eager        | 83%, 54/65 | 84%, 43/51  | 82%, 61/74  |
|     aot_cudagraphs     | 69%, 45/65 | 65%, 33/51  | 38%, 28/74  |
|    nvprims_nvfuser     | 48%, 31/65 | 78%, 40/51  | 26%, 19/74  |
|        inductor        | 75%, 49/65 | 82%, 42/51  | 81%, 60/74  |
| inductor_no_cudagraphs | 82%, 53/65 | 82%, 42/51  | 82%, 61/74  |
+------------------------+------------+-------------+-------------+
~~~
There is no actual regression, but the passrate is lower since the denominator is wrong. Check fix by running locally (e.g. `python benchmarks/dynamo/runner.py --output-dir ../test-dynamo-runner-logs-5 --training --visualize_logs`) and comparing passrate table output to previously correct one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88777
Approved by: https://github.com/anijain2305
2022-11-10 00:26:58 +00:00
4f18739bf0 Fix Docker image generation (#88741)
Pass install channel when building nightly images
Pass `TRITON_VERSION` argument to install triton for nightly images

Fix `generate_pytorch_version.py` to work with unannotated tags and avoid failures like the following:
```
% git checkout nightly
% ./.github/scripts/generate_pytorch_version.py

fatal: No annotated tags can describe '93f15b1b54ca5fb4a7ca9c21a813b4b86ebaeafa'.
However, there were unannotated tags: try --tags.
Traceback (most recent call last):
  File "/Users/nshulga/git/pytorch/pytorch-release/./.github/scripts/generate_pytorch_version.py", line 120, in <module>
    main()
  File "/Users/nshulga/git/pytorch/pytorch-release/./.github/scripts/generate_pytorch_version.py", line 115, in main
    print(version_obj.get_release_version())
  File "/Users/nshulga/git/pytorch/pytorch-release/./.github/scripts/generate_pytorch_version.py", line 75, in get_release_version
    if not get_tag():
  File "/Users/nshulga/git/pytorch/pytorch-release/./.github/scripts/generate_pytorch_version.py", line 37, in get_tag
    dirty_tag = subprocess.check_output(
  File "/Users/nshulga/miniforge3/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/Users/nshulga/miniforge3/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['git', 'describe']' returned non-zero exit status 128.
```
After the change, nightly is reported as (due to an autolabelling issue, which should be fixed by https://github.com/pytorch/test-infra/pull/1047):
```
 % ./.github/scripts/generate_pytorch_version.py
ciflow/inductor/26921+cpu
```

Even for tagged release commits version generation was wrong:
```
% git checkout release/1.13
% ./.github/scripts/generate_pytorch_version.py
ciflow/periodic/79617-4848-g7c98e70d44+cpu
```
After the fix, it is as expected:
```
% ./.github/scripts/generate_pytorch_version.py
1.13.0+cpu
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88741
Approved by: https://github.com/dagitses, https://github.com/msaroufim
2022-11-10 00:06:31 +00:00
7006ac6ee5 [Dynamo] Fix Tensor.T trace (#88642)
Summary:

Dynamo treated Tensor.T as a GetAttr and didn't propagate "example_value".

Via https://pytorch.org/docs/stable/tensors.html#torch.Tensor.T
> If n is the number of dimensions in x, x.T is equivalent to
> x.permute(n-1, n-2, ..., 0).
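A small illustrative check of that equivalence (not part of the PR):
```python
import torch

x = torch.rand(3, 4)
assert torch.equal(x.T, x.permute(1, 0))  # for 2-D, .T is a plain transpose
print(x.T.shape)  # torch.Size([4, 3])
```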

Fixes pytorch/torchdynamo#1476

Test Plan:

pytest test/dynamo/test_functions.py::FunctionTests::test_T

Differential Revision: [D41130306](https://our.internmc.facebook.com/intern/diff/D41130306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88642
Approved by: https://github.com/tugsbayasgalan, https://github.com/yanboliang, https://github.com/jansel
2022-11-09 23:44:30 +00:00
c7fc710459 Revert "[3/n] Thread PG: add threaded PG implementation (#88627)"
This reverts commit 6dd081846e3ae6192b375d658d4b4f3d6bd9df6e.

Reverted https://github.com/pytorch/pytorch/pull/88627 on behalf of https://github.com/huydhn due to This breaks one macos m1 test 6dd081846e in trunk. PR also fails with the same issue so I think trymerge code has a bug here letting this one merged
2022-11-09 22:38:41 +00:00
6fe4ccc7cb [ao] qconfig.py fix public v private (#87515)
Summary: made is_reuse_input_qconfig, _activation_is_memoryless,
_partial_wrapper_equals, _obs_or_fq_ctr_equals,
_add_module_to_qconfig_obs_ctr, _assert_valid_qconfig private

Test Plan: python test/test_public_bindings.py

Differential Revision: [D40709280](https://our.internmc.facebook.com/intern/diff/D40709280)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87515
Approved by: https://github.com/jcaip
2022-11-09 22:30:03 +00:00
3a3500fa08 [13/N] Update gather with CPU/CUDA implementations (#86409)
Differential Revision: [D40181612](https://our.internmc.facebook.com/intern/diff/D40181612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86409
Approved by: https://github.com/kwen2501
2022-11-09 22:11:40 +00:00
1af9b38a90 Symintify embedding_sparse_backward (#88746)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88746
Approved by: https://github.com/ezyang
2022-11-09 22:05:09 +00:00
b7aa22d6db [fx] Fix GraphModule.print_readable() (#88730)
Summary: `__nested_code()` seems to have been removed.

Test Plan: CI

Differential Revision: D41149662

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88730
Approved by: https://github.com/SherlockNoMad
2022-11-09 21:39:48 +00:00
6dd081846e [3/n] Thread PG: add threaded PG implementation (#88627)
Summary: After the previous 2 diffs, finally we can add the threaded ProcessGroup implementation.

Test Plan: TBD

Reviewed By: XilunWu

Differential Revision: D40992593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88627
Approved by: https://github.com/XilunWu, https://github.com/H-Huang
2022-11-09 20:51:11 +00:00
93d3bd626e Revert "[primTorch] Improve narrow and narrow_copy: refs, tests, docs (#87045)"
This reverts commit aa8279bcb8687e025a666e18828a436eb7ef7b45.

Reverted https://github.com/pytorch/pytorch/pull/87045 on behalf of https://github.com/izaitsevfb due to BC-breaking change, D41161182
2022-11-09 20:48:32 +00:00
8523c45717 Delete stub file to enable mypy check (#4649) (#88701)
Summary:
X-link: https://github.com/facebookresearch/detectron2/pull/4649

Context in https://fburl.com/4irjskbe

This change deletes distributed.pyi, so that lintrunner will run mypy on distributed.py for typing check.

Test Plan: CI

Differential Revision: D41028360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88701
Approved by: https://github.com/zhaojuanmao
2022-11-09 20:29:34 +00:00
133e61af7a OpOverload is_view (#88722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88722
Approved by: https://github.com/ezyang
2022-11-09 19:03:12 +00:00
55df18e3da [12/N] Update scatter with CPU/CUDA implementations (#86408)
Differential Revision: [D40181613](https://our.internmc.facebook.com/intern/diff/D40181613)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86408
Approved by: https://github.com/kwen2501
2022-11-09 18:40:25 +00:00
3a1bdfee67 skip environment collection test in fbcode (#88744)
Summary: This runs pip, which we don't have in the fbcode environment.

Test Plan: Rely on CI.

Differential Revision: D41156589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88744
Approved by: https://github.com/zou3519
2022-11-09 18:20:04 +00:00
de53d4143a Fix TorchInductor benchmarking in fbcode (#88689)
Summary: Makes the C++ TorchInductor benchmarking work in fbcode plus some minor fixed to enable that.

Test Plan: Test added

Differential Revision: D41045910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88689
Approved by: https://github.com/soumith
2022-11-09 18:13:06 +00:00
c4a3aa8fe7 [vulkan] Add option for buffer representations in vTensor (#87622)
This diff adds the option to use a Buffer to store data for a `vTensor` by passing `StorageType::BUFFER` to the constructor of `vTensor`. To enable this change, the construction of `vTensor` and `vTensorStorage` had to be slightly refactored to properly support strides. To summarize the changes:

* `vTensorStorage` now contains no Tensor metadata (such as tensor sizes, strides, and `TensorOptions`) - it now only contains the image extents (if texture storage is used) and the buffer length. Tensor metadata is now managed by `vTensor`. The reason for this is to allow multiple `vTensor` objects to point to the same `vTensorStorage` but with different metadata which may be a useful feature now that Buffer storage is enabled.
* `vTensor` will now compute the strides upon construction based on the requested sizes and memory layout if Buffer storage is requested. Previously, strides were faked by setting them all to 0 as strides do not apply to image textures (this behavior is preserved for texture storage).

Differential Revision: [D40604163](https://our.internmc.facebook.com/intern/diff/D40604163/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87622
Approved by: https://github.com/digantdesai
2022-11-09 17:59:49 +00:00
d81797e845 Meta function for aten.sort and aten.scatter* (#88705)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88705
Approved by: https://github.com/ezyang
2022-11-09 17:47:14 +00:00
100b55637b Mark dynamo torchbench dlrm as unsupported (#88712)
- DLRM requires special configuration of embedding layers which are sparse
  and not compatible with DDP.
- I could mark the embedding params as ignored in DDP
  to make the benchmark pass, but this isn't a representative benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88712
Approved by: https://github.com/ezyang
2022-11-09 17:23:56 +00:00
eb9b156019 [fix] MathBits: serialization (#88182)
Fixes #81690

TODO:

* [x] C++ Unpickler Fix (locally tested pickled in Python and unpickled in C++)
* [x] C++ Pickler Fix (locally tested pickled in C++ and unpickled in Python)
* [x] Do quant_tensor, sparse_tensor, etc require similar changes? (Sparse and Quant don't need this)
* [x] Add Comments
* [x] How to make sure C++ and Python are in sync? (Functions in `pickler.h` help in getting and setting Tensor Metadata (math-bits for now) on a tensor. They are the only place which should handle this.)

Notes:
Quantized tensors don't support complex dtypes, and for float they segfault with `_neg_view`: https://github.com/pytorch/pytorch/issues/88484

Sparse Tensor:
```python
>>> a = torch.tensor([[0, 2.], [3j, 0]]).to_sparse()
>>> a.conj().is_conj()
False
>>> a._neg_view()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: Cannot access storage of SparseTensorImpl
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88182
Approved by: https://github.com/ezyang, https://github.com/anjali411
2022-11-09 17:15:12 +00:00
525fe53aa4 [BE] Delete push_nightly_docker_ghcr (#88748)
As it seems to duplicate the functionality of `docker-release.yml` and has not produced a valid build in the last 16 days, according to https://github.com/pytorch/pytorch/actions/workflows/push_nightly_docker_ghcr.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88748
Approved by: https://github.com/seemethere
2022-11-09 16:13:56 +00:00
f11f0e4a03 [inductor] Handle nested tuple/list output in fallback kernel (#88495)
Summary: Currently the fallback kernel in inductor assumes its output is either a tensor or a tuple/list of tensors. This PR makes it handle more generic output data structures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88495
Approved by: https://github.com/jansel
2022-11-09 15:50:45 +00:00
3150c9dc6f extract out the clean workspace test to its own file (#88682)
Summary:
This test relies on what the root workspace is before any other code is run. However, some of the test cases change it. If the order in which the tests are run is randomized, then this test can fail when it runs after one of them.

Having it on its own ensures that it always sees a pristine state.

Test Plan:
Verified locally and confirmed in internal and external CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88682
Approved by: https://github.com/r-barnes, https://github.com/malfet
2022-11-09 13:48:57 +00:00
c19bae9f84 Add SherlockNoMad to symbolic-shapes reviewer list (#88739)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88739
Approved by: https://github.com/anjali411
2022-11-09 13:20:19 +00:00
44de7cdbc4 Add voznesenskym to symbolic-shapes group, move wconstab to listener (#88593)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88593
Approved by: https://github.com/anjali411
2022-11-09 13:11:16 +00:00
c86cc68d23 Mark diag.out composite (#88670)
Its implementation just redispatches, so it works for more than CPU/CUDA.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88670
Approved by: https://github.com/anjali411
2022-11-09 12:59:07 +00:00
69b2352236 Add min cut partitioner for AOT+nvFuser (#88204)
Here we mark most of `torch.ops.nvprims` as something that can be recomputed in the backward passes (and hopefully fused).

TODO:
- [x] Add a test after https://github.com/pytorch/pytorch/pull/88186 is merged

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88204
Approved by: https://github.com/jjsjann123, https://github.com/jansel
2022-11-09 12:56:55 +00:00
ff7c5b0df8 Changing as_strided_scatter to deterministic inputs (#85583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85583
Approved by: https://github.com/mruberry
2022-11-09 12:40:03 +00:00
fca6ed02b9 [Inductor] fix c++ compile error with masked float value init (#88298)
Fixes #88201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88298
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-09 10:40:25 +00:00
652af5ec15 upsample_*.vec ops are now CompositeImplicit (#85638)
These ops were previously CompositeExplicit, but that was not really necessary.
See discussion in https://github.com/pytorch/pytorch/issues/85405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85638
Approved by: https://github.com/ezyang, https://github.com/lezcano, https://github.com/malfet, https://github.com/jansel
2022-11-09 09:58:04 +00:00
aa8279bcb8 [primTorch] Improve narrow and narrow_copy: refs, tests, docs (#87045)
Fixes #87019.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87045
Approved by: https://github.com/mruberry
2022-11-09 09:19:28 +00:00
f6192b75c6 [Quant] Support lowering of channel shuffle in FX (#83731)
## Description
Support lowering of channel shuffle in FX by adding its module and functional op to `is_copy_node` list in `torch/ao/quantization/fx/_lower_to_native_backend.py`
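For context, an illustrative example of the op being lowered; channel shuffle only regroups channels and leaves values unchanged, which is why it can be handled as a "copy" node:
```python
import torch
import torch.nn as nn

x = torch.arange(4, dtype=torch.float).view(1, 4, 1, 1)  # channels 0, 1, 2, 3
shuffle = nn.ChannelShuffle(2)
print(shuffle(x).flatten().tolist())  # [0.0, 2.0, 1.0, 3.0]: same values, new order
```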

## Validation
UTs added to test
- correctness of quantized `ChannelShuffle` module.
- FX lowering of `ChannelShuffle` module and functional `channel_shuffle`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83731
Approved by: https://github.com/jerryzh168
2022-11-09 08:08:11 +00:00
ab9a19a95b [BE] Move setup-ssh step ahead of clone PyTorch (#88715)
It allows one to SSH in faster rather than having to wait for the repo clone to finish.

I.e. right now one usually has to wait a few minutes before the PyTorch clone is finished, but with this change you can SSH in ahead of time (thanks to `setup-ssh` being a composite action).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88715
Approved by: https://github.com/clee2000, https://github.com/izaitsevfb
2022-11-09 06:55:22 +00:00
a7420d2ccb Hopper (sm90) support (#87736)
Essentially a followup of #87436

CC @xwang233 @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87736
Approved by: https://github.com/xwang233, https://github.com/malfet
2022-11-09 01:49:50 +00:00
19d7941e37 Fix Python-bound function signature (torch._C.Graph.addInput) (#88528)
In pytorch/torch/_C/__init__.pyi, Graph.addInput has signature
```python
  def addInput(self, name: str) -> Value: ...
```
which doesn't match the corresponding function
```cpp
  Value* addInput(const std::string& name = "") {
    return block_->addInput(name);
  }

```

in python_ir.cpp. This PR aligns the bound function on both the C++ and Python sides. Without this PR, mypy will complain whenever a change contains some calls to `addInput`; for example,
![image](https://user-images.githubusercontent.com/3524474/200092086-429b8d63-9321-4d03-b0d6-f4c9bd361756.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88528
Approved by: https://github.com/davidberard98
2022-11-09 01:31:45 +00:00
f0e6cea2ed Meta registrations for inplace operators (#88678)
Also, handle non-default alpha correctly.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88678
Approved by: https://github.com/SherlockNoMad, https://github.com/albanD
2022-11-09 01:27:01 +00:00
a880ddc164 Meta implementation for unsqueeze_ (#88675)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88675
Approved by: https://github.com/SherlockNoMad
2022-11-09 01:27:01 +00:00
1dab35ca1b Meta implementation for bernoulli (#88676)
For some reason bernoulli uses legacy memory format, see linked issue.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88676
Approved by: https://github.com/SherlockNoMad
2022-11-09 01:26:58 +00:00
6be426ca1a Update gloo submodule (#88530)
Also, add an explicit cudart dependency to `torch_cuda` if Kineto is used with GPU support (it used to be somehow inherited from a wrong `gloo` setup)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88530
Approved by: https://github.com/osalpekar
2022-11-09 01:04:32 +00:00
08b2a251e1 [export] Preserve meta["val"] on placeholders in dynamo.export(). (#88651)
Summary:
Today when we transform the captured graph in the last step of export(aten_graph=True), we construct a new graph which doesn't have all the metadata to be preserved, for example node.meta["val"].
meta["val"] is important for writing passes and analyses on the graph later in the pipeline, so we want to preserve it on placeholder nodes.

Test Plan: test_export.py:test_export_meta_val

Differential Revision: D41110864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88651
Approved by: https://github.com/tugsbayasgalan, https://github.com/jansel
2022-11-09 01:02:09 +00:00
5f876bfdc5 Reduce the number of shards inductor uses for model tests (#88610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88610
Approved by: https://github.com/huydhn
2022-11-09 00:54:54 +00:00
9f58e027a9 Add implementation for irregular dimension selection for nested tensors. (#88585)
Summary: This diff modifies the implementation of the select operator so slices of the irregular dimension can be selected (e.g. nt[:,0,:]).

Test Plan:
Added new unit tests to test that the new functions work as intended (see them in diff). To test,
`buck test mode/dev-nosan //caffe2/test:nested`

Differential Revision: D41083993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88585
Approved by: https://github.com/cpuhrsch
2022-11-09 00:19:38 +00:00
87238e6491 [nn] add remove_duplicate flag to named_parameters (#759) (#88090)
Summary:
X-link: https://github.com/pytorch/torchrec/pull/759

Since the remove_duplicate flag was added to named_buffers in D39493161 (c12f829cce), this adds the same flag to named_parameters

Test Plan:
python test/test_nn.py -k test_buffers_and_named_buffers

OSS Tests

Differential Revision: D40801899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88090
Approved by: https://github.com/albanD
2022-11-09 00:09:20 +00:00
cef13ebea0 [Profiler] Memory profiler part 1: Gradient identification (#86802)
There are multiple ways to identify that a Tensor is a gradient (a subset of which also give additional context). So to start off I've made a utility to handle that determination.

Differential Revision: [D39920730](https://our.internmc.facebook.com/intern/diff/D39920730/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86802
Approved by: https://github.com/chaekit
2022-11-08 23:53:13 +00:00
c0e6b4329f [dynamo] only error out on nested fx trace if dynamo is optimizing (#88640)
I think this is the final resolution to the issue caused by
https://github.com/pytorch/pytorch/pull/87797. The nvfuser issue that PR
tripped up was because, even though we're correctly disabling
torchdynamo via a `DisableContext`, the nested fx trace check was still
firing. This PR properly narrows it to only fire if we're not disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88640
Approved by: https://github.com/yf225
2022-11-08 23:52:21 +00:00
a02ea655b5 Slight fix in error message for check_for_seq_len_1_nested_tensor (#88690)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88690
Approved by: https://github.com/cpuhrsch
2022-11-08 22:14:21 +00:00
6e6f929b2c [Profiler] Restructure inputs and capture TensorLists. (#87825)
This PR unifies and rationalizes some of the input representation in Result. The current approach of storing separate types in separate vectors is tedious for two types (Tensors and scalars), but would be even more annoying with the addition of TensorLists. A similar disconnection exists with sizes and strides which the user is also expected to zip with tensor_metadata.

I simplified things by moving inputs to a variant and moving sizes and strides into TensorMetadata. This also forced collection of sizes and strides in python tracer which helps to bring it in line with op profiling. Collection of TensorLists is fairly straightforward; `InputOutputEncoder` already has a spot for them (I actually collected them in the original TorchTidy prototype) so it was just a matter of plumbing things through.

Differential Revision: [D40734451](https://our.internmc.facebook.com/intern/diff/D40734451/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87825
Approved by: https://github.com/slgong-fb, https://github.com/chaekit
2022-11-08 21:48:43 +00:00
e132c45fd0 [Profiler] Handle ABA for TensorImpl* when assigning IDs (#87133)
Part of the current ID assignment algorithm groups any Storages which are associated with the same TensorImpl*. This isn't sound (which I knew but deferred until it actually became a problem) because pointers can be reused by different objects (the ABA problem).

ABA is easy to handle for Storage because we see allocations and frees, but ~TensorImpl is very hot and cannot tolerate profiling code without significant increases in overhead.

This PR narrows the conditions under which ID assignment will join on TensorImpl*. Two storages which are associated with the same TensorImpl* are grouped IFF they were live at the same time. (Note that this still allows storages with disjoint lifetimes to be joined transitively through a third storage which overlaps with both.)

The need for this PR arose in memory profiling. The Python argument parser creates short lived Tensors for (some) scalar arguments which triggers this issue. (Which is stochastic and platform dependent since optimizations like reusing recently freed allocations is implementation defined.) Spurious connections can lead to confusing and long range interactions when building up the memory profile, so it makes sense to harden ID assignment to avoid any issues.

Differential Revision: [D40445121](https://our.internmc.facebook.com/intern/diff/D40445121/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87133
Approved by: https://github.com/slgong-fb, https://github.com/chaekit
2022-11-08 21:48:43 +00:00
078c25df13 [MPS][BE] Code cleanup (#88529)
Various code cleanup in MPS operations:
 - Per @kulinseth suggestion move `mpsSupportsCumsum` to `MPSDevice.h` and rename it to
   `is_macos_13_or_newer()`
 - Move Ventura MPSGraph new operators to `MPSGraphVenturaOps.h` header
 - Use `LookupAs` and `CreateCachedGraphAs` to make code more compact
 - Formatting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88529
Approved by: https://github.com/kulinseth
2022-11-08 21:10:07 +00:00
1d82eba98b PatternMatcher supports matching list-typed args (#88656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88656
Approved by: https://github.com/jerryzh168
2022-11-08 21:05:18 +00:00
8e2627d42f [inductor] Fix aten.fmod lowering (#88602)
Currently the lowering for aten.fmod promotes integral types to float and calls `tl.libdevice.fmod`, whereas the ATen behavior is to use the modulo operator.
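For reference, the eager semantics the lowering should match: `torch.fmod` on integer tensors keeps the integer dtype and uses C-style truncated remainder (contrast with `torch.remainder`):
```python
import torch

a = torch.tensor([7, -7, 7, -7])
b = torch.tensor([3, 3, -3, -3])
print(torch.fmod(a, b))        # tensor([ 1, -1,  1, -1]) - sign follows the dividend
print(torch.fmod(a, b).dtype)  # torch.int64, no float promotion
print(torch.remainder(a, b))   # tensor([ 1,  2, -2, -1]) - Python-style modulo, for contrast
```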

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88602
Approved by: https://github.com/jansel
2022-11-08 20:27:36 +00:00
f556d73574 [torch] Implement aten::native_batch_norm.out for CPU (#88604)
Summary:
Implement `native_batch_norm.out` for CPU. Reuses the main logic for `native_batch_norm` but extracts out the Tensor creation logic for outputs. There are 3 outputs: `output`, `save_mean` and `save_var`. `batch_norm_cpu` calls `batch_norm_cpu_update_stats_template` to get `save_mean` and `save_var`, and then calls into `batch_norm_cpu_transform_input_template` which initializes `output`.

In the implementation of `batch_norm_cpu_out`, I did the following:

* Let `batch_norm_cpu_transform_input_template` take another argument, `output`, and ask the call sites to pass in an output Tensor.

* Overload `batch_norm_cpu_update_stats_template` to take `save_mean` and `save_var`, and ask the call sites to pass in those Tensors.

* In `batch_norm_cpu_out`, pass `output`, `save_mean` and `save_var` all the way to our new `batch_norm_cpu_transform_input_template` and `batch_norm_cpu_update_stats_template`.

* In `batch_norm_cpu`, prepare for these outputs and call `batch_norm_cpu_out`.

Test Plan: Enable unit tests for `native_batch_norm.out`.

Differential Revision: D40992036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88604
Approved by: https://github.com/iseeyuan, https://github.com/jjsjann123
2022-11-08 19:53:11 +00:00
3e30a9ea1c Fix CUDA_MAX_THREADS_PER_SM for sm_87 (#88644)
#88326
CC @ngimel @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88644
Approved by: https://github.com/ngimel
2022-11-08 19:44:23 +00:00
6bb7f4f29f Minor error message improvements on meta functions (#88677)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88677
Approved by: https://github.com/SherlockNoMad
2022-11-08 19:16:29 +00:00
d98a884b33 Revert "[cuDNN] (re-open) Enable cuDNN Frontend v8 API by Default (#87669)"
This reverts commit 3c6bddc3f6347ce7d1ed33aee94cdaa953cbc387.

Reverted https://github.com/pytorch/pytorch/pull/87669 on behalf of https://github.com/eqy due to investigating convnext benchmark regressions
2022-11-08 19:04:25 +00:00
5eecfcf5f3 Run libtorch trunk build on linux.4xlarge (#88683)
Add optional `runner`  input to `_linux-build.yml`
Move `libtorch-linux-bionic-cuda11_6-py3_7-gcc7-build` to `linux.4xlarge` as it occasionally OOMS on 2xlarge one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88683
Approved by: https://github.com/atalman, https://github.com/weiwangmeta
2022-11-08 18:52:56 +00:00
eaf4fe3d2b Most recently used cache management for TorchDynamo (#88076)
Modify the lookup procedure for TorchDynamo caches to keep the most recently used cache entry at the head of the singly linked list, which may improve the probability of a cache hit.
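A generic move-to-front sketch of the idea (plain Python, not the actual C cache code):
```python
class CacheEntry:
    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

def lookup(head, key):
    """Return (new_head, value); on a hit the entry is moved to the front."""
    prev, node = None, head
    while node is not None:
        if node.key == key:
            if prev is not None:      # unlink and re-insert at the head
                prev.next = node.next
                node.next = head
                head = node
            return head, node.value
        prev, node = node, node.next
    return head, None                 # miss: list order is unchanged
```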

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88076
Approved by: https://github.com/jansel
2022-11-08 18:46:59 +00:00
1b5373fc83 Mark as_strided_ as supporting SymInt in C++ (#88674)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88674
Approved by: https://github.com/anjali411
2022-11-08 18:45:05 +00:00
dba887766b Revert "torchdynamo support modules() for nn_module (#88023)"
This reverts commit 96104c7b7e908634a473792b6b2e9279d79d23d8.

Reverted https://github.com/pytorch/pytorch/pull/88023 on behalf of https://github.com/ydwu4 due to [Internal breakages] https://www.internalfb.com/intern/sandcastle/job/9007200067589062/
2022-11-08 18:37:48 +00:00
860e354d1c Support diag_embed.out decomposition (#88671)
This is a little tricky: there is a diag_embed.out, but it's not bound in Python because it's autogenerated; see https://github.com/pytorch/pytorch/issues/88598.
So I can't "just" add the out variant to the ref, as this makes it inconsistent with the torch API. To work around this, I mark the ref as supporting out, but not the original function.

This is useful to do, because it means that diag_embed.out now supports
symbolic shapes.  However, this cannot be easily tested because
I can't mark the out variant as being supported in the normal OpInfo test.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88671
Approved by: https://github.com/mruberry
2022-11-08 18:28:36 +00:00
3f6a560184 Correctly test that dtype/device match in generated .out kernels for composites (#88672)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88672
Approved by: https://github.com/anjali411
2022-11-08 18:28:36 +00:00
245144a636 Propagate layout and pin memory in randint to inner constructor (#88673)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88673
Approved by: https://github.com/anjali411
2022-11-08 18:22:30 +00:00
96104c7b7e torchdynamo support modules() for nn_module (#88023)
Differential Revision: D40820879

This diff allows models to call self.modules() during dynamo tracing.
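
A rough illustration of the kind of model code this enables, assuming the `torch._dynamo.optimize` entry point of this era (the backend name and module here are just for demonstration):

```python
import torch
import torch._dynamo as dynamo

class Scaled(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)

    def forward(self, x):
        # Iterating self.modules() inside forward is what this diff lets dynamo trace.
        n_modules = sum(1 for _ in self.modules())
        return self.lin(x) * n_modules

opt_model = dynamo.optimize("eager")(Scaled())
out = opt_model(torch.randn(2, 4))
```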

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88023
Approved by: https://github.com/tugsbayasgalan, https://github.com/voznesenskym, https://github.com/jansel
2022-11-08 18:22:03 +00:00
ee28b865ee Deprecate TypedStorage, its derived classes, and all of their public methods (#85303)
Part of #85302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85303
Approved by: https://github.com/ezyang
2022-11-08 18:11:01 +00:00
53ca5ad347 enable scalar reduction with dim=-1 (#88628)
Tested with all samples for `sum`, but this also fixes all sample errors on other reductions (amin, amax, any, all, etc.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88628
Approved by: https://github.com/desertfire
2022-11-08 17:06:28 +00:00
89c5819626 Dynamo DDP accuracy bench uses find_unused_parameters (#88645)
- find_unused_parameters adds a slight overhead, but it is required
  when users do not manually specify which parameters to ignore
  because they will not receive grads.  In some models, some parameters
  never receive grads, and DDP then throws an exception as it waits
  for a grad for each parameter (see the sketch below).
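
A hedged sketch of the flag in question (assumes the process group is already initialized, e.g. via torchrun):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.nn.Linear(8, 8).cuda()
ddp_model = DDP(
    model,
    device_ids=[torch.cuda.current_device()],
    # Tolerate parameters that receive no gradient in a given iteration,
    # at the cost of a small per-iteration overhead.
    find_unused_parameters=True,
)
```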

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88645
Approved by: https://github.com/soumith
2022-11-08 16:13:10 +00:00
fcc2883476 Clean up SymFloat binding to cover all functions (#88370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88370
Approved by: https://github.com/ezyang
2022-11-08 14:32:47 +00:00
6abaa5946d Fix categorization of sym_int method (#88369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88369
Approved by: https://github.com/ezyang, https://github.com/bdhirsh, https://github.com/anjali411
2022-11-08 14:32:47 +00:00
bc66ddb5cb Add torch.distributed.DistBackendError exception type, thrown from C10D_NCCL_CHECK (#88134)
Currently all of the distributed errors are thrown from the `TORCH_CHECK` macro, which throws a generic `RuntimeError`. This change introduces a new error type, `DistBackendError`, which derives from `RuntimeError` to signify that there was an error with the backend communication library. This allows for better error handling and analysis at higher levels in the stack. Motivation: https://docs.google.com/document/d/1j6VPOkC6znscliFuiDWMuMV1_fH4Abgdq7TCHMcXai4/edit#heading=h.a9rc38misyx8

Changes:
- introduce new error type
- Update `C10D_NCCL_CHECK`

Sample script to demonstrate new error type

```python
# python -m torch.distributed.run --nproc_per_node=2 <script>.py

import torch
import torch.distributed as dist

if __name__ == "__main__":
    dist.init_process_group("nccl")
    dist.broadcast(torch.tensor([1, 2, 3]).cuda(), 0)
```
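
Continuing from the sample above, a hedged sketch of the kind of handling this enables (the handler functions are hypothetical placeholders):

```python
try:
    dist.broadcast(torch.tensor([1, 2, 3]).cuda(), 0)
except dist.DistBackendError:
    # Communication-backend failure (e.g. NCCL); catch it before the generic case,
    # since DistBackendError derives from RuntimeError.
    recover_from_backend_failure()   # hypothetical handler
except RuntimeError:
    handle_other_runtime_error()     # hypothetical handler
```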

Differential Revision: [D40998803](https://our.internmc.facebook.com/intern/diff/D40998803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88134
Approved by: https://github.com/rohan-varma
2022-11-08 13:26:42 +00:00
1a7c4b0de7 Create _make_alias to preserve the name of a function when creating an alias (#88114)
Before, an alias would inherit the name of the aliased function, which was
very confusing and prevented the homogeneous treatment of references
that we rely on later in this stack
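
A rough sketch of the idea (illustrative only, not the actual `_make_alias` helper in torch):

```python
def make_alias(fn, name):
    """Wrap fn so the alias carries its own name instead of inheriting fn's."""
    def alias(*args, **kwargs):
        return fn(*args, **kwargs)
    alias.__name__ = name
    alias.__qualname__ = name
    alias.__doc__ = f"Alias of {fn.__name__}."
    return alias

# e.g. a reference module could register `absolute` as an alias of its `abs` ref:
# absolute = make_alias(abs_ref, "absolute")
```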

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88114
Approved by: https://github.com/mruberry
2022-11-08 13:09:34 +00:00
af09270e10 nvprims bookend non compute (#88457)
Cherry-picking: https://github.com/csarofeen/pytorch/pull/2099

1. enabling the bookend non-compute-ops pass on nvfuser
2. fixing the bookend op check on intermediate tensors used as partition inputs
3. python tests added for the `getitem` special handling in bookend_non_compute removal
4. patching dfs by excluding dfs within a partition to avoid exceeding the recursion limit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88457
Approved by: https://github.com/SherlockNoMad
2022-11-08 12:06:35 +00:00
8cb5c5543e Revive static_runtime_benchmark build and test (#87660)
This build uses the wrong BUILD_ENVIRONMENT `pytorch-linux-focal-py3`, so it hasn't been run for a long time (it was forgotten). The name is probably the old name of the build environment used in the past; today's convention doesn't have the `pytorch-` prefix. There is a TODO for this:

> TODO: this condition is never (BUILD_ENVIRONMENT doesn't start with pytorch-), need to fix this.

This is done as part of [T131829540](https://www.internalfb.com/intern/tasks/?t=131829540), where we want the
 `static_runtime_benchmark` build and test jobs to run in OSS CI to avoid breaking internal builds.

* I also fixed some compiler warnings treated as errors (`-Werror=sign-compare`, `-Werror,-Wunused-const-variable`) and a gcc7 compatibility issue along the way, because this job hasn't been run for a long time.
* Reviving this test also revealed a small bug in the `PrepackWeights` test in `test_static_runtime.cc`, added recently in https://github.com/pytorch/pytorch/pull/85289. The test refers to an internal op and should only run internally. This has been fixed by https://github.com/pytorch/pytorch/pull/87799 (to be merged)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87660
Approved by: https://github.com/malfet
2022-11-08 08:32:45 +00:00
02c1a304fa [ci] increase timeout time of ios test app build (#88611)
We were timing out; 5 minutes seems a bit short.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88611
Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/ZainRizvi
2022-11-08 06:29:11 +00:00
8f66ae413f [Autograd] Use in-place input accumulation fast path for dense Tensors. (#88339)
There is a fast path in InputBuffer to steal memory when the use count is zero; however, it is only used for sparse Tensors. According to Natalia, this is just because it wasn't obvious that there would be a benefit for dense Tensors, so there was no reason to live dangerously. However, I've noticed large Tensors in internal models that would benefit from this optimization as well.

Differential Revision: [D40946601](https://our.internmc.facebook.com/intern/diff/D40946601/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88339
Approved by: https://github.com/ngimel
2022-11-08 05:37:43 +00:00
ffb6e68962 Add missing args to DDP constructor in distributed.pyi (#88209)
Summary: As title. And remove all unnecessary `pyre-fixme` for the unknown arg in call-site.

Test Plan: CI

Differential Revision: D40874013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88209
Approved by: https://github.com/zhaojuanmao
2022-11-08 05:12:18 +00:00
ced71e8e82 [Pytorch] add an option to disable TORCH_WARN and TORCH_WARN_ONCE log (#87188)
Summary: Add an option to disable TORCH_WARN; some ops can trigger spammy TORCH_WARN logs, which is not desired in certain scenarios.

Test Plan:
Tested with
-pt.disable_warn = 1 and -pt.disable_warn = 0

verified TORCH_WARN and TORCH_WARN_ONCE are properly handled

tested with
-pt.strip_error_messages = 1, -pt.disable_warn = 0

verified strip error message is respected when warn is printed

Differential Revision: D40321550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87188
Approved by: https://github.com/kurtamohler, https://github.com/ezyang
2022-11-08 04:49:45 +00:00
ed97e0aa29 [vision hash update] update the pinned vision hash (#88465)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88465
Approved by: https://github.com/pytorchbot
2022-11-08 03:29:55 +00:00
9f11ce7f67 Setting pickle_module isn't working (#88570)
When setting pickle_module, it currently always gets overwritten by the default pickle module. This should only happen when pickle_module isn't specified.
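
A small sketch of the affected call path (using the standard pickle module just to show the keyword being honored):

```python
import pickle
import torch

# Before this fix, the pickle_module argument was silently replaced by the
# default; after the fix, the module passed here is actually used.
torch.save({"w": torch.randn(3)}, "ckpt.pt", pickle_module=pickle)
state = torch.load("ckpt.pt", pickle_module=pickle)
```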

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88570
Approved by: https://github.com/kit1980
2022-11-08 03:26:46 +00:00
825f4e602b Add support for symbolic shapes to sparse tensor (#88573)
Along the way, I undid making sparse/dense dim symint (they're
dimensions, so they should be static.)

Also symintify set_indices_and_values_unsafe

There is a little bit of a nontrivial infra change here: previously, we didn't populate the strides field on sparse tensors. It is now populated with "empty" strides, which meant that sparse tensors were falsely reporting that they were non-overlapping and dense/contiguous. I added a hack to work around this case.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88573
Approved by: https://github.com/anjali411
2022-11-08 03:13:42 +00:00
c29502dd2f [LTC] Remove view (#88445)
Summary:
This pull request removes the last view op, the original view.

Test Plan:
./build/bin/test_lazy --gtest_filter=LazyOpsTest.TestView*

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88445
Approved by: https://github.com/JackCaoG, https://github.com/antoniojkim, https://github.com/Krovatkin
2022-11-08 02:22:02 +00:00
f2000842a8 Do not use double for single-prec upsample (#88277)
I'm not sure what the best behaviour would be here, but it feels a bit strange to perform parts of `float32` computations as `float64` and then downcast them back to `float32`.

Use `at::opmath_type` rather than `at::acc_type`, as no accumulation is used in the op.

I don't know much about double vs single precision scalar perf on x86 CPU, but before the change:
```
python -c "import timeit;import torch;x=torch.arange(100, dtype=torch.float32).reshape(1, 1, 10, 10); print(timeit.Timer(stmt='torch.nn.functional.interpolate(x, scale_factor=2.0, mode=\"bilinear\", align_corners=False)', globals={'x':x, 'torch':torch}).timeit())"
11.337517574429512
```
After the change:
```
$ python -c "import timeit;import torch;x=torch.arange(100, dtype=torch.float32).reshape(1, 1, 10, 10); print(timeit.Timer(stmt='torch.nn.functional.interpolate(x, scale_factor=2.0, mode=\"bilinear\", align_corners=False)', globals={'x':x, 'torch':torch}).timeit())"
10.513805857859552
```
I.e. the double-precision path was roughly 7% slower (measured on Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz)

NOTE:
 - `at::acc_type<float, false>` yields `double`
 - `at::acc_type<float, true>` returns `float`.

Fixes https://github.com/pytorch/pytorch/issues/87968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88277
Approved by: https://github.com/mingfeima, https://github.com/ngimel, https://github.com/jgong5
2022-11-08 01:46:25 +00:00
4ea2310f1e Fix typos used in documents under torch directory (#88483)
This PR fixes typos, in comments of Python files, that are found from a search box at https://pytorch.org/docs/master/search.html.
This is a follow-up of #88300.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88483
Approved by: https://github.com/kit1980
2022-11-08 01:33:36 +00:00
d25be63c05 [Reland] Use sudo when reset NVIDIA devices (#88605)
I accidentally delete my remote branch, so I need to create a new PR for this fix (instead of updating the reverted PR https://github.com/pytorch/pytorch/pull/88531)

TIL, sudo echo doesn't do what I think it does; the correct syntax is `echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset`, which grants sudo permission to the tee command.

### Testing

Due diligence: actually logged in to `i-07e62045d15df3629` and made sure that the command works
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88605
Approved by: https://github.com/ZainRizvi
2022-11-08 01:17:35 +00:00
c77368d416 Implement a constructor for nested_tensor that is similar to torch.tensor() (#88213)
Summary: This diff merges both previous implementations of constructors for nested tensors, the one from lists of tensors and the one with arbitrary python lists, and implements it in pytorch core, so no extensions are needed to construct NT.
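
A hedged sketch of what the merged constructor accepts, assuming the `torch.nested.nested_tensor` entry point:

```python
import torch

# From a list of tensors with different leading lengths
nt = torch.nested.nested_tensor([torch.randn(2, 3), torch.randn(4, 3)])

# From a nested Python list, with no extension module required
nt2 = torch.nested.nested_tensor([[1.0, 2.0], [3.0, 4.0, 5.0]])
```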

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88213
Approved by: https://github.com/cpuhrsch
2022-11-08 00:03:18 +00:00
72a7351993 Pin linux ninja dep to 1.10.2 (#88548)
The latest version 1.11.1 breaks PyTorch CI.  A bunch of tests are failing now in master d1ee073041.  Curiously, the latest commit 81042d3a53 looks green, but it's good to pin this dependency anyway

https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt#L95-L97 has a curious note about ninja and why it's not part of the docker container (need to revisit this later on):

```
#ninja
#Description: build system.  Note that it install from
#here breaks things so it is commented out
```

This is one more reason to justify the effort of consolidating all pip and conda dependencies to get rid of this family of issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88548
Approved by: https://github.com/clee2000
2022-11-07 23:53:17 +00:00
fdf2865108 Use test/test-reports for inductor (#88533)
So that the test reports can be picked up automatically by the CI and uploaded to S3. Later on, this will allow querying these test reports from our Rockset DB.

For example https://github.com/pytorch/pytorch/actions/runs/3382363153/jobs/5617382531 `Upload test statistics` shows:

```
+ python3 -m tools.stats.print_test_stats --upload-to-s3 --compare-with-s3 test
No tests in reports found in test
```

678d038001 inductor artifacts are also empty zips at the moment

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88533
Approved by: https://github.com/desertfire
2022-11-07 23:49:21 +00:00
eb3f975c6e Fix segfault in has_torch_function (#88559)
Fixes #83908

`PySequence_Fast` may return `NULL` to indicate an error was raised, in which
case `sequence_has_torch_function` will dereference a null pointer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88559
Approved by: https://github.com/ezyang, https://github.com/Skylion007, https://github.com/hameerabbasi
2022-11-07 23:48:39 +00:00
4796e23bbb Fix pull docs build running with a schedule and increase cpp doc timeout to 4h (#88589)
* After https://github.com/pytorch/pytorch/pull/88373, the pull workflow can now be triggered on a schedule. This changes the assumption in the doc build workflow, where the schedule event is used to determine whether the docs should be pushed
* I'll create a follow-up issue to see if it's possible to improve the performance of the cpp doc build job.  At the moment, it uses a linux.12xlarge runner and still couldn't finish the job within 3h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88589
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi
2022-11-07 23:05:14 +00:00
d453b3c4d4 Add a note on the stability of linalg functions. (#88313)
This was long overdue, as it keeps coming up in issues.

Fixes https://github.com/pytorch/pytorch/issues/85950
Fixes https://github.com/pytorch/pytorch/issues/59720
Fixes https://github.com/pytorch/pytorch/issues/59782

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88313
Approved by: https://github.com/soumith, https://github.com/mruberry
2022-11-07 22:44:23 +00:00
b00c43b310 Revert "fallback for scatter_(scalar) (#88210)"
This reverts commit 896fa8c5c9b0191c9621e04ab5e20057614d48ad.

Reverted https://github.com/pytorch/pytorch/pull/88210 on behalf of https://github.com/suo due to this broke inductor tests, see: 896fa8c5c9
2022-11-07 22:29:56 +00:00
0e67b2f7dd Dynamo Dashboard Improvements (#88516)
Implement various features in https://github.com/pytorch/torchdynamo/issues/1644:
- Upload nightly run logs to /fsx before parsing - for backing up parsing failures.
- Flag models with (1) < 0.95x speedup, (2) > 2min compile time, (3) < 0.9x compression ratio
- Flag models that were passing yesterday but failed today.
- Other small bug fixes.

See https://github.com/pytorch/torchdynamo/issues/1831 for sample outputs.
Also tested by running run.sh:
```bash
# Setup the output directory
rm -rf ../test-dynamo-runner-logs-3/
mkdir ../test-dynamo-runner-logs-3/

# Commands for torchbench for device=cuda, dtype=float32 for training and for performance testing
python benchmarks/dynamo/torchbench.py --performance --float32 -dcuda --output=../test-dynamo-runner-logs-3//inductor_torchbench_float32_training_cuda_performance.csv --training --inductor   --no-skip --dashboard --only mobilenet_v2 --cold_start_latency

# Commands for torchbench for device=cuda, dtype=float32 for training and for accuracy testing
python benchmarks/dynamo/torchbench.py --accuracy --float32 -dcuda --output=../test-dynamo-runner-logs-3//inductor_torchbench_float32_training_cuda_accuracy.csv --training --inductor   --no-skip --dashboard --only mobilenet_v2
```

with the command
`python benchmarks/dynamo/runner.py --output-dir ../test-dynamo-runner-logs-3/ --dashboard-archive-path /data/home/williamwen/dynamo-runner-logs-copy --training --run --compilers inductor --flag-compilers inductor --suites torchbench --update-dashboard` (need to comment out the `generate_commands` line and change the github issue ID from 681 to something else).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88516
Approved by: https://github.com/anijain2305
2022-11-07 22:24:44 +00:00
b14e06503a (fix): Add some missing std::moves to C10 (#88512)
I saw some missed optimization opportunities in C10 involving std::move and thought I would submit a PR to fix them. There are particularly many of them in code dealing with the symbolic operators, which are used in quite a few places, including in loops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88512
Approved by: https://github.com/ezyang
2022-11-07 22:17:13 +00:00
d8506ff42b Generalize gesvdjBatched to run whith full_matrices==false (#88502)
As brought up in https://github.com/pytorch/pytorch/issues/86234#issuecomment-1268296036, our heuristic for which SVD backend to choose was not great in some cases.
The case with room for improvement is when we have a
large batch of very small non-square matrices.

This PR adapts the calling code to gesvdj by creating two temporary
square buffers so that gesvdjBatched can be called, and then copies the
result back into the output buffers.

We then modify the heuristic that chooses between gesvdj and
gesvdjBatched.

Fixes https://github.com/pytorch/pytorch/issues/86234
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88502
Approved by: https://github.com/IvanYashchuk, https://github.com/nikitaved, https://github.com/mruberry, https://github.com/xwang233
2022-11-07 22:07:48 +00:00
9dadf8fcc2 [DataPipes] Add group support to the sharding_filter (#88424)
Differential Revision: [D41006747](https://our.internmc.facebook.com/intern/diff/D41006747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88424
Approved by: https://github.com/ejguan
2022-11-07 22:07:01 +00:00
23a3eb37cf SymIntify _copy functionalization kernels (and _copy_out too) (#88572)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88572
Approved by: https://github.com/anjali411, https://github.com/bdhirsh
2022-11-07 21:40:10 +00:00
896fa8c5c9 fallback for scatter_(scalar) (#88210)
`scatter_reduce_` overloads can only accept a `Tensor` src.
`scatter_`, on the other hand, can accept a `Number` src, so we switch the fallback from `scatter_reduce_` to `scatter_`.
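
A small sketch of the difference at the Python level:

```python
import torch

x = torch.zeros(3, 5)
idx = torch.tensor([[0, 1, 2]])

# scatter_ accepts a Python scalar as src ...
x.scatter_(0, idx, 1.0)

# ... while scatter_reduce_ requires a Tensor src.
src = torch.ones(1, 3)
x.scatter_reduce_(0, idx, src, reduce="sum")
```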

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88210
Approved by: https://github.com/desertfire
2022-11-07 21:25:55 +00:00
0a69c50a46 Publicly expose _LRScheduler to LRScheduler (#88503)
Fixes #61232
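
A hedged sketch of what the public name allows (assuming the exported `LRScheduler` base class; the warmup rule here is arbitrary):

```python
from torch.optim.lr_scheduler import LRScheduler  # previously only _LRScheduler

class LinearWarmup(LRScheduler):
    def __init__(self, optimizer, warmup_steps=10, last_epoch=-1):
        self.warmup_steps = warmup_steps
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        # Scale the base learning rates linearly over the first warmup_steps epochs.
        scale = min(1.0, (self.last_epoch + 1) / self.warmup_steps)
        return [base_lr * scale for base_lr in self.base_lrs]
```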

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88503
Approved by: https://github.com/soulitzer
2022-11-07 21:15:10 +00:00
05b9e8ec00 Upload test stats for inductor workflow (#88535)
We miss this new workflow, so none of its test stats are uploaded to rockset
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88535
Approved by: https://github.com/desertfire
2022-11-07 21:04:02 +00:00
a37524085d [torchdynamo] support torch.autograd._profiler_enabled (#88378)
fix https://github.com/pytorch/torchdynamo/issues/1826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88378
Approved by: https://github.com/voznesenskym
2022-11-07 20:36:26 +00:00
95d57b54e0 Handle pin_memory in refs.randn (#88473)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88473
Approved by: https://github.com/mruberry
2022-11-07 20:25:56 +00:00
bf49dada1e [nvfuser] skip extremal tests on rocm (#88587)
Summary:

These are failing on ROCm, so disable them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88587
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2022-11-07 20:03:27 +00:00
7bf9db81c5 Revert "Use sudo when reset NVIDIA devices (#88531)"
This reverts commit 505486ce9321bc22d2156a1a9b97fe474a05b53b.

Reverted https://github.com/pytorch/pytorch/pull/88531 on behalf of https://github.com/huydhn due to Wrong sudo echo usage, should use tee instead
2022-11-07 19:59:42 +00:00
78a0ca29d9 Revert "[fix] allow saving python attr on Tensor and Parameter via torch.save (#81616)"
This reverts commit 54b6188cc6dee45b775d688223b847dc8ea85bff.

Reverted https://github.com/pytorch/pytorch/pull/81616 on behalf of https://github.com/mehtanirav due to Internal publishing is broken
2022-11-07 18:51:16 +00:00
91a4039842 [exir][fx] PassManager error handling (#88520)
Summary:
* Added an error message for when the result is not a PassResult
* Modified the error handling to capture exceptions that happen in the check() function
* Consolidated inplace_wrapper and pass_result_wrapper

Test Plan: CI

Differential Revision: D40950135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88520
Approved by: https://github.com/SherlockNoMad
2022-11-07 18:42:41 +00:00
bd1ffc6501 [Dynamo] Fix bug: GradMode doesn't carry grad state correctly after graph break (#88537)
Fixes https://github.com/pytorch/torchdynamo/issues/1446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88537
Approved by: https://github.com/jansel
2022-11-07 18:03:31 +00:00
6663ae5537 [2/n] Thread PG: add class _World to distributed_c10d.py (#781) (#88471)
Summary:
X-link: https://github.com/pytorch/torchrec/pull/781

Move a bunch of globals to instance methods and replace all uses of them.

We move all PG related globals under World and use a singleton instance under _world.

This creates an undocumented extension point to inject full control of how c10d
state behaves.

One simple hack is to change _world to an implementation that uses a threadlocal
and enables per-thread PGs.

It almost gets DDP working; the PG is only missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

This change ensures BC by keeping the global variables around and having the default _World wrap them.

I have relinked this diff to a new github PR, so that I can update it. The original PR is
> Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348

Differential Revision: D40236769

Pulled By: yhcharles

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88471
Approved by: https://github.com/gnadathur, https://github.com/rohan-varma
2022-11-07 17:56:40 +00:00
fc8f2f66fe Clarify rules for which commit is used in CI (#88425)
The old information was out of date.  Updating it as per @janeyx99's feedback

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88425
Approved by: https://github.com/malfet
2022-11-07 17:38:42 +00:00
c407a7b203 Upgrade Linux NVIDIA driver to the latest prod version (#88517)
The driver (515.76) is downloaded from https://www.nvidia.com/en-us/drivers/unix. This should help address the issue with A10G GPU on G5 runners according to NVIDIA. This is to address https://github.com/pytorch/pytorch/issues/88352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88517
Approved by: https://github.com/ZainRizvi
2022-11-07 17:26:28 +00:00
505486ce93 Use sudo when reset NVIDIA devices (#88531)
Per title, I should have known, i.e. https://ossci-raw-job-status.s3.amazonaws.com/log/9307292415

```
2022-11-04T23:52:18.2921665Z + echo 1
2022-11-04T23:52:18.2921862Z Reseting 0000:00:1e.0 (enabled state: 0)
2022-11-04T23:52:18.2922186Z .github/scripts/install_nvidia_utils_linux.sh: line 77: /sys/bus/pci/devices/0000:00:1e.0/reset: Permission denied
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88531
Approved by: https://github.com/ZainRizvi
2022-11-07 17:19:02 +00:00
cec4bd99b0 allow XLA folks update the pin (#88527)
This is one of the files the XLA team needs to update occasionally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88527
Approved by: https://github.com/wconstab
2022-11-07 17:02:08 +00:00
a16ced03c9 reland "fix as_strided_scatter_backward (#87646)" (#88342)
This reverts commit 71fb763e5452881cb3be8fefa9419b785d0a61e2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88342
Approved by: https://github.com/zou3519
2022-11-07 15:00:58 +00:00
dd43903fa9 [Static Runtime] Fix tensor_split sections overload (#88113)
Summary:
D40798763 broke this op. Unfortunately, it wasn't caught at land time due to the recent OSS Static Runtime test problems.

The problem is C++ overload resolution. After D40798763, the int that we were passing to `at::native::tensor_split` was getting implicitly converted to `IntArrayRef`. Fix this by converting the int to a `SymInt` and calling the correct overload.
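
For reference, a rough illustration of the Python-level distinction between the two overloads that got confused on the C++ side:

```python
import torch

x = torch.arange(10)

# "sections" overload: a single int
torch.tensor_split(x, 3)        # -> 3 roughly equal chunks

# "indices" overload: a list (IntArrayRef on the C++ side) -- the overload the
# plain int was accidentally resolving to after the regression
torch.tensor_split(x, [2, 5])   # -> x[:2], x[2:5], x[5:]
```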

Test Plan:
```
buck2 test caffe2/benchmarks/static_runtime:static_runtime_cpptest -- Tensor_Split --run-disabled
```

Differential Revision: D40862394

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88113
Approved by: https://github.com/hlu1
2022-11-07 14:36:39 +00:00
7076a6481d [xla hash update] update the pinned xla hash (#88070)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88070
Approved by: https://github.com/pytorchbot
2022-11-07 10:22:46 +00:00
ad27d762a7 Support sign for HF models like ElectraForQuestionAnswering (#88160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88160
Approved by: https://github.com/jansel
2022-11-07 09:10:37 +00:00
a9d37ce8f5 Support reduction vectorization (#87356)
This PR optimizes the reduction implementation with `at::vec`. The main idea is the same as in the aten implementation.
- Step1: Parallelize and vectorize the reduction implementation
- Step2: Invoke `at::vec::vec_reduce_all` to reduce the vector generated at step 1 to a single scalar
- Step3: Handle the tail elements

For the implementation, we create two kernels - `CppVecKernel` and `CppKernel`. The code block generation is as follows step by step.

- Gen the non-reduction loop - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1008-L1010)
- Gen the reduction initialization both for vectorization and non-vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1015)
- Gen the reduction loop for the vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1021-L1023)
- Gen the code to reduce the vector to scalar - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1033)
- Gen the reduction loop for the non-vectorization kernel - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1042)
- Do some post-reduction things like store reduction value - [Code](https://github.com/pytorch/pytorch/blob/gh/EikanWang/9/head/torch/_inductor/codegen/cpp.py#L1049)

```python
# Gen the non-reduction loop
for loop in CppVecKernel.NoneReductionLoop:
    # Gen the reduction initialization both for vectorization and non-vectorization kernel
    CppVecKernel.ReductionPrefix
    # Gen the reduction loop for the vectorization kernel
    for loop in CppVecKernel.ReductionLoop
        CppVecKernel.Loads
        CppVecKernel.Compute
        CppVecKernel.Stores
    # Gen the code to reduce the vector to scalar
    CppVecKernel.ReductionSuffix
    # Gen the reduction loop for the non-vectorization kernel
    for loop in CppKernel.ReductionLoop
        CppKernel.Loads
        CppKernel.Compute
        CppKernel.Stores
    # The reduction is almost finished. To do some post-reduction things like store reduction value.
    CppKernel.ReductionSuffix
```
The code snippet for maximum reduction exemplifies the idea. More detailed comments are inlined.

```C++
    {
        // Declare reduction for at::vec::Vectorized since it is not built-in data type.
        #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})

        float tmp4 = 0;
        // tmp4_vec is used to vectorize the sum reduction for tmp4
        auto tmp4_vec = at::vec::Vectorized<float>(tmp4);
        float tmp6 = 0;
        // tmp6_vec is used to vectorize the sum reduction for tmp6
        auto tmp6_vec = at::vec::Vectorized<float>(tmp6);
        #pragma omp parallel num_threads(48)
        {
            // Parallelize the vectorized reduction
            #pragma omp for reduction(+:tmp4_vec) reduction(+:tmp6_vec)
            for(long i0=0; i0<192; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
                auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
                auto tmp2 = tmp0 - tmp1;
                auto tmp3 = tmp2.abs();
                auto tmp5 = tmp2 * tmp2;
                tmp4_vec += tmp3;
                tmp6_vec += tmp5;
            }
            // Reduce the tmp4_vec as a scalar and store at tmp4
            tmp4 = at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp4_vec);
            // Reduce the tmp6_vec as a scalar and store at tmp6
            tmp6 = at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp6_vec);
            // Handle the tail elements that could not be vectorized by aten.
            #pragma omp for simd simdlen(4) reduction(+:tmp4) reduction(+:tmp6)
            for(long i0=1536; i0<1536; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = in_ptr1[i0];
                auto tmp2 = tmp0 - tmp1;
                auto tmp3 = std::abs(tmp2);
                auto tmp5 = tmp2 * tmp2;
                tmp4 += tmp3;
                tmp6 += tmp5;
            }
        }
        out_ptr0[0] = tmp4;
        out_ptr1[0] = tmp6;
    }
```

Performance(Measured by operatorbench and the base line of speedup ratio is aten operator performance):
Softmax (1,16,384,384,dim=3) | Speedup ratio (simdlen=None) |  Speedup ratio (simdlen=8) + this PR
-- | -- | --
24c | 0.37410838067524177 | 0.9036240100351164
4c | 0.24655829520907663 | 1.0255329993674518
1c | 0.21595768114988007 | 1.000587368005134

HW Configuration:
SKU: SKX Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
MemTotal:       196708148 kB
MemFree:        89318532 kB
MemBandwidth:  112195.1MB/S

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87356
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-07 06:40:34 +00:00
6541e51ffd Explicit vectorization support for TorchInductor (#87068)
In this PR, we replace OMP SIMD with `aten::vec` to optimize TorchInductor vectorization performance. Take `res=torch.exp(torch.add(x, y))` as the example. The generated code is as follows if `config.cpp.simdlen` is 8.

```C++
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       const float* __restrict__ in_ptr1,
                       float* __restrict__ out_ptr0,
                       const long ks0,
                       const long ks1)
{
    #pragma omp parallel num_threads(48)
    {
        #pragma omp for
        for(long i0=0; i0<((ks0*ks1) / 8); ++i0)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 8*i0);
            auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + 8*i0);
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = tmp2.exp();
            tmp3.store(out_ptr0 + 8*i0);
        }
        #pragma omp for simd simdlen(4)
        for(long i0=8*(((ks0*ks1) / 8)); i0<ks0*ks1; ++i0)
        {
            auto tmp0 = in_ptr0[i0];
            auto tmp1 = in_ptr1[i0];
            auto tmp2 = tmp0 + tmp1;
            auto tmp3 = std::exp(tmp2);
            out_ptr0[i0] = tmp3;
        }
    }
}

```

The major pipeline is as follows.
- Check whether the loop body could be vectorized by `aten::vec`. The checker consists of two parts. [One ](bf66991fc4/torch/_inductor/codegen/cpp.py (L702))is to check whether all the `ops` have been supported. The [other one](355326faa3/torch/_inductor/codegen/cpp.py (L672)) is to check whether the data access could be vectorized.
  - [`CppSimdVecKernelChecker`](355326faa3/torch/_inductor/codegen/cpp.py (L655))
- Create the `aten::vec` kernel and the original omp simd kernel. The original omp simd kernel handles the tail loop when the main loop is vectorized.
  - [`CppSimdVecKernel`](355326faa3/torch/_inductor/codegen/cpp.py (L601))
  - [`CppSimdVecOverrides`](355326faa3/torch/_inductor/codegen/cpp.py (L159)): The ops that we have supported on the top of `aten::vec`
  - Create kernel
    - [`aten::vec` kernel](355326faa3/torch/_inductor/codegen/cpp.py (L924))
    - [`Original CPP kernel - OMP SIMD`](355326faa3/torch/_inductor/codegen/cpp.py (L929))
- Generate code
  - [`CppKernelProxy`](355326faa3/torch/_inductor/codegen/cpp.py (L753)) is used to combine the `aten::vec` kernel and original cpp kernel
    - [Vectorize the most inner loop](355326faa3/torch/_inductor/codegen/cpp.py (L753))
    - [Generate code](355326faa3/torch/_inductor/codegen/cpp.py (L821))

Next steps:
- [x] Support reduction
- [x] Vectorize the tail loop with `aten::vec`
- [ ] Support BF16
- [ ] Optimize the loop condition and loop index calculation by replacing `div` with `add`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87068
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-07 06:24:14 +00:00
a95419b47e use faster cache flush in triton benchmarking (#88557)
Speeds up autotuning a little bit more (about 90s -> 75s for coat_lite_mini)

@bertmaher, I've put in a workaround so that internal doesn't break, but it can be removed once triton is updated internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88557
Approved by: https://github.com/anijain2305
2022-11-07 05:48:22 +00:00
eda247ee6c [Dynamo] fix torchdynamo's TVM meta schedule backend (#88249)
Note that the previous `optimize_torch` functionality of pytorch does not work with the default pytorch release, which is built with the CXX11 ABI off, as TVM by default needs the CXX11 ABI for its builds. Source: [1](https://discuss.tvm.apache.org/t/can-someone-please-give-me-the-steps-to-use-pt-tvmdsoop/12525), [2](https://discuss.pytorch.org/t/undefined-symbol-when-import-lltm-cpp-extension/32627). It is easier for users to tune with meta schedule than to find a CXX11-compatible pytorch, turn on the `pt-tvmdsoop` flag in TVM, and rebuild it. This could be revisited once the `pt-tvmdsoop` flag is updated and turned on by default in TVM.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88249
Approved by: https://github.com/jansel
2022-11-07 01:33:57 +00:00
791d9ee253 [inductor] Add lowering for as_strided_scatter (#88379)
Ref pytorch/torchdynamo#327

The use of as_strided does require in-memory manipulations; however, this
lowering allows those memory ops to be fused with any preceding calculations.
e.g.

```
def f(a, b):
    return torch.as_strided_scatter(
        a * 8 + 10,
        b * 2 - 4,
        size=(a.numel() // 2,),
        stride=(2,))
```

Before, this compiled to two kernels plus a call to `aten.as_strided_scatter`;
with this PR it compiles to just two kernels and no additional operator calls.

In theory I think this could be a decomposition, but in practice I saw the
`output_view.copy_(src)` being optimized out in some cases when this was
implemented as a decomposition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88379
Approved by: https://github.com/jansel
2022-11-07 00:59:29 +00:00
81042d3a53 Revert "Reenable optimizer overlap tests (#88439)"
This reverts commit da452bcadbc6f34989c6b3b0db6075a272aa9891.

Reverted https://github.com/pytorch/pytorch/pull/88439 on behalf of https://github.com/huydhn due to This change breaks trunk due to a land race missing reason parameter to sandcastle_skip_if da452bcadb
2022-11-06 02:29:53 +00:00
bbaa0637df Add error inputs to gaussian_nll_loss OpInfo (#88486)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88486
Approved by: https://github.com/lezcano
2022-11-05 20:10:54 +00:00
404f254e20 Upstream apply_optim_in_backward from TorchRec (#87397) (#88539)
Summary:

Upstreaming this as part of sharing common APIs. This is just a plain
move; any changes needed to support DDP / FSDP will come in follow-up diffs.

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D40564646

fbshipit-source-id: 619c434e02196812f8d4db1e40d07290e08b18f9
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88539
Approved by: https://github.com/awgu
2022-11-05 18:28:07 +00:00
da452bcadb Reenable optimizer overlap tests (#88439)
Closes https://github.com/pytorch/pytorch/issues/73259. Not sure of the root cause, but CI seems fine with these tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88439
Approved by: https://github.com/awgu
2022-11-05 18:26:01 +00:00
d1ee073041 Handle case when candidate is empty (#88359)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88359
Approved by: https://github.com/wconstab
2022-11-05 17:19:40 +00:00
46730aec35 [Reland] Fix primTorch compute_elementwise_output_strides (#88525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88525
Approved by: https://github.com/desertfire
2022-11-05 05:42:07 +00:00
0e3031f7e7 Functionalize and compute joint simultaneously. (#88063)
This also comes with some bug fixes that were uncovered from doing
this:

- Forward device calls to inner tensor in FunctionalTensorWrapper

- Make legacyExtractDispatchKey exclude Functionalize, so that
  it can get at the real device type key.  This is noncontroversial.

- Stop stripping dense from key set.  The reason for this is
  FunctionalWrapperTensor may be used in contexts where people
  query if it is dense or not.  If it doesn't report this correctly
  (from the dispatch key), it will cause errors.  This caused some
  torchbench models to fail when I did one-pass tracing.

- Save and restore reapply views TLS correctly

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88063
Approved by: https://github.com/bdhirsh
2022-11-05 03:52:40 +00:00
957a9b63c5 fx.replace_pattern accepts pattern/replacement as GraphModule (#88479)
The symbolic tracer is no longer the default tracer for producing fx graphs.
SubgraphRewriter should thus accept a raw GraphModule rather than use the symbolic tracer by default.
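
A hedged sketch of what this allows, assuming `replace_pattern` now takes GraphModules directly:

```python
import torch
import torch.fx as fx

def pattern(x):
    return torch.relu(x) + 1

def replacement(x):
    return torch.clamp(x, min=0) + 1

def f(x):
    return torch.relu(x) + 1

gm = fx.symbolic_trace(f)

# pattern/replacement can now be passed as GraphModules produced by any tracer,
# not only as callables that replace_pattern runs through the symbolic tracer.
fx.replace_pattern(gm, fx.symbolic_trace(pattern), fx.symbolic_trace(replacement))
```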

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88479
Approved by: https://github.com/jerryzh168
2022-11-05 03:35:30 +00:00
4bb5c2c205 Add docstring to DDPOptimizer (#88521)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88521
Approved by: https://github.com/aazzolini
2022-11-05 02:41:26 +00:00
1f32c3c087 Add single-process DDP accuracy support to dynamo benchmark suite (#88511)
- does not intend to support multi-process, as that is more complex
  and we have torchbench scripts for that
- currently only works in accuracy mode as this was the main goal,
  but could be extended for measuring single-gpu perf impact of
  graph breaks

Run with

`python benchmarks/dynamo/torchbench.py --inductor --training --accuracy --only hf_Bert --ddp`

Example output
```
cuda train hf_Bert
[2022-11-04 18:52:08,304] torch._inductor.compile_fx: [WARNING] skipping cudagraphs due to complex input striding
PASS
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88511
Approved by: https://github.com/davidberard98, https://github.com/aazzolini
2022-11-05 02:41:17 +00:00
3fd0729bb6 DDPOptimizer replace debug=True/False with using torchdynamo logger (#88480)
Example output:

```
2022-11-04 05:09:29,525] torch._dynamo.optimizations.distributed: [INFO]
DDPOptimizer bucket assignments
┌─────────┬────────────┬───────────────────┐
│   Index │   Size (b) │ Param Names       │
├─────────┼────────────┼───────────────────┤
│       0 │  100120020 │ self_net_6_weight │
├─────────┼────────────┼───────────────────┤
│         │            │ self_net_6_bias   │
├─────────┼────────────┼───────────────────┤
│         │            │ self_net_4_weight │
├─────────┼────────────┼───────────────────┤
│         │            │ self_net_4_bias   │
├─────────┼────────────┼───────────────────┤
│       1 │  100020000 │ self_net_2_weight │
├─────────┼────────────┼───────────────────┤
│         │            │ self_net_2_bias   │
├─────────┼────────────┼───────────────────┤
│       2 │     220000 │ self_net_0_weight │
├─────────┼────────────┼───────────────────┤
│         │            │ self_net_0_bias   │
└─────────┴────────────┴───────────────────┘
[2022-11-04 05:09:29,527] torch._dynamo.optimizations.distributed: [DEBUG]
---orig graph---
graph():
    %inputs : torch.Tensor [#users=1] = placeholder[target=inputs]
    %self_net_0 : [#users=1] = call_module[target=self_net_0](args = (%inputs,), kwargs = {})
    %self_net_1 : [#users=1] = call_module[target=self_net_1](args = (%self_net_0,), kwargs = {})
    %self_net_2 : [#users=1] = call_module[target=self_net_2](args = (%self_net_1,), kwargs = {})
    %self_net_3 : [#users=1] = call_module[target=self_net_3](args = (%self_net_2,), kwargs = {})
    %self_net_4 : [#users=1] = call_module[target=self_net_4](args = (%self_net_3,), kwargs = {})
    %self_net_5 : [#users=1] = call_module[target=self_net_5](args = (%self_net_4,), kwargs = {})
    %self_net_6 : [#users=1] = call_module[target=self_net_6](args = (%self_net_5,), kwargs = {})
    %self_net_7 : [#users=1] = call_module[target=self_net_7](args = (%self_net_6,), kwargs = {})
    return (self_net_7,)

---split graph---
graph():
    %inputs : torch.Tensor [#users=1] = placeholder[target=inputs]
    %submod_0 : [#users=1] = call_module[target=submod_0](args = (%inputs,), kwargs = {})
    %submod_1 : [#users=1] = call_module[target=submod_1](args = (%submod_0,), kwargs = {})
    %submod_2 : [#users=1] = call_module[target=submod_2](args = (%submod_1,), kwargs = {})
    return (submod_2,)

---submod_0 graph---
graph():
    %inputs : [#users=1] = placeholder[target=inputs]
    %self_net_0 : [#users=1] = call_module[target=self_net_0](args = (%inputs,), kwargs = {})
    %self_net_1 : [#users=1] = call_module[target=self_net_1](args = (%self_net_0,), kwargs = {})
    return self_net_1

---submod_1 graph---
graph():
    %self_net_1 : [#users=1] = placeholder[target=self_net_1]
    %self_net_2 : [#users=1] = call_module[target=self_net_2](args = (%self_net_1,), kwargs = {})
    %self_net_3 : [#users=1] = call_module[target=self_net_3](args = (%self_net_2,), kwargs = {})
    return self_net_3

---submod_2 graph---
graph():
    %self_net_3 : [#users=1] = placeholder[target=self_net_3]
    %self_net_4 : [#users=1] = call_module[target=self_net_4](args = (%self_net_3,), kwargs = {})
    %self_net_5 : [#users=1] = call_module[target=self_net_5](args = (%self_net_4,), kwargs = {})
    %self_net_6 : [#users=1] = call_module[target=self_net_6](args = (%self_net_5,), kwargs = {})
    %self_net_7 : [#users=1] = call_module[target=self_net_7](args = (%self_net_6,), kwargs = {})
    return self_net_7

---------------
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88480
Approved by: https://github.com/anj-s, https://github.com/davidberard98
2022-11-05 02:40:51 +00:00
52375a0fd2 nvprims native batch norm patch (#88455)
Cherry-picking: https://github.com/csarofeen/pytorch/pull/2104

- [x] Added an explicit cast on inputs to nvprims.native_batch_norm. This keeps the cast out of the fusion definition, where it was causing issues.
- [x] added a python repro with dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88455
Approved by: https://github.com/mruberry, https://github.com/IvanYashchuk
2022-11-05 02:22:27 +00:00
b1116a5117 [Dynamo] Improve BuiltinVariable log when incorrect arg count happens (#88409)
Fixes https://github.com/pytorch/torchdynamo/issues/1832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88409
Approved by: https://github.com/mlazos
2022-11-05 00:17:18 +00:00
5220d07d2c Fix minifier accuracy msg (#88515)
Fixes https://github.com/pytorch/torchdynamo/issues/1809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88515
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
2022-11-04 23:26:44 +00:00
dde9affeaa Populate self.export in InstructionTranslatorBase (#88508)
Summary:

This is a followup to https://github.com/pytorch/pytorch/pull/88354/files#diff-622913fdb49db90d6f3a8ab225b4badb7996023e6498e9f7c6d03fe9f32d0986R836

Reference to self.export got added to InstructionTranslatorBase (i.e. STORE_ATTR) but self.export is populated only for InstructionTranslators.

Here's an example failure

```
   File "/scratch/williamwen/work/pytorch/torch/_dynamo/symbolic_convert.py", line 322, in step
    getattr(self, inst.opname)(inst)
  File "/scratch/williamwen/work/pytorch/torch/_dynamo/symbolic_convert.py", line 844, in STORE_ATTR
    not self.export
AttributeError: 'InliningInstructionTranslator' object has no attribute 'export'
```

Let's populate the base class with the export flag.

Test Plan:

python test/dynamo/test_export_mutations.py
python test/dynamo/test_export.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88508
Approved by: https://github.com/tugsbayasgalan
2022-11-04 23:23:41 +00:00
afdc2283ef [QNNPACK] Add unaligned attributes where asan fails (#88276)
Summary: Bypass "Runtime error: store to misaligned address [...] for type 'uint16_t' (aka 'unsigned short'), which requires 2 byte alignment"

Test Plan:
One of the failing tests, now passes
`buck test fbsource//arvr/mode/platform010/dev-asan fbsource//arvr/libraries/eye/engine:sys_test_eyetrackingenginevisioninterface`

Reviewed By: kimishpatel, salilsdesai

Differential Revision: D40918376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88276
Approved by: https://github.com/manuelcandales
2022-11-04 23:01:45 +00:00
7560a7b27c [Quant] Respect non_leaf_module_list for activation modules (#88498)
Summary: This commit fixes the bug where `non_leaf_module_list`
was not respected for activation modules like `torch.nn.Sigmoid`
and `torch.nn.Tanh`. Today, these modules default to
`default_fixed_qparams_range_0to1_fake_quant`, and there is no
way to configure them to use any other activation_post_process
(e.g. FixedQParamsObserver) (see this [mapping](dc00bb51b8/torch/ao/quantization/quantization_mappings.py (L188-L193))).
`non_leaf_module_list` is a "list of non-leaf modules we want
to add observer" (see prepare docstring). If the user explicitly
specified to insert observers for these modules, we should respect
that instead of continuing to use the default.

Test Plan:
python test/test_quantization.py TestQuantizeEagerPTQStatic.test_activations_in_non_leaf_module_list

Reviewers: vkuzo, jerryzh168

Subscribers: vkuzo, jerryzh168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88498
Approved by: https://github.com/jerryzh168
2022-11-04 22:46:55 +00:00
5af3feefab [BE] Update native_functions.yaml README; we do not support Tensor! (#88513)
Just a doc update to minimize confusion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88513
Approved by: https://github.com/bdhirsh
2022-11-04 21:48:29 +00:00
678d038001 Support DDP ignored parameters in DDPOptimizer (#88460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88460
Approved by: https://github.com/aazzolini
2022-11-04 21:42:15 +00:00
ff6770a9a1 enable backward for log1p (sparse layouts) (#88155)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88155
Approved by: https://github.com/cpuhrsch
2022-11-04 20:59:26 +00:00
6938dd0b2c Support sparse inputs to deg2rad (#88156)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88156
Approved by: https://github.com/cpuhrsch
2022-11-04 20:59:26 +00:00
1964d8c34f Enable sparse_csr autograd testing for relu (#88154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88154
Approved by: https://github.com/cpuhrsch
2022-11-04 20:59:23 +00:00
f03302ba49 Add sparse layout support for torch.frac (#88153)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88153
Approved by: https://github.com/cpuhrsch
2022-11-04 20:59:22 +00:00
d632d94cc7 Disable mem leak check (#88373)
tbh at this point it might be easier to make a new workflow and copy the relevant jobs...

Changes:
* Disable cuda mem leak check except for on scheduled workflows
* Make pull and trunk run on a schedule which will run the memory leak check
* Periodic will always run the memory leak check -> periodic does not have parallelization anymore
* Concurrency check changed to be slightly more generous
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88373
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2022-11-04 20:47:42 +00:00
093e220836 Re-enable inductor models tests as periodical jobs (#88509)
Run every 4 hours, same as periodic, but offset by an hour. This should give us some signal instead of completely disabling these jobs on master after https://github.com/pytorch/pytorch/pull/88374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88509
Approved by: https://github.com/malfet
2022-11-04 20:35:13 +00:00
3e6579b8f6 Don't print fatal:... in generate_torch_version.py (#88335)
During build, users commonly see a message like
```
fatal: no tag exactly matches 'd8b4f33324b1eb6c1103874764116fb68e0d0af4'
```
which is usually ignored when builds succeed, but has confused users when the build fails (due to a different issue). This PR removes the red herring, since this usually prints during local development when tags are not found.

We catch the exception anyway and handle it under the hood, so we don't need to print it and confuse the user.
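
Roughly the shape of the fix (a sketch, not the actual generate_torch_version.py code):

```python
import subprocess

def try_get_exact_tag():
    """Return the exact git tag for HEAD, or None, without echoing git's stderr."""
    try:
        return subprocess.check_output(
            ["git", "describe", "--tags", "--exact-match"],
            stderr=subprocess.DEVNULL,  # swallow "fatal: no tag exactly matches ..."
        ).decode("ascii").strip()
    except subprocess.CalledProcessError:
        return None
```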

Test plan:
Note that builds on trunk currently have this line; Cmd-F 'fatal: no tag exactly matches' in https://github.com/pytorch/pytorch/actions/runs/3379162092/jobs/5610355820.

Then check in the PR build to see that the line no longer appears.

I also tagged my commit locally and printed what tag would be--this code and the old code printed the same results for what tag would be.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88335
Approved by: https://github.com/seemethere
2022-11-04 20:34:23 +00:00
955cbe610b [inductor] Handle the case where kwargs contains tensor (#88417)
Summary: Fix https://github.com/pytorch/torchdynamo/issues/1805;
currently inductor does not allow any tensor in kwargs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88417
Approved by: https://github.com/ngimel
2022-11-04 20:29:03 +00:00
e940a2f8e2 Add nondeterministic error for scatter (#88244)
Fixes #88096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88244
Approved by: https://github.com/ezyang, https://github.com/mruberry
2022-11-04 20:23:59 +00:00
6575174dcb [fx2ait] fixes for AITSplitter (#87805)
Summary: propagate lower settings to AITSplitter settings.

Reviewed By: yinghai, qxy11

Differential Revision: D40568216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87805
Approved by: https://github.com/yinghai
2022-11-04 20:18:08 +00:00
7b419e8513 [NVFuser] Upstream push 1026 (#87779)
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

* codegen improvement:
    i. allow non-root trivial reductions, allow empty/no-op fusion
    ii. fixes vectorization checks and size calculation
    iii. bank conflict handle improvement
    iv. enables transpose scheduler

* misc:
    i. CI tests failure fixes
    ii. cpp tests file clean up
    iii. trivial forwarding supports added in codegen runtime
    iv. added factory methods support in codegen

Commits that's in this PR from the devel branch:

```
7117a7e37ebec372d9e802fdfb8abb7786960f4a patching nvfuser conv cudnn test numerics mismatch (#2048)
65af1a4e7013f070df1ba33701f2d524de79d096 Inserting sync for redundant parallel types is already done at the (#2023)
6ac74d181689c8f135f60bfc1ec139d88941c98c Fix sync map (#2047)
f5bca333355e2c0033523f3402de5b8aac602c00 Bank conflict checker improvements (#2032)
d2ca7e3fd203537946be3f7b435303c60fa7f51e Minor update on cp.async code generation. (#1901)
d36cf61f5570c9c992a748126287c4e7432228e0 Test file cleanup (#2040)
0b8e83f49c2ea9f04a4aad5061c1e7f4268474c6 Allow non-root trivial reductions (#2037)
a2dfe40b27cd3f5c04207596f0a1818fbd5e5439 Fix vectorize size calculation (#2035)
e040676a317fe34ea5875276270c7be88f6eaa56 Use withPredicate to replace setPredicate to maintain Exprs immutable (#2025)
197221b847ad5eb347d7ec1cf2706733aacbf97c removing ci workflow (#2034)
40e2703d00795526e7855860aa00b9ab7160755f Reduction rand like patch (#2031)
bc772661cbdb3b711d8e9854ae9b8b7052e3e4a3 Add utility for checking bank conflict of shared memory (#2029)
ddd1cf7695f3fb172a0e4bcb8e4004573617a037 Add back FusionReductionWithTrivialReduction_CUDA (#2030)
fbd97e5ef15fa0f7573800e6fbb5743463fd9e57 Revert "Cleanup trivial reduction workarounds (#2006)" (#2024)
bca20c1dfb8aa8d881fc7973e7579ce82bc6a894 Cleanup trivial reduction workarounds (#2006)
e4b65850eee1d70084105bb6e1f290651adde23e Trivial forwarding (#1995)
1a0e355b5027ed0df501989194ee8f2be3fdd37a Fix contiguity analysis of predicates to match updated contiguity. (#1991)
a4effa6a5f7066647519dc56e854f4c8a2efd2a7 Enable output allocation cache (#2010)
35440b7953ed8da164a5fb28f87d7fd760ac5e00 Patching bn inference (#2016)
0f9f0b4060dc8ca18dc65779cfd7e0776b6b38e8 Add matmul benchmark (#2007)
45045cd05ea268f510587321dbcc8d7c2977cdab Enable tests previously disabled due to an aliasing bug (#2005)
967aa77d2c8e360c7c01587522eec1c1d377c87e Contiguous indexing for View operations (#1990)
a43cb20f48943595894e345865bc1eabf58a5b48 Make inlining even more modular (#2004)
dc458358c0ac91dfaf4e6655a9b3fc206fc0c897 Test util cleanup (#2003)
3ca21ebe4d213f0070ffdfa4ae5d7f6cb0b8e870 More strict validation (#2000)
a7a7d573310c4707a9f381831d3114210461af01 Fix build problem (#1999)
fc235b064e27921fa9d6dbb9dc7055e5bae1c222 Just fixes comments (#1998)
482386c0509fee6edb2964c5ae72074791f3e43a cleanup (#1997)
4cbe0db6558a82c3097d281eec9c85ad2ea0893a Improve divisible split detection (#1970)
42ccc52bdc18bab0330f4b93ed1399164e2980c9 Minor build fix. (#1996)
fcf8c091f72d46f3055975a35afd06263324ede6 Cleanup of lower_utils.cpp: Isolate out GpuLower usage (#1989)
15f2f6dba8cbf408ec93c344767c1862c30f7ecc Move ConcretizedBroadcastDomains to shared_ptr in GpuLower. (#1988)
8f1c7f52679a3ad6acfd419d28a2f4be4a7d89e2 Minor cleanup lower_unroll.cpp (#1994)
1d9858c80319ca7f0037db7de5f04e47f540d76c Minor cleanup (#1992)
f262d9cab59f41c669f53799c6d4a6b9fc4267eb Add support for uniform RNG (#1986)
eb1dad10c73f855eb1ecb20a8b1f7b6edb0c9ea3 Remove non-const functions, remove GpuLower instance on build, pass in ca_map. (#1987)
634820c5e3586c0fe44132c51179b3155be18072 Add support for some empty fusion (#1981)
eabe8d844ad765ee4973faa4821d451ef71b83c3 Segment self mapping fusions (#1954)
e96aacfd9cf9b3c6d08f120282762489bdf540c8 Enable Transpose operation (#1882)
425dce2777420248e9f08893765b5402644f4161 Add a null scheduler that helps segmenting away no-op schedules (#1835)
306d4a68f127dd1b854b749855e48ba23444ba60 Fix canScheduleCompileTime check of transpose scheduler (#1969)
b1bd32cc1b2ae7bbd44701477bddbcfa6642a9be Minor fix (#1967)
bd93578143c1763c1e00ba613a017f8130a6b989 Enable transpose scheduler (#1927)
b7a206e93b4ac823c791c87f12859cf7af264a4c Move scheduler vectorize utilities into their own file (#1959)
d9420e4ca090489bf210e68e9912bb059b895baf View scheduling (#1928)
c668e13aea0cf21d40f95b48e0163b812712cdf2 Upstream push ci fixes (#1965)
c40202bb40ce955955bb97b12762ef3b6b612997 Fix dump effective bandwidth (#1962)
93505bcbb90a7849bd67090fe5708d867e8909e4 WAR on index mapping when exact and permissive maps differ (#1960)
45e95fd1d3c773ee9b2a21d79624c279d269da9f Allow splitting inner-most ID to create virtual innermost ID in transpose scheduler (#1930)
a3ecb339442131f87842eb56955e4f17c544e99f Improve the comments at the beginning of index_compute.h (#1946)
f7bc3417cc2923a635042cc6cc361b2f344248d6 Remove unused variables (#1955)
df3393adbb5cb0309d091f358cfa98706bd4d313 Some cleanup (#1957)
7d1d7c8724ab5a226fad0f5a80feeac04975a496 TVDomainGuard factory (#1953)
357ba224c0fb41ed3e4e8594d95599c973f4a0ca Fill allocation with nan on tests (#1956)
8eafc54685d406f5ac527bcbacc475fda4492d7a Fix detection of unmappable root domains (#1952)
90a51f282601ba8ebd4c84b9334efd7762a234bc Some indexing cleanups, Add eye support (#1940)
ddc01e4e16428aec92f9c84d698f959b6436a971 Exclude unsupported data types (#1951)
992e17c0688fe690c51b50e81a75803621b7e6aa test the groups the same order as they are merged (#1949)
208262b75d1fed0597a0329d61d57bc8bcd7ff14 Move detection of self mapping IDs to IterDomainGraph from (#1941)
ac4de38c6ee53b366e85fdfe408c3642d32b57df Merge pull request #1945 from csarofeen/master_merge_0828
631094891a96f715d8c9925fb73d41013ca7f2e3 Add full, full_like, zeros, zeros_like, ones, ones_like (#1943)
aab10bce4541204c46b91ff0f0ed9878aec1bfc4 Merge remote-tracking branch 'upstream/viable/strict' into HEAD
4c254c063bb55887b45677e3812357556a7aa80d Fix arange when step is negative (#1942)
89330aa23aa804340b2406ab58899d816e3dc3d2 Tensor factories must set the output shape as its input (#1939)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D40869846](https://our.internmc.facebook.com/intern/diff/D40869846)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87779
Approved by: https://github.com/davidberard98
2022-11-04 20:04:34 +00:00
15e54293ef [MPS] Fix embedding backward with scalar index (#82809)
### Description
Previously the embedding backward always expanded a `-1` dim to the indices, resulting in the following error when the index is a scalar:

```
 error: Rank of data array must equal number of outer dimensions in indices array + rank of slice to update, 2 != 1 + 0
-:8:10: note: see current operation: %5 = "mps.scatter_nd"(%0, %arg1, %4) {batch_dims = 0 : ui32, mode = 0 : i32} : (tensor<10x5xf16>,
```

This change makes the expansion conditional.

Reproducer:

```python
def repro():
    w = torch.tensor([[-2.6465,  2.5859,  0.4688,  1.7949,  3.2676],
        [-3.1641,  8.9375,  5.7578, -2.9453, -6.5469],
        [ 2.0469,  1.3516, -8.7344,  6.0000,  1.3906],
        [ 6.5781,  7.8438,  6.9766,  3.2891, -5.1172],
        [-7.9414,  7.7344,  4.1875,  2.8574,  2.9531],
        [-0.4844, -5.6328, -6.8359, -4.5156,  3.7891],
        [ 4.9375,  6.6094,  6.7031,  0.6719, -6.4219],
        [ 7.0469,  8.2031,  4.4453,  1.7129, -2.4688],
        [ 1.2207, -3.3750, -2.4531,  7.4062, -6.0469],
        [-8.9688,  2.2656,  2.4160, -1.0176,  8.4531]], dtype=torch.float32, requires_grad=True)
    x = torch.tensor(5)
    out = torch.nn.functional.embedding(x, w)
    out.sum().backward()

    w_mps = w.detach().clone().to("mps").requires_grad_()
    x_mps = x.to("mps")
    out = torch.nn.functional.embedding(x_mps, w_mps)
    out.sum().backward() # error
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82809
Approved by: https://github.com/malfet
2022-11-04 19:43:56 +00:00
5b767d404e Modified roundup_power2_divisions to specify the number of divisions for each power of two interval (#87290)
Summary:
Improved the roundup_power2_divisions knob so it allows better control of rounding in the PyTorch CUDA Caching Allocator.

This new version allows setting the number of divisions per power-of-two interval, starting from 1MB and ending at 64GB and above. An example use case is when rounding is desirable for small allocations, but there are also very large allocations which are persistent and thus would not benefit from rounding and would take up extra space.
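
A hedged illustration of how such a knob is consumed; the bracketed interval syntax and the specific division counts below are assumptions for illustration, not taken verbatim from this PR:

```python
# Hedged sketch: per-interval rounding divisions for the CUDA caching allocator.
# Requires a CUDA GPU; the config string format here is an assumption.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "roundup_power2_divisions:[256:1,512:2,1024:4,>:8]"

import torch  # the setting must be in place before the first CUDA allocation

x = torch.empty(1_000_000, device="cuda")  # request sizes are rounded per the configured divisions
```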

Test Plan: Tested locally

Differential Revision: D40103909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87290
Approved by: https://github.com/zdevito
2022-11-04 19:31:16 +00:00
b78b8727ff [vulkan] enable prepacking for Batchnorm op (#88433)
Adds a `BatchNormPackedContext` so that the `batchnorm` op can use prepacking.

Differential Revision: [D40721546](https://our.internmc.facebook.com/intern/diff/D40721546/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88433
Approved by: https://github.com/manuelcandales
2022-11-04 19:24:13 +00:00
53eac1d482 Revert "Revert "Put Python Dispatcher cache in dict, clear it on new registrations. (#88329)"" (#88489)
The bug was that I was accidentally caching at the wrong key name, so
we were never actually hitting the cache.  I've renamed the resolved
key to final_key to avoid shadowing in this way.

This reverts commit 410ce96a23a3496a45478e0b25ffac53aa3c116f.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88489
Approved by: https://github.com/albanD
2022-11-04 19:23:04 +00:00
79abea5683 nvprim python runtime dtype correctness patch (#88452)
Cherry-picking: https://github.com/csarofeen/pytorch/pull/2133

- [x] casts FusionDefinition output to original dtype recorded in the GraphModule
- [x] add a python repro with dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88452
Approved by: https://github.com/IvanYashchuk, https://github.com/mruberry
2022-11-04 19:17:07 +00:00
8c1c6759b2 Revert "remove assert_allclose from torch.testing (#87974)"
This reverts commit 5669e10d37fa3cca21cf82c843ae4c4e79da1b89.

Reverted https://github.com/pytorch/pytorch/pull/87974 on behalf of https://github.com/mehtanirav due to Internal breakages from method removal
2022-11-04 19:12:37 +00:00
bda688c186 Fix typo in clones (#88501)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88501
Approved by: https://github.com/wconstab
2022-11-04 19:12:19 +00:00
633f0d620d [torch package] Treat builtins as default extern module (#88385)
Summary: When using torch deploy, if we do an fx transformation and then try to pickle/unpickle an fx GraphModule, it's possible that the GraphModule's code depends on `builtins` but we didn't add it to the extern modules.

Reviewed By: PaliC

Differential Revision: D40958730

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88385
Approved by: https://github.com/PaliC
2022-11-04 17:35:12 +00:00
ead36e5a90 Add dep on Accelerate framework to torch podspecs (#88422)
A dep on Accelerate was added in https://github.com/pytorch/pytorch/pull/80449. We need to declare this dep in our podspec, otherwise users will have to add the Accelerate framework to their projects manually.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88422
Approved by: https://github.com/kimishpatel, https://github.com/malfet
2022-11-04 17:31:17 +00:00
dc00bb51b8 [Vulkan][TCC] Add tests for conv2d prepack context (#88316)
Summary:
Implement Vulkan tests for the create/run context functions in Convolution.cpp, their transposed versions and their backwards compatible versions:
- create_conv2d_context
- run_conv2d_context
- create_tconv2d_context
- run_tconv2d_context
- conv2d_clamp_prepack
- conv2d_clamp_run

Test Plan:
On Mac
```
cd ~/fbsource
buck run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
```

Reviewed By: salilsdesai

Differential Revision: D40935343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88316
Approved by: https://github.com/salilsdesai
2022-11-04 12:07:12 +00:00
a171b0636a Add use_lazy_shape flag to GenLazyIr class (#88444)
Add use_lazy_shape flag to GenLazyIr class to allow XLA to use its custom shape class. The default value is kept to use lazy shape, so this PR does not introduce any new behaviors.

PyTorch/XLA companion PR: https://github.com/pytorch/xla/pull/4111
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88444
Approved by: https://github.com/alanwaketan, https://github.com/wconstab
2022-11-04 08:23:56 +00:00
b3206268ac TorchDynamo: enable convolution and batchnorm folding for inference path (#87435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87435
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-04 05:24:57 +00:00
fbd08fb358 Introduce TORCH_DISABLE_GPU_ASSERTS (#84190)
- Asserts for CUDA are enabled by default
- Disabled for ROCm by default by setting `TORCH_DISABLE_GPU_ASSERTS` to `ON`
- Can be enabled for ROCm by setting above variable to`OFF` during build or can be forcefully enabled by setting `ROCM_FORCE_ENABLE_GPU_ASSERTS:BOOL=ON`

This is follow up changes as per comment in PR #81790, comment [link](https://github.com/pytorch/pytorch/pull/81790#issuecomment-1215929021)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84190
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2022-11-04 04:43:05 +00:00
70b00b1383 Add hf_bert + DDP multigpu test (#88435)
Spot-checks an e2e model working with ddp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88435
Approved by: https://github.com/davidberard98
2022-11-04 03:17:48 +00:00
71f793d312 TorchDynamo: Add linear binary fusion for cpu in BF16 inference mode (#87066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87066
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-04 02:40:29 +00:00
7d95b1e344 Run all fallback kernels with FakeTensor (#88248)
This improves the memory compression of resnet18 from 0.84 -> 0.94 on inductor no-cudagraphs. It does mean that any extern kernel which incorrectly computes strides will be a hard error at runtime, but that's an issue we are going to have to face with dynamic shapes anyway. CC @ezyang, @SherlockNoMad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88248
Approved by: https://github.com/ezyang
2022-11-04 02:06:38 +00:00
e4efea4f14 TorchDynamo: Add linear unary fusion for cpu in BF16 inference mode (#87065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87065
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-04 01:26:08 +00:00
657f2e12f0 [MPS] Add native cumsum implementation (#88319)
Using https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/4057333-cumulativesumwithtensor?language=objc

Fall back to CPU if running on older macOS versions.
In `unary_op`, add the output tensor dims/dtype to the graph key (as even in the default op we check the output graph type).
Also, upcast int16 to int32, as the MPS cumsum op on Ventura returns incorrect results for the int16 type (and the upcast makes total sense for int8, as the chances of overflow are very high).
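
A minimal sketch of exercising the new path (assumes an MPS-capable Mac; values are illustrative):

```python
# Minimal sketch: cumulative sum computed natively on the MPS backend
# (falls back to CPU on older macOS versions per this commit).
import torch

if torch.backends.mps.is_available():
    x = torch.arange(6.0, device="mps")
    print(torch.cumsum(x, dim=0))  # 0, 1, 3, 6, 10, 15

```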
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88319
Approved by: https://github.com/kulinseth
2022-11-04 01:22:41 +00:00
52173188ef TorchDynamo: Add convolution binary fusion for cpu in inference mode (#87064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87064
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-11-04 01:10:05 +00:00
2ce2fc133d Disable Current Modes when printing Tensor (#88344)
Fix for https://github.com/pytorch/pytorch/issues/88087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88344
Approved by: https://github.com/ezyang, https://github.com/samdow
2022-11-04 00:45:35 +00:00
e804c72294 [LTC] Update merge_rules.yaml (#88291)
Summary:
Some of the LTC code-gen infra has been moved from codegen/ to torchgen/. Update the merge_rules.yaml to reflect that.

Test Plan:
New GH PRs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88291
Approved by: https://github.com/malfet
2022-11-04 00:06:07 +00:00
a84d68cdfd [FSDP][Docs] Reword sharding_strategy docs and other minor doc changes (#88431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88431
Approved by: https://github.com/mrshenli
2022-11-03 23:32:41 +00:00
ff23e07b2e [FSDP][Docs] Simplify CPU offload docs (#88430)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88430
Approved by: https://github.com/mrshenli
2022-11-03 23:32:41 +00:00
4de50b2521 [FSDP] Allow to use TorchDispatch with FSDP (#88014)
Add `_no_dispatch_record_stream` to disable TorchDispatch before calling `record_stream()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88014
Approved by: https://github.com/awgu
2022-11-03 23:15:56 +00:00
31ebd3cc2f Reset NVIDIA devices stuck in failed mode (#88459)
Try to reset the NVIDIA devices if they get stuck in failed mode per comment in https://github.com/pytorch/pytorch/issues/88388

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88459
Approved by: https://github.com/malfet
2022-11-03 23:15:41 +00:00
ab8f3333ff [FSDP][Docs] Simplify mixed_precision ctor docs (#88429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88429
Approved by: https://github.com/mrshenli
2022-11-03 23:15:32 +00:00
36582574f3 [dynamo] Skip mutation detection for inference mode (#88406)
Skip the mutation detection for inference_mode and raise a warning instead. This helps one internal model.

Related to https://github.com/pytorch/torchdynamo/issues/1768
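
A hedged sketch of the situation this addresses (the `"eager"` backend and the function are illustrative):

```python
# Hedged sketch: calling a dynamo-optimized function under torch.inference_mode(),
# where version-counter-based mutation detection is now skipped with a warning.
import torch
import torch._dynamo as dynamo

@dynamo.optimize("eager")  # backend chosen purely for illustration
def f(x):
    return x.sin() + 1

with torch.inference_mode():
    out = f(torch.randn(4))
```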

@ezyang What do you think about this? The issue is that the Dynamo mutation detector uses the version counter to detect mutation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88406
Approved by: https://github.com/ezyang
2022-11-03 22:56:05 +00:00
410ce96a23 Revert "Put Python Dispatcher cache in dict, clear it on new registrations. (#88329)"
This reverts commit 86c7cd287caeb23c227d97d283e58bc123294746.

Reverted https://github.com/pytorch/pytorch/pull/88329 on behalf of https://github.com/clee2000 due to test_decomp takes an extra 2 hours in some jobs, windows takes so long it times out
2022-11-03 21:57:19 +00:00
9946041a3e [functorch] make hessian docs actually use hessian function (#88451)
I was going through the hessian docs to find an example and noticed that these docs don't actually use the hessian function....
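
For reference, a minimal example of the kind of usage the updated docs now show (function and input are illustrative):

```python
# Minimal example of functorch's hessian transform.
import torch
from functorch import hessian

def f(x):
    return x.sin().sum()

x = torch.randn(3)
H = hessian(f)(x)  # 3x3 matrix of second derivatives of f at x
print(H.shape)     # torch.Size([3, 3])
```
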
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88451
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2022-11-03 21:50:52 +00:00
ce961b3443 Dont hold onto references of saved tensors in backward (#88247)
This improves memory compression of resnet18 on inductor non-cudagraphs from 0.78 -> 0.84.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88247
Approved by: https://github.com/ezyang
2022-11-03 21:24:32 +00:00
65de9a2b81 Fix fuse_func method overwrite (#87791) (#88193)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87791

Fixing the interface so that the fuse_func is honored and not replaced by the default fuse_known_method.

Test Plan: Wait for sandcastle

Reviewed By: jerryzh168

Differential Revision: D40722395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88193
Approved by: https://github.com/jerryzh168
2022-11-03 20:32:54 +00:00
433746300d [pytorch] Expose EmbeddingPackedParamsBase::unpack to Python (#88362)
Summary:
Users can't call `.unpack()` when they have a quantized Embedding layer because `&EmbeddingPackedParamsBase::unpack` was never exposed to Python through pybind.

This diff fixes that.

Test Plan: CI

Reviewed By: jerryzh168

Differential Revision: D40606585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88362
Approved by: https://github.com/jerryzh168
2022-11-03 20:20:49 +00:00
23a6e15321 [ONNX] Remove the INT64_MAX magic numbers (#88341)
Remove the magic numbers in symbolic opsets and use a INT64_MAX  global instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88341
Approved by: https://github.com/BowenBao
2022-11-03 20:18:36 +00:00
6d7eee04b8 [FSDP] Default to BACKWARD_PRE (#88428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88428
Approved by: https://github.com/mrshenli
2022-11-03 20:16:15 +00:00
c28022d96c [profiler] Add an option initialize kineto profiler on start up (#87226) (#88020)
Summary:
# Initialize Kineto Profiler for on-demand profiling

## TLDR
Overall this patch enables initializing the kineto profiling library on start-up. This is guarded by an env variable that is described a bit more later. The kineto profiler is otherwise initialized lazily when pytorch profiler is invoked.

## Background
We are enabling on-demand profiling capability for pytorch. As users run large distributed training flows this will enable one to capture a pytorch profiler/GPU trace remotely, from outside the process. The kineto library and a monitoring daemon - dynolog- interact to achieve this.

Dynolog will be open sourced by end of October, and has been dogfooded on Meta AI Research cluster.
https://github.com/facebookincubator/dynolog

### How it works
Kineto library registers itself with the dynolog daemon running on the host over inter process communication
```
  | kineto  |   --> (ipcfabric)  --> | dynolog |
   * register()
   * poll for on-demand tracing configs()
```
This feature is currently enabled by setting the env variable `KINETO_USE_DAEMON`.  However, it only works if we initialize kineto, else the thread to talk to dynolog is not spun up.

Related PRs in kineto include
https://github.com/pytorch/kineto/pull/637
https://github.com/pytorch/kineto/pull/653

## TestPlan:
Build pytorch from source (need to set USE_LITE_INTERPRETER_PROFILER=OFF)

Run a simple linear model [example](https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html).

### First run with the env variable set
```
export KINETO_CONFIG=/private/home/bcoutinho//libkineto.conf
export KINETO_USE_DAEMON=1
python3 /private/home/bcoutinho/linear_model.py
```
Output
```
INFO:2022-10-18 09:01:12 4169946:4169946 init.cpp:98] Registering daemon config loader
cuda:0
```
We can trigger a trace using the dynolog client tool
```
#> dyno gputrace --log-file /tmp/gpu_trace_test.json
response length = 147
response = {"activityProfilersBusy":0,"activityProfilersTriggered":[4116844],"eventProfilersBusy":0,"eventProfilersTriggered":[],"processesMatched":[4116844]}
Matched 1 processes
Trace output files will be written to:
    /tmp/gpu_trace_test_4116844.json
```

### Run without env variable.
```
 python3 ../../linear_model.py
cuda:0
99 1425.056884765625
10099 8.817168235778809
```

## Side effects to initialization

Currently the environment variable should guard users from picking this change up unless intended. The libkineto_init does set up CUPTI APIs and spins up a thread to read on-demand configurations. This should not be problematic; we can provide a more granular init in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87226

Reviewed By: chaekit

Differential Revision: D40558184

Pulled By: briancoutinho

fbshipit-source-id: afea7502b1d72201c00994c87fde63a35783f4d5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88020
Approved by: https://github.com/chaekit
2022-11-03 20:08:16 +00:00
826b4a9c2d [coreml] delegate multiple outputs (#88345)
Summary:
https://www.internalfb.com/code/fbsource/[c0e4da0b5c7fff3b4e31e4611033c30cabdc6aef]/fbcode/caffe2/torch/csrc/jit/backends/backend_detail.cpp?lines=268-276

It seems that the torchscript addition of
`$unpack, = self.__backend.execute( ... `

the comma after unpack forces the result of execute to have only one item. So with this fix, when the number of outputs is > 1, execute returns a list of lists of outputs (basically, the outputs are put into another list before being put into the list we return):
```
[[output1, output2, output3, ...]]
```
instead of
```
[output1, output2, output3, ...]
```

Do we want to fix this in backend_detail, or should we make the change in our delegate to accommodate the torchscript? Proposing this question here. Requesting cccclai and kimishpatel for approval here.

Test Plan: unblocked models for chengxiangyin and models in pytorch playground all passing unit tests

Reviewed By: kimishpatel, cccclai

Differential Revision: D40328684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88345
Approved by: https://github.com/jmdetloff, https://github.com/Skylion007
2022-11-03 20:05:53 +00:00
9533fe9031 [pytorch][vulkan] Add bias storage type to template (#88324)
To enable buffer-based use for bias as well, this diff adds a storage type for
bias to the template.

Differential Revision: [D40689003](https://our.internmc.facebook.com/intern/diff/D40689003/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88324
Approved by: https://github.com/jmdetloff
2022-11-03 20:02:24 +00:00
893f8e3790 [PyTorch][Vulkan] Add template based codegen for shader generation (#88323)
We would like to be able to parameterize kernels such that a parameterized
algorithm can be implemented via templates. We can then profile performance of
a kernel with different parameter values. This enables us to determine what
parameters may work the best for a given kernel or a given device.

In this diff, one such kernel is added for 1x1 conv, which is parameterized across
the size of the tile produced by each invocation.

A few other options for parameters could be:
- One can imagine dtype can also be a parameter such that we can do compute in
fp16 or int8/int16.
- Register blocking for input channels

Differential Revision: [D40280336](https://our.internmc.facebook.com/intern/diff/D40280336/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40280336/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88323
Approved by: https://github.com/jmdetloff
2022-11-03 19:51:51 +00:00
60925fcb7e Dont clone inputs if using fake tensor (#88208)
Not sure that this will really reduce memory use but it is an extraneous copy in our stack right now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88208
Approved by: https://github.com/anijain2305
2022-11-03 19:35:53 +00:00
192e806c26 [Pytorch][vulkan] Generate shader with parameters (#88322)
Parameters such as tile size and weight type and format are embedded within the
shader code. This is used to generate ShaderInfo.

For now we will maintain both ShaderSrc and ShaderInfo so as to transition from
VK_KERNEL to VK_SHADER incrementally. Otherwise we would have to switch multiple
of them at the same time.

Differential Revision: [D40280338](https://our.internmc.facebook.com/intern/diff/D40280338/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88322
Approved by: https://github.com/jmdetloff, https://github.com/mcr229
2022-11-03 19:33:41 +00:00
fe3a226d74 [minor] use set_default_dtype instead of try and finally (#88295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88295
Approved by: https://github.com/mruberry
2022-11-03 19:28:33 +00:00
f8b73340c8 [dashboard] Replace aot_nvfuser with nvprims_nvfuser (#88437)
@IvanYashchuk @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88437
Approved by: https://github.com/soumith
2022-11-03 19:07:03 +00:00
2bda2baad7 [Dynamo][Easy] Fix config.suppress_errors error log (#88402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88402
Approved by: https://github.com/williamwen42
2022-11-03 18:03:36 +00:00
4d62ee1b36 Verbose exc printing fix (#88387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88387
Approved by: https://github.com/tugsbayasgalan
2022-11-03 17:59:05 +00:00
0a274c4b6c [ONNX] Default runtime type checking to raising errors (#86555)
Default runtime type checking to raising errors by changing the default value of `GLOBALS.runtime_type_check_state` to ERRORS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86555
Approved by: https://github.com/BowenBao
2022-11-03 17:41:48 +00:00
d70bc222d8 add parameters check for mkldnn_transpose (#85318)
This PR adds a parameter check for mkldnn_transpose, fixing https://github.com/pytorch/pytorch/issues/85216.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85318
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/leslie-fang-intel
2022-11-03 17:28:33 +00:00
c1dd13fb2f [dynamo] Support compare op for userfunctionvariable (#88372)
Helps reduce graph breaks for one of the training models

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88372
Approved by: https://github.com/jansel
2022-11-03 17:05:50 +00:00
2c46d5725e Disallow module attribute mutation (#88354)
Summary:

See https://github.com/pytorch/torchdynamo/issues/1475

Not allowing any new mutations to happen inside the forward() function during
export.

Test Plan:

Run `python test/dynamo/test_export.py` and make sure it passes

Added new unit tests (3 positive tests and 4 negative tests)

Here's what the actual error looks like

```
  File "/home/mnachin/local/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 322, in step
    getattr(self, inst.opname)(inst)
  File "/home/mnachin/local/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 835, in STORE_ATTR
    assert not self.export, f"Mutating module attribute {inst.argval} during export."
AssertionError: Mutating module attribute a during export.

from user code:
   File "/data/users/mnachin/pytorch/test/dynamo/test_export_mutations.py", line 25, in forward
    self.a = self.a.to(torch.float64)

Set torch._dynamo.config.verbose=True for more information
```
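
For context, a hedged sketch of the kind of module that now triggers this assertion under export (module and values are illustrative):

```python
# Hedged sketch: a module attribute mutation inside forward(), rejected during export.
import torch
import torch._dynamo as dynamo

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.tensor(1.0)

    def forward(self, x):
        self.a = self.a.to(torch.float64)  # STORE_ATTR on the module during export
        return x + self.a

# dynamo.export(M(), torch.randn(3))  # raises "Mutating module attribute a during export."
```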

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88354
Approved by: https://github.com/tugsbayasgalan, https://github.com/jansel
2022-11-03 17:01:22 +00:00
2b117c8436 Revert "Fix primTorch compute_elementwise_output_strides (#88175)"
This reverts commit 1c8a0656d65412b83d3c00f2fc66ab958e991de8.

Reverted https://github.com/pytorch/pytorch/pull/88175 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks cuda 11.6 in trunk. As the PR signal was green, this is probably a landrace
2022-11-03 16:53:04 +00:00
0f6304ef1e disable the out variants in test_cumprod test for inductor (#88328)
`out=` variants aren't supported by autograd and it's not a must fix, so disabling the test (https://github.com/pytorch/torchdynamo/issues/1798) for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88328
Approved by: https://github.com/desertfire
2022-11-03 16:52:37 +00:00
529ba076c6 add an exclude for test_constructor for inductor (#88143)
This test (https://github.com/pytorch/torchdynamo/issues/1800) fails since none of the c-tor ops support `pin_memory=True`. Natalia suggests it's not a priority to fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88143
Approved by: https://github.com/desertfire
2022-11-03 16:21:18 +00:00
002dad35f4 better error message for out= ops (#88367)
In cases where a tensor kwarg is actually "out=", a nicer error message than the following would help:
```
Traceback (most recent call last):
  File "/fsx/users/binbao/pytorch/torch/_inductor/graph.py", line 241, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/fsx/users/binbao/pytorch/torch/_inductor/lowering.py", line 168, in wrapped
    assert not any(isinstance(x, TensorBox) for x in kwargs.values())
AssertionError

```

https://github.com/pytorch/torchdynamo/issues/1798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88367
Approved by: https://github.com/desertfire
2022-11-03 16:20:14 +00:00
b4fcfe77b2 reduce the number of autotuning iterations, don't autotune simple tiled copies (#88386)

Partially fixes https://github.com/pytorch/torchdynamo/issues/1807; reduces compile time for me from 360s to 90s.

Kernels with multiple outputs sometimes autotune to unexpected configs, so I'm limiting the heuristic to relatively safe application.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88386
Approved by: https://github.com/jansel
2022-11-03 15:58:18 +00:00
5e6ceebccb Add support for neg to NestedTensor (#88131)
Partially fixes #86889
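
A minimal sketch of the newly supported op (constructor usage is illustrative):

```python
# Minimal sketch: elementwise negation on a nested tensor.
import torch

nt = torch.nested.nested_tensor([torch.randn(2, 3), torch.randn(4, 3)])
print(torch.neg(nt))  # negates each constituent tensor
```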

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88131
Approved by: https://github.com/drisspg
2022-11-03 15:15:57 +00:00
35be73df09 [FSDP()][Easy] Make fully_shard() only FULL_SHARD (#88260)
We can have a separate API for each of the other sharding strategies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88260
Approved by: https://github.com/mrshenli
2022-11-03 13:41:54 +00:00
fc743ec059 [FSDP()] Have fully_shard() abide by @contract! (#88235)
We are making some progress on composability :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88235
Approved by: https://github.com/mrshenli
2022-11-03 13:41:54 +00:00
63cd5d7e27 Add a shortcut in Makefile for updating triton (#88318)
Summary: Local triton installation needs to be updated after we migrate
to a newer version of triton, e.g.
https://github.com/pytorch/pytorch/pull/88242. The Makefile shortcut
makes that easier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88318
Approved by: https://github.com/ezyang
2022-11-03 13:32:33 +00:00
f884e817d4 Make Python op registration work with torchdeploy/multipy (#87162)
See strategy at PythonOpRegistrationTrampoline.cpp for the
big picture.

Along the way, I made OperatorHandle support == and hashing,
and slightly changed the low level python_dispatch impl API
to disallow empty strings for dispatch key, which had the knock
on effect of requiring us to explicitly make sure we pass in
CompositeImplicitAutograd if we would have passed in "" (I didn't apply
this to the rest of the file because I'm lazy.)

Test strategy is we delete the logic for preventing Python op
registrations in torch from being skipped in a torchdeploy context
and show CI still works.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87162
Approved by: https://github.com/anjali411, https://github.com/bdhirsh
2022-11-03 12:56:44 +00:00
2f296cfdbb Add a reshape_copy operator. (#88314)
The semantics are "as if" you did a reshape, but it always copies
even if the input was directly viewable.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88314
Approved by: https://github.com/albanD
2022-11-03 12:53:51 +00:00
86c7cd287c Put Python Dispatcher cache in dict, clear it on new registrations. (#88329)
The motivation is that I am going to add the ability to temporarily
install entries to the python dispatcher, and to do that, I need
an easier way to clear the cache.  Putting the cache in a dict
centralizes cache clearing in one place.  I then add some easy
cache clearing.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88329
Approved by: https://github.com/albanD
2022-11-03 12:53:51 +00:00
97d3b200ca Unconditionally enable python dispatcher in AOTAutograd (#88365)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88365
Approved by: https://github.com/Chillee
2022-11-03 12:52:19 +00:00
a689502275 [FSDP] Do not include empty state in _flatten_optim_state_dict() (#88353)
983c0e7f31/torch/optim/adam.py (L163)
The above line requires that a candidate optimizer state dict being loaded via `load_state_dict()` has non-empty state for its 0th parameter (via `state_values[0]`). This PR changes FSDP to only include non-empty mappings in the state returned by `_flatten_optim_state_dict()`, which is the subroutine for both `shard_full_optim_state_dict()` and `flatten_sharded_optim_state_dict()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88353
Approved by: https://github.com/fegin
2022-11-03 11:33:10 +00:00
95a9721a15 [FSDP()][Easy] Rename _State to _FSDPState (#88234)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88234
Approved by: https://github.com/mrshenli
2022-11-03 11:29:01 +00:00
0520131ed6 [FSDP()] Rename to fully_shard() and move to _composable/ (#88233)
After internal discussion, we are currently preferring `fully_shard()` as the name of the composable FSDP API.
- `FullyShardedDataParallel` (FSDP) has existing brand value, so the chosen name should try to preserve that. We think this takes precedence over the fact that composable FSDP may encompass more than just the ZeRO-3 approach of _fully sharding_.
    - Given the refactoring efforts, it would also not be challenging to create a new frontend API like `hybrid_shard()` that calls into the same underlying initialization and runtime except for a different `ShardingStrategy`. In other words, we do not have to coalesce all sharding strategies under `fully_shard()`.
- The other composable APIs are verbs (`replicate()`, `checkpoint()`), so the chosen name should be a verb.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88233
Approved by: https://github.com/mrshenli
2022-11-03 11:29:01 +00:00
54b6188cc6 [fix] allow saving python attr on Tensor and Parameter via torch.save (#81616)
Fixes: https://github.com/pytorch/pytorch/issues/72129

TODO:
* [x] Fix for Parameter

Benchmark
(Measurable diff for small tensors)
```
[-------------- Save and Load --------------]
                    |  After PR  |  Before PR
1 threads: ----------------------------------
      ()            |    111.7   |     106.9
      (4, 4)        |    114.4   |     109.2
      (128, 128)    |    135.2   |     128.3
      (1024, 1024)  |   1431.9   |    1431.3

Times are in microseconds (us).
```

<details>

<summary> Benchmark Script </summary>

```python
import torch
from torch.testing._internal.common_utils import BytesIOContext
from torch.utils import benchmark
import pickle

shapes = ((), (4, 4), (128, 128), (1024, 1024))

sizes = [1, 64, 1024, 10000]
results = []

def save_load_fn(t):
    with BytesIOContext() as f:
        torch.save(t, f)
        f.seek(0)
        torch.load(f)

for shape in shapes:
    t = torch.randn(shape)
    label = 'Save and Load'
    sub_label = f'{shape}'
    results.append(benchmark.Timer(
        stmt='save_load_fn(t)',
        globals={'t': t, 'save_load_fn':save_load_fn},
        label=label,
        sub_label=sub_label,
        description='Before PR',
    ).blocked_autorange(min_run_time=2))

compare = benchmark.Compare(results)
compare.print()

with open('before_pr.pkl', 'wb') as f:
    pickle.dump(results, f)

# with open('after_pr.pkl', 'rb') as f:
#     after_pr = pickle.load(f)

# with open('before_pr.pkl', 'rb') as f:
#     before_pr = pickle.load(f)

# compare = benchmark.Compare(after_pr + before_pr)
# compare.print()
```

</details>

NOTE : **BC-Breaking** : After this PR, all tensors (also regular tensors) will be serialised using `_rebuild_from_type_v2`.
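
For illustration, a minimal sketch of the behavior this enables (the attribute name is arbitrary):

```python
# Minimal sketch: a python attribute set on a tensor now round-trips through torch.save/torch.load.
import io
import torch

t = torch.randn(2, 2)
t.foo = "bar"  # arbitrary python attribute

buf = io.BytesIO()
torch.save(t, buf)
buf.seek(0)
loaded = torch.load(buf)
print(loaded.foo)  # "bar"
```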

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81616
Approved by: https://github.com/albanD, https://github.com/kurtamohler
2022-11-03 09:57:47 +00:00
1c8a0656d6 Fix primTorch compute_elementwise_output_strides (#88175)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88175
Approved by: https://github.com/ngimel
2022-11-03 08:38:55 +00:00
0efd4e92b5 Make GenLazyNativeFuncDefinition generator to be customizable in lazy codegen (#87823)
As part of the ongoing LTC migration effort, PyTorch/XLA is updating its codegen to use `xla::Shape` instead of `torch::lazy::Shape`. To achieve this, this PR updates the codegen to make the `GenLazyNativeFuncDefinition` generator customizable.

The existing `GenLazyNativeFuncDefinition` is kept by using the initial default values, so this change should not introduce any new behaviors to the existing codegen in PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87823
Approved by: https://github.com/alanwaketan, https://github.com/wconstab
2022-11-03 06:19:40 +00:00
a8f40b39ce Update all ONNX symbolics with new JitScalarType API (#87245)
Fixes https://github.com/pytorch/pytorch/issues/84365 and more

This PR addresses not only the issue above, but the entire family of issues related to `torch._C.Value.type()` parsing when `scalarType()` or `dtype()` is not available.

This issue existed before `JitScalarType` was introduced, but the new implementation refactored the bug back in because the new APIs `from_name` and `from_dtype` require parsing `torch._C.Value.type()` to get proper inputs, which is exactly the root cause of this family of bugs.

Therefore `from_name` and `from_dtype` must be called only when the implementor knows the `name` and `dtype` without parsing a `torch._C.Value`. To handle the corner cases hidden within `torch._C.Value`, a new `from_value` API was introduced, and it should be used in favor of the former ones for most cases. The new API is safer and doesn't require type parsing from the user, which could trigger JIT asserts in the core of pytorch.

Although CI is passing for all tests, please review carefully all symbolics/helpers refactoring to make sure the meaning/intention of the old calls is not changed in the new calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87245
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2022-11-03 03:01:33 +00:00
b013825c7d [vision hash update] update the pinned vision hash (#88382)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88382
Approved by: https://github.com/pytorchbot
2022-11-03 02:57:27 +00:00
5fb9c113ae Update pybind11 to v2.10.1 (#88332)
I am one of the maintainers of pybind11, and a frequent PyTorch user. We added quite a lot of bugfixes and performance improvements in 2.10.1 (see the changelog for full details) and I wanted to upstream them to PyTorch.

Our releases are tested throughout Google's codebase, including on their global builds of PyTorch, so there should be no surprises.

The main new feature is opt-in Eigen Tensor to NumPy casters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88332
Approved by: https://github.com/soumith
2022-11-03 02:53:26 +00:00
e59d307e2f Improve perf by avoiding implicit string creation in c10_cuda_check_implementation (#88350)
Test Plan: Sandcastle

Differential Revision: D40949947

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88350
Approved by: https://github.com/Skylion007, https://github.com/soumith
2022-11-03 02:48:41 +00:00
a0fb234b45 [codegen] using TORCH_LIBRARY_FRAGMENT for some namespaces (#88229)
Summary:
Sometimes we want to extend an existing custom namespace library, instead of creating a new one,
but we don't have a namespace config right now, so we hardcode some custom libraries defined
in pytorch today, i.e. quantized and quantized_decomposed

Test Plan:
ci

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88229
Approved by: https://github.com/ezyang
2022-11-03 02:30:02 +00:00
7b8cc063ac Not run inductor test in trunk (#88374)
Trying to not run inductor tests in trunk at the moment because of CUDA issues with the G5 runner:

* CUDA GPU not found https://github.com/pytorch/pytorch/actions/runs/3379516207/jobs/5611539300
* NVIDIA driver installation fails https://github.com/pytorch/pytorch/actions/runs/3379922198/jobs/5612458360
* Docker fails to start https://github.com/pytorch/pytorch/actions/runs/3381276196/jobs/5615513348
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88374
Approved by: https://github.com/desertfire
2022-11-03 02:15:07 +00:00
d979caa87c Added add/mul for nested dense [B, *, D], [B, 1, D] case (CUDA-only) (#88289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88289
Approved by: https://github.com/cpuhrsch
2022-11-03 01:29:25 +00:00
4c20c0509d Split out forward AD tests from test_ops_gradients and reenable slow gradcheck CI (#88216)
Fixes: https://github.com/pytorch/pytorch/issues/88010

This PR does a couple things to stop slow gradcheck from timing out:
- Splits out test_ops_fwd_gradients from test_ops_gradients, and factors out TestFwdGradients and TestBwdGradients which both inherit from TestGradients, now situated in common_utils (maybe there is a better place?)
- Skips CompositeCompliance (and several other test files) for slow gradcheck CI since they do not use gradcheck
- because test times for test_ops_fwd_gradients and test_ops_gradients are either unknown or wrong, we hardcode them for now to prevent them from being put together. We can undo the hack after we see actual test times are updated. ("def calculate_shards" randomly divides tests with unknown test times in a round-robin fashion.)
- Updates references to test_ops_gradients and TestGradients
- Test files that are skipped for slow gradcheck CI are now centrally located in in run_tests.py, this reduces how fine-grained we can be with the skips, so for some skips (one so far) we still use the old skipping mechanism, e.g. for test_mps

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88216
Approved by: https://github.com/albanD
2022-11-03 00:20:45 +00:00
a8561c4571 Revert "[inductor] Handle the case where kwargs contains tensor (#88215)"
This reverts commit 983c0e7f3101f1543bed6c4ec1539a4d590a94c0.

Reverted https://github.com/pytorch/pytorch/pull/88215 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but I think it breaks trunk https://github.com/pytorch/pytorch/actions/runs/3380662072/jobs/5613987333 with a failure in test_torchinductor_opinfo.py
2022-11-02 23:33:15 +00:00
7354368fd5 [LTC] Remove non-native view ops (#88031)
Summary:
LTC somehow implements a bunch of non-native view ops during the transition to functionalization. Let's remove them now that functionalization is final.

Test Plan:
CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88031
Approved by: https://github.com/JackCaoG, https://github.com/antoniojkim
2022-11-02 23:31:26 +00:00
72f3688029 [Pytorch][Vulkan] Update spv generation script to embed shader parameters (#88321)
This diff adds shader parameters such as tile size, weight storage type and
format to the generated spv.cpp file.
These are used in the ShaderInfo struct that ops such as convolution will use to
determine the workgroup size and how to pack weights.

Differential Revision: [D40280337](https://our.internmc.facebook.com/intern/diff/D40280337/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88321
Approved by: https://github.com/jmdetloff, https://github.com/mcr229
2022-11-02 23:28:18 +00:00
6c858e3727 [FSDP][Easy] Remove unneeded TrainingState transition (#88232)
Follow-up from previous PR in the stack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88232
Approved by: https://github.com/mrshenli
2022-11-02 23:25:53 +00:00
73de44fc56 [FSDP] Rename unflat_param_name -> fqn for consistency (#88123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88123
Approved by: https://github.com/mrshenli
2022-11-02 23:25:53 +00:00
f35d5145a1 [FSDP] Simplify _get_buffer_names() (#88122)
This is a follow-up from a previous PR in this stack. The PR simplifies the `_get_buffer_names()` implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88122
Approved by: https://github.com/mrshenli
2022-11-02 23:25:53 +00:00
572a3d2d6e [FSDP] Remove unneeded torch.no_grad() context when offloading to CPU (#88121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88121
Approved by: https://github.com/mrshenli
2022-11-02 23:25:53 +00:00
c87f0501ab [FSDP][Docs] Add note mentioning rate limiter for backward prefetch (#88120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88120
Approved by: https://github.com/mrshenli
2022-11-02 23:25:53 +00:00
32d22edc67 [FSDP()][27/N] Add forward hook registration (#88040)
This PR adds the forward hook registration to composable FSDP and adds a unit test for the runtime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88040
Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma
2022-11-02 23:25:53 +00:00
6fd416650a Add _foreach_addc(div/mul)(_).Tensor (#88157)
Support passing value scalars as a flat 1D Tensor.

Currently we can only pass either an individual scalar or a ScalarList.
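
A hedged sketch of the new overload (values are illustrative; the exact overload resolution is an assumption based on this description):

```python
# Hedged sketch: pass the per-tensor values as one flat 1D tensor instead of a scalar or ScalarList.
import torch

params = [torch.zeros(3), torch.zeros(3)]
t1 = [torch.ones(3), torch.ones(3)]
t2 = [torch.ones(3), torch.ones(3)]
values = torch.tensor([0.5, 2.0])  # one value per tensor in the list

torch._foreach_addcmul_(params, t1, t2, values)  # params[i] += values[i] * t1[i] * t2[i]
```
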
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88157
Approved by: https://github.com/ngimel, https://github.com/albanD
2022-11-02 23:24:35 +00:00
91a51fe9f4 [ONNX] Produce comprehensive assertion errors for quantized outputs (#87242)
Fixes #83038

Currently _compare_ort_pytorch_outputs does not produce clear error messages for differences in the zero point or scale of the two outputs. It also does not produce a clear error message for whether both are quantized.

This pull request adds assertions to output whether the scales and zero points have differences, and whether each individual output is quantized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87242
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2022-11-02 23:07:45 +00:00
ca2dc8b4e7 [1/n] Thread PG: fix pyre error of class ProcessGroup (#88281)
Summary: Fix the typing stub of `ProcessGroup` in "torch/distributed/__init__.py", so that it won't confuse pyre, and we can remove a lot of pyre suppression comments.

Test Plan: pyre check

Differential Revision: D40921667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88281
Approved by: https://github.com/wanchaol
2022-11-02 23:02:08 +00:00
d1ba4c3a6d Update Reviewers for CPU-related Modules (#87591)
This PR updates the reviewers responsible for CPU related modules: "IDEEP", "oneDNN graph", "CPU ATen backend", "CPU frontend" and "Autocast". It also adds "NNC" and adds the corresponding reviewers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87591
Approved by: https://github.com/malfet
2022-11-02 22:57:07 +00:00
b325c3fc25 [nvFuser] patches profiling on scalar arguments for std/var (#88165)
Fixes #86531

Added profiling on scalar values for aten::std & aten::var.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88165
Approved by: https://github.com/kevinstephano
2022-11-02 22:47:34 +00:00
bf7c996dcb Revert "torchdynamo support modules() for nn_module (#88023)"
This reverts commit eb91e8a534f94127a6d744543f2080a44bca9e57.

Reverted https://github.com/pytorch/pytorch/pull/88023 on behalf of https://github.com/mehtanirav due to [Internal breakages](https://www.internalfb.com/intern/sandcastle/job/13510799692855066/insights)
2022-11-02 22:35:14 +00:00
7dfa75546c Print only the driver version from the first GPU (#88364)
For example, distributed test has more than one of them:

```
nvidia-smi --query-gpu=driver_version --format=csv,noheader
515.57
515.57
```

while `--id=0` correctly prints:

```
nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
515.57
```

This is to avoid re-install the same driver as in https://github.com/pytorch/pytorch/actions/runs/3380662072/jobs/5613981088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88364
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi
2022-11-02 21:59:54 +00:00
943b20e7ae Use tensor cores for NT bmm (#86856)
Copy of internal diff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86856
Approved by: https://github.com/drisspg
2022-11-02 21:51:40 +00:00
1c0d47cb17 [PyTorch] Make c10::irange(x) generate the same assembly as for loop (#86841)
`c10::irange(n)` generated an extra `sar` and `andn` instruction compared to a traditional `for` loop. now it doesn't.

Differential Revision: [D40321009](https://our.internmc.facebook.com/intern/diff/D40321009/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86841
Approved by: https://github.com/r-barnes, https://github.com/malfet
2022-11-02 21:34:22 +00:00
ef4ce6d4c6 Add [[noreturn]] attribute to operator() in DispatchKeyExtractor.h (#88333)
Originally D40537408. Submitting this through the diff train workflow to
get it merged faster.

Test Plan:
- Build PyTorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88333
Approved by: https://github.com/ezyang
2022-11-02 21:32:07 +00:00
983c0e7f31 [inductor] Handle the case where kwargs contains tensor (#88215)
Summary: Fix https://github.com/pytorch/torchdynamo/issues/1805;
currently inductor does not allow any tensor in kwargs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88215
Approved by: https://github.com/ngimel
2022-11-02 19:50:16 +00:00
98f09c9ab3 [WIP] Add symnode magic method testing (#88119)
There are failures that need to be addressed before landing:
- Some issue with handling of booleans.
- Most functions return wrong result when mixing int/float

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88119
Approved by: https://github.com/ezyang
2022-11-02 19:41:09 +00:00
99c07735e4 Revert "Add support for neg to NestedTensor (#88131)"
This reverts commit 6a75a0d1a197e378ebbf1f73f5ab93ce79cb873a.

Reverted https://github.com/pytorch/pytorch/pull/88131 on behalf of https://github.com/mehtanirav due to [Internal breakages](https://www.internalfb.com/intern/sandcastle/job/13510799692239080/insights)
2022-11-02 18:43:36 +00:00
0fa23663cc Revert "Introduce TORCH_DISABLE_GPU_ASSERTS (#84190)"
This reverts commit 1e2c4a6e0e60dda763b53f00f25ee5c1f1e5233d.

Reverted https://github.com/pytorch/pytorch/pull/84190 on behalf of https://github.com/malfet due to Needs internal changes, has to be landed via co-dev
2022-11-02 18:13:37 +00:00
4a84d69f50 [functorch.dims] Fix corner cases with permute (#88226)
Previously the permute function was extended to behave like the `order`
function for first-class dimensions. However, unlike `permute`,
`order` doesn't have a keyword argument `dims`, and there is no way to add
it in a way that allows both permute and order to continue to have the same
behavior. So this change just removes the extra functionality of permute,
which wasn't documented anyway. Fixes #88187
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88226
Approved by: https://github.com/zou3519
2022-11-02 17:55:43 +00:00
84a302e534 Remove wrong internal assert in handle_view_on_rebase (#88243)
Fixes: https://github.com/pytorch/pytorch/issues/88205

The `CreationMeta::NO_GRAD_MODE` path in handle_view_on_rebase wrongly assumes that the tensor would be a leaf, because tensors created in no_grad are always leaf tensors. However, due to creation_meta propagation, a view of a view created in no_grad also has `CreationMeta::NO_GRAD_MODE`, but DOES have grad_fn.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88243
Approved by: https://github.com/albanD
2022-11-02 17:50:16 +00:00
30dc6cee3a [FSDP()][26/N] Move _lazy_init() into _fsdp_root_pre_forward() (#87941)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87941
Approved by: https://github.com/mrshenli
2022-11-02 17:45:08 +00:00
1e2c4a6e0e Introduce TORCH_DISABLE_GPU_ASSERTS (#84190)
- Asserts for CUDA are enabled by default
- Disabled for ROCm by default by setting `TORCH_DISABLE_GPU_ASSERTS` to `ON`
- Can be enabled for ROCm by setting above variable to`OFF` during build or can be forcefully enabled by setting `ROCM_FORCE_ENABLE_GPU_ASSERTS:BOOL=ON`

This is follow up changes as per comment in PR #81790, comment [link](https://github.com/pytorch/pytorch/pull/81790#issuecomment-1215929021)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84190
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2022-11-02 17:41:57 +00:00
b18d0f1dc9 Add more debug information when installing NVIDIA driver (#88168)
This calls `lspci`, `lsmod`, and `modinfo nvidia` before and after the installation to gather more data about the "No GPU available" transient issue on G5 runner, i.e. 59fe272c1e

This also handles `nvidia-smi` call and tries to re-install the driver if the first call fails, i.e. `No devices were found` 8ea19c802e
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88168
Approved by: https://github.com/clee2000, https://github.com/malfet
2022-11-02 17:39:07 +00:00
923a5e9685 [dynamo] Error when user nests FX with dynamo (#87797)
Today, this doesn't work and dynamo errors out in a very non-obvious way (see:
https://gist.github.com/suo/dde04830372ab51a4a34ea760f14200a).

Here, we detect the error early and exit with a nicer message. Also add a
config option to just no-op dynamo (which is needed to unblock internal
enablement).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87797
Approved by: https://github.com/yf225, https://github.com/soumith, https://github.com/jansel
2022-11-02 17:38:56 +00:00
c503398828 Ignore macos usage log upload artifact failure (#88288)
I'm not quite sure why GitHub starts to get flaky when we are trying to upload usage_log.txt to it (500 Internal Server Error). But we can live without it, so let's just ignore this for now and follow up on it later.

The failures all come from M1 runner, so it seems to point to a connectivity issue between AWS and GitHub:

* https://github.com/pytorch/pytorch/actions/runs/3373976793/jobs/5599310905
* https://github.com/pytorch/pytorch/actions/runs/3372858660/jobs/5597033598
* https://github.com/pytorch/pytorch/actions/runs/3371548201/jobs/5594274444
* https://github.com/pytorch/pytorch/actions/runs/3370877990/jobs/5592709210
* https://github.com/pytorch/pytorch/actions/runs/3370609384/jobs/5592008430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88288
Approved by: https://github.com/clee2000
2022-11-02 17:27:30 +00:00
5b882a34c4 Consolidate macos pip dependencies (#88071)
After conda, this consolidates all macOS pip dependencies to cache every dependency that macOS CI needs. Two small issues were found along the way in the `_mac-test-mps` workflow:

* It didn't have `Install macOS homebrew dependencies` to install libomp like the regular `_mac-test` workflow
* It didn't install `scipy`, thus silently skipping some `signal.windows` tests

Both are fixed in this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88071
Approved by: https://github.com/malfet
2022-11-02 17:22:01 +00:00
f132c171ac [FSDP()][25/N] Add _post_forward_reshard() (#87940)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87940
Approved by: https://github.com/mrshenli
2022-11-02 17:16:30 +00:00
5b75b19f51 Revert "Do not use unsafe restriding for subclasses (#87610)"
This reverts commit 73379acaf3865379aed0a1bab1320616772152f3.

Reverted https://github.com/pytorch/pytorch/pull/87610 on behalf of https://github.com/mehtanirav due to [Internal breakages](https://www.internalfb.com/intern/sandcastle/job/36028797828925790/insights)
2022-11-02 16:59:02 +00:00
c00c34fb69 Fix meta for aten.upsample_bilinear2d.vec (#88158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88158
Approved by: https://github.com/ngimel
2022-11-02 16:58:29 +00:00
71fb763e54 Revert "fix as_strided_scatter_backward (#87646)"
This reverts commit f9d7985851f49c3b44383dae50cd77632e7e2245.

Reverted https://github.com/pytorch/pytorch/pull/87646 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but I think this one or one of the PR in the stack break bionic-cuda11.7 on trunk 70782981f0
2022-11-02 16:54:36 +00:00
bf2819a836 [FSDP()][24/N] Refactor _lazy_init() (#87939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87939
Approved by: https://github.com/zhaojuanmao
2022-11-02 16:35:47 +00:00
bd5b4e6504 [Easy] Unused var in functional_adam (#88292)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88292
Approved by: https://github.com/awgu
2022-11-02 16:31:16 +00:00
7382c88df2 [BE][MPS] Do not use malloc/free in 2022 (#88307)
Use `std::vector` to store tensor shapes and automatically free them when the array goes out of scope

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88307
Approved by: https://github.com/kulinseth
2022-11-02 16:27:43 +00:00
4e6f5f22fd Run asan's shard 4 on linux.4xlarge (#88310)
In an attempt to mitigate OOMs, see https://github.com/pytorch/pytorch/issues/88309

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88310
Approved by: https://github.com/albanD
2022-11-02 16:26:11 +00:00
3d90788a58 [ONNX] Add 0d-tensor test case in runtime check (#87212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87212
Approved by: https://github.com/BowenBao
2022-11-02 16:04:21 +00:00
2aed670710 Fix ONNX operator_export_type on the new registry (#87735)
Fixes #87313

Our ONNX pipelines do not run with BUILD_CAFFE2=0, so tests for operator_export_type ONNX_ATEN and ONNX_ATEN_FALLBACK will not be fully tested, allowing regressions to happen again.

We need to run the same set of tests for both BUILD_CAFFE2=0 and 1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87735
Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao
2022-11-02 15:54:40 +00:00
b2679dc61c Remove Krovatkin from dynamic shapes auto request review (#88315)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88315
Approved by: https://github.com/soumith
2022-11-02 15:05:49 +00:00
dcbcf5b90e [profiler] Expose experimental performance events to python (#87905)
Reports total counts (includes time spent in all children); self counts can be calculated manually.

Differential Revision: [D40282770](https://our.internmc.facebook.com/intern/diff/D40282770/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87905
Approved by: https://github.com/SS-JIA
2022-11-02 14:54:15 +00:00
47a542dc06 Nested profiling support for Linux-perf Profiler (#87904)
Add a stack of start counter values, and attribute each disable to the last enable

Differential Revision: [D40539212](https://our.internmc.facebook.com/intern/diff/D40539212/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87904
Approved by: https://github.com/SS-JIA
2022-11-02 14:51:53 +00:00
ebdaeaaa8c [edge profiler] Add e2e test for profiler event and chrometrace (#87877)
* Runs an existing model and checks an aten op if it gets perf events generated in the chrometrace
* Doesn't check for exact values since that's harder to do in a hardware independent way

Differential Revision: [D40474957](https://our.internmc.facebook.com/intern/diff/D40474957/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87877
Approved by: https://github.com/SS-JIA
2022-11-02 14:49:54 +00:00
03346296db [edge profiler] Add support for performance events counting (#87876)
* Add support in lite_predictor benchmark binary to select event lists
* Uses Linux perf through Kineto profiler

Differential Revision: [D39837216](https://our.internmc.facebook.com/intern/diff/D39837216/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39837216/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87876
Approved by: https://github.com/SS-JIA
2022-11-02 14:47:44 +00:00
bc1e9a07a3 [profiler] Add Performance events support in Kineto profiler (#87874)
* Wiring to allow user to pass event names to profiler and reflect the count to the chrometrace
* If not used, the runtime and size overhead should be negligible
* For now, primary user will be KinetoEdgeCPUProfiler but the impl does not assume that
* Not exposed to python yet

Differential Revision: [D40238032](https://our.internmc.facebook.com/intern/diff/D40238032/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40238032/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87874
Approved by: https://github.com/SS-JIA
2022-11-02 14:43:17 +00:00
70782981f0 aot_dispatch test fix: always use functionalization in symbolic tests (#87647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87647
Approved by: https://github.com/ezyang, https://github.com/Chillee
2022-11-02 14:36:49 +00:00
f9d7985851 fix as_strided_scatter_backward (#87646)
as_strided_scatter's derivative formula was broken - instead of making a "mask" of 1's and 0's, it would effectively make a mask of 1's and uninitialized memory.

Fixes https://github.com/pytorch/pytorch/issues/88105
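
A minimal sketch, assuming this simplified masking formulation (not the actual derivative code), of why initialization matters here: the mask marking the elements that `src` overwrote has to start from zeros/ones, not from uninitialized memory.
```python
import torch

# Hedged sketch: build a 0/1 mask of the positions src overwrote and use it to
# zero those positions in the gradient w.r.t. self. Starting from torch.empty
# instead of torch.zeros/ones is the "1's and uninitialized memory" bug above.
def as_strided_scatter_self_grad(grad, size, stride, storage_offset=0):
    ones = torch.ones(size, dtype=grad.dtype, device=grad.device)
    mask = torch.as_strided_scatter(torch.zeros_like(grad), ones, size, stride, storage_offset)
    return grad * (1 - mask)  # keep gradient only where self was not overwritten
```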

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87646
Approved by: https://github.com/albanD
2022-11-02 14:36:49 +00:00
b5a925ff2e propagate .meta info when replacing subgraphs in fx (#87255)
Fixes https://github.com/pytorch/torchdynamo/issues/1708

Our FX subgraph partitioner works by taking all of the original output nodes from a subgraph, and replacing it with a new `call_module` node in the graph.

If the original subgraph outputs had fake tensors and other metadata stored in their `.meta` attribute though, then this information was getting lost when we spliced in the subgraph.

Losing metadata on an FX graph also seems like an easy trap to fall into, so I'm wondering if there are any better guardrails that we can add. I ended up fixing it in this PR by adding an optional kwarg to propagate meta info directly in `fx.Node.replace_all_uses_with`, just because propagating metadata seems like a pretty core thing.
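
A minimal sketch of the idea: when splicing a replacement node into an FX graph, carry the old node's `.meta` along. The `propagate_meta` kwarg name is taken from this PR's description and should be treated as an assumption, not settled API.
```python
import torch
import torch.fx as fx

gm = fx.symbolic_trace(lambda x: x * 2)
old = next(n for n in gm.graph.nodes if n.op == "call_function")
old.meta["note"] = "fake tensor / shape info would live here"

with gm.graph.inserting_after(old):
    new = gm.graph.call_function(torch.mul, args=old.args, kwargs=old.kwargs)
old.replace_all_uses_with(new, propagate_meta=True)  # copies old.meta onto new
gm.graph.erase_node(old)
gm.recompile()
print(new.meta["note"])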

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87255
Approved by: https://github.com/wconstab, https://github.com/SherlockNoMad
2022-11-02 14:36:46 +00:00
5669e10d37 remove assert_allclose from torch.testing (#87974)
See #87969 or #86586 for the reasoning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87974
Approved by: https://github.com/mruberry
2022-11-02 14:05:01 +00:00
b9c617838a remove make_non_contiguous from torch.testing (#87973)
See #87969 or #86586 for the reasoning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87973
Approved by: https://github.com/mruberry
2022-11-02 14:05:01 +00:00
8893c6cd07 remove deprecated dtype getters from torch.testing (#87972)
See #87969 or #86586 for the reasoning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87972
Approved by: https://github.com/mruberry
2022-11-02 14:04:58 +00:00
a360be50b5 remove deprecated device getter from torch.testing (#87971)
See #87969 or #86586 for the reasoning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87971
Approved by: https://github.com/mruberry
2022-11-02 14:04:54 +00:00
554cdc9a63 remove deprecated rand and randn from torch.testing (#87970)
See #87969 or #86586 for the reasoning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87970
Approved by: https://github.com/mruberry
2022-11-02 14:04:51 +00:00
bc73affdad prepare removal of deprecated functionality in torch.testing (#87969)
_Redo of #86586 with all BC breaking changes granularly placed into separate commits._

---

Per title. Deprecation happened on Feb 25, 2022 in c6f1bbc0ac33be0c8ad9956e3fc15e78ddb6cb95, which made it into the 1.12 release. Since it is now 245 days later and the next release will be 1.14, the removals later in the stack comply with the [BC policy](https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#minimizing-the-disruption-of-bc-breaking-changes).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87969
Approved by: https://github.com/mruberry
2022-11-02 14:04:48 +00:00
0fc7de3986 [profiler] Add Linux Perf support (#87866)
* Add support to use Linux kernel perf subsystem via the profiler.
* For now the perf configurability is quite limited to just event names. Threading etc. to come later.
* Given that we want to support a variety of different CPU types, the list of additional events (beyond the standard set) is also limited.
* Rather than failing with unsupported feature for non-Linux platforms, it returns zeros for all the event counts.
* For now, the max event count is capped at 4, and time multiplexing is not allowed.
* The threadpool recreate hack is restricted to mobile only - we need to add better support for threading in general

Differential Revision: [D40238033](https://our.internmc.facebook.com/intern/diff/D40238033/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40238033/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87866
Approved by: https://github.com/SS-JIA
2022-11-02 13:42:24 +00:00
d6b58d6924 [FSDP()][23/N] Refactor handle attr initialization (#87938)
**`_init_param_attributes()` -> `init_flat_param_attributes()`**
We move `_init_param_attributes()` to `FlatParamHandle.init_flat_param_attributes()` (as already marked as to-do during previous refactoring).

**`_reset_lazy_init()`**
We no longer delete `_local_shard` from each `FlatParameter` in `_reset_lazy_init()`.

**Analysis**
Thus, the two semantic differences are that we remove the initial `if hasattr(p, "_local_shard")` early return in `_init_param_attributes()` and the `delattr(p, "_local_shard")` in `_reset_lazy_init()`.

This is safe because
- If we never call `_reset_lazy_init()`, then `init_flat_param_attributes()` is only called once. There is no opportunity for an early return.
- If we call `_reset_lazy_init()`, then `init_flat_param_attributes()` will be called again in the next `_lazy_init()`. However, since we removed the early return, all of the attributes initialized in `init_flat_param_attributes()` simply get re-initialized and override any existing attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87938
Approved by: https://github.com/mrshenli
2022-11-02 11:32:56 +00:00
d172dcf316 [FSDP()][21/N] Refactor and fix _cast_buffers() (#87935)
This PR refactors and fixes `_cast_buffers()`.

**Before**
Buffers were not correctly cast back to their original dtypes for submodules when using buffer mixed precision.
- `_cast_buffers(recurse=False)` incorrectly casts all buffers, including those in submodules. This is because of this outer loop over `self.modules()`:
c40033be16/torch/distributed/fsdp/fully_sharded_data_parallel.py (L700)
- There was a unit test that checked that buffers were cast as expected (`test_mixed_precision_e2e_full_shard()`). The unit test _coincidentally_ passed because all modules shared the same buffer name `"buffer"`. In `_cast_buffers()`, the `dict` mapping buffer name to original dtype is populated lazily (during `_lazy_init()`). However, the keys are unprefixed:
c40033be16/torch/distributed/fsdp/fully_sharded_data_parallel.py (L712-L717)
- Thus, even though (1) `_cast_buffers(recurse=False)` was only called on the root and (2) `self._buffer_name_to_orig_dtype` had unprefixed names as keys, the unit test still passed because (1) `_cast_buffers()` still looped over all buffers despite `recurse=False` and (2) all submodules' buffers were named `"buffer"` and had the same original and low-precision dtypes and hence were cast correctly.

If we change each submodule to have its own distinct buffer name, then the unit test fails. This PR makes such a change to showcase the progression granted by this PR.
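
A tiny illustration (with an assumed, simplified module structure) of the name collision that let the old test pass: every submodule registers a buffer literally named `"buffer"`, so a dict keyed on unprefixed names only ever holds one entry.
```python
import torch
import torch.nn as nn

class Sub(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("buffer", torch.zeros(1))

root = nn.Sequential(Sub(), Sub())
# Prefixed names are distinct, but the unprefixed keys collapse to one entry.
print([name for name, _ in root.named_buffers()])                 # ['0.buffer', '1.buffer']
print({name.split(".")[-1] for name, _ in root.named_buffers()})  # {'buffer'}
```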

**After**
This PR separates `_cast_buffers()` into three methods: `_get_buffers_and_dtypes_for_computation()`, `_get_buffers_and_dtypes_for_checkpoint()`, and `_cast_buffers_to_dtype_and_device()`. This is to separate the different use cases (casting for computation and casting for checkpointing) and the corresponding code paths. Plus, the signature for `_cast_buffers_to_dtype_and_device()` makes it clear exactly what buffers are being cast and to what dtype.

Both `_get_...()` functions assume that they are called on the root only for now. This coincides with the construction of `_buffer_name_to_orig_dtype` in the FSDP constructor, which loops over all submodules. (This means that for non-root modules, their `_buffer_name_to_orig_dtype` is populated but not used.) The `dict`'s keys are clean since the buffer cast to original dtype happens in a `summon_full_params()` context, which cleans the names.

**Follow-Ups**
- We can try to move `_get_buffers_and_dtypes_for_checkpoint()` into `_state_dict_utils.py` in a follow-up.
- We may want to move to per-module buffer casting (i.e. do not have the root module cast for all submodules).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87935
Approved by: https://github.com/mrshenli
2022-11-02 11:32:56 +00:00
b0b1e78e2d [FSDP] Rename dtype to buffer_name_to_dtype (#87934)
This PR is easy and only a rename. `dtype` does not convey that it is actually a `Dict[str, torch.dtype]` (when not `None`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87934
Approved by: https://github.com/mrshenli
2022-11-02 11:32:53 +00:00
d14fc0bc36 [FSDP] Remove device arg from _cast_buffers() (#87933)
This PR is easy. The `device` argument in `_cast_buffers()` is never used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87933
Approved by: https://github.com/mrshenli
2022-11-02 11:32:50 +00:00
19c7df89fb [FSDP()][20/N][Easy] Move functions in file (#87932)
This PR is easy. I just wanted to group functions in the file according to the same logical order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87932
Approved by: https://github.com/mrshenli
2022-11-02 11:32:48 +00:00
4635f56da1 [FSDP()][18/N] Refactor pre_forward_unshard() (#87931)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87931
Approved by: https://github.com/mrshenli
2022-11-02 11:32:45 +00:00
0a752688bd [FSDP()][17/N] Refactor _fsdp_root_pre_forward() (#87930)
This PR moves `_fsdp_root_pre_forward()` to `_runtime_utils.py`.

Note: This PR includes a (temporary) fix for `NO_SHARD` + `CPUOffload(offload_params=True)`, where we set `non_blocking=False` when copying the gradient from device to host. It is only included in this PR since the test was **flaky** (but not consistently failing) on this PR, so I needed the fix to unblock landing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87930
Approved by: https://github.com/mrshenli
2022-11-02 11:32:42 +00:00
39d9d2ed70 Implement reference for lerp (#87424)
We follow the vectorised CPU implementation for numerical accuracy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87424
Approved by: https://github.com/ezyang
2022-11-02 11:21:01 +00:00
6b5d7fccc6 Add a basic test for "nvprims_nvfuser" Dynamo backend (#88186)
Ref. https://github.com/pytorch/pytorch/pull/87797#issuecomment-1297635210

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88186
Approved by: https://github.com/ezyang
2022-11-02 11:11:28 +00:00
9ebb8d5232 Add ops.broadcast for nvFuser (#88080)
Having nvFuser's `broadcast` available alongside `broadcast_in_dim` would allow easier experimentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88080
Approved by: https://github.com/jjsjann123, https://github.com/kevinstephano, https://github.com/mruberry
2022-11-02 10:05:12 +00:00
2ddefbdc3c Fix typos used in documents under torch directory (#88300)
This PR fixes typos in comments of Python files that were found via the search box at https://pytorch.org/docs/master/search.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88300
Approved by: https://github.com/lezcano
2022-11-02 09:38:13 +00:00
4a8382b58e Update caching of tensor arguments for nvFuser's fusion creation (#87860)
Previously, nvFuser's fusion definition was cached based on the concrete shapes and strides of tensor inputs for simplicity and correctness. This PR changes the Python-side cache to check the number of dimensions, size-1 dimensions, and contiguity information derived from the given strides and shapes.
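
A hedged sketch of the kind of cache key this describes; the helper name and exact fields are illustrative, not nvFuser's actual implementation.
```python
import torch

# Illustrative only: key on rank, size-1 dimensions, and contiguity rather than
# on concrete shapes/strides, per the description above.
def fusion_cache_key(t: torch.Tensor):
    size_one_dims = tuple(s == 1 for s in t.shape)
    return (t.dim(), size_one_dims, t.is_contiguous(), t.dtype, t.device.type)

a = torch.randn(4, 1, 8)
b = torch.randn(6, 1, 3)
print(fusion_cache_key(a) == fusion_cache_key(b))  # True: same rank / size-1 dims / contiguity
```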

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87860
Approved by: https://github.com/kevinstephano, https://github.com/jjsjann123, https://github.com/ngimel
2022-11-02 09:29:20 +00:00
ccf6b558a4 [Dynamo] UserFunctionVariable supports type & ABCMeta as arguments (#88257)
Fixes https://github.com/pytorch/torchdynamo/issues/1785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88257
Approved by: https://github.com/ezyang
2022-11-02 06:58:04 +00:00
e763b7abeb [complex] conv_transpose3d : complex support (#87967)
Reference: https://github.com/pytorch/pytorch/issues/71108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87967
Approved by: https://github.com/anjali411
2022-11-02 06:37:33 +00:00
7674af9ce7 [vision hash update] update the pinned vision hash (#88162)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88162
Approved by: https://github.com/pytorchbot
2022-11-02 05:22:40 +00:00
4ab5d79b28 [inductor] Updated some triton.libdevice calls (#88242)
triton master no longer requires the `d` or `f` suffix on some libdevice
function calls - it dispatches to the right library call based on the
argument type.

triton pin updated to
f16138d447

Also removed some xfails for some unrelated tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88242
Approved by: https://github.com/ngimel
2022-11-02 04:58:43 +00:00
a51da28551 Support multi-gpu CI for inductor-distributed (#87996)
This test by itself isn't the end goal, but it is a minimal test that exercises multi-GPU, and the focus of the PR is the infra behind enabling that. I'll follow up with more tests using actual models, etc.

and @malfet @desertfire for awareness/feedback on the infra side
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87996
Approved by: https://github.com/aazzolini
2022-11-02 03:52:20 +00:00
95fc0bcaad Disable torchdynamo in backwards compiler harder (#88132)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88132
Approved by: https://github.com/bertmaher, https://github.com/malfet
2022-11-02 02:16:35 +00:00
3c6bddc3f6 [cuDNN] (re-open) Enable cuDNN Frontend v8 API by Default (#87669)
#58414

Has a small tweak to a test that was breaking on A10 (CC @malfet).

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87669
Approved by: https://github.com/ngimel
2022-11-02 01:36:37 +00:00
dfa9475755 Check SM version before calling flash attention with BFloat16 (#86600)
The flash attention code path requires sm80 or newer to run on
BFloat16, so any OpInfo tests running with BFloat16 would fail with
the error:
```
RuntimeError: Expected q_dtype == at::kHalf || (is_sm8x && q_dtype == at::kBFloat16) to be true, but got false.
```
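
A small sketch of the gating this implies on the caller side (assumed usage, not the internal kernel check): only feed BFloat16 to the flash-attention path on sm80+ GPUs.
```python
import torch

# Hedged sketch: pick a dtype for attention inputs based on compute capability,
# mirroring the sm80 requirement described above; pre-sm80 falls back to fp16.
def attention_dtype():
    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
        return torch.bfloat16
    return torch.float16

print(attention_dtype())
```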
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86600
Approved by: https://github.com/ngimel
2022-11-02 00:52:30 +00:00
bc9caafc78 record_function: update to use custom_class API (#76420)
Re-submit of gh-72302

This still has a small performance hit, but it is much smaller. On my
machine I see `_record_function_exit._RecordFunction` take 1.05 us
compared to the `Tensor` overload taking 0.79 us.

In an overall comparison, I see a 0.7 us slowdown from 6.0 us to
6.7 us for this timeit benchmark
```python
import torch

def foo():
  with torch.profiler.record_function("foo"):
    return torch.eye(3)

%timeit foo()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76420
Approved by: https://github.com/robieta
2022-11-02 00:39:28 +00:00
0131a66ab6 Fix typos under torch directory (#88172)
This PR fixes typos in '.md' files under torch directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88172
Approved by: https://github.com/malfet
2022-11-01 22:58:22 +00:00
72958b9665 [Dynamo] Update Dynamo benchmarks running commands (#87844)
Fixes https://github.com/pytorch/torchdynamo/issues/1761

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87844
Approved by: https://github.com/jansel
2022-11-01 22:45:13 +00:00
a56beb2a82 [nvfuser] merge rule update (#88228)
adding Kevin to NVFuser reviewer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88228
Approved by: https://github.com/soumith
2022-11-01 22:43:54 +00:00
fb1586fbcb Make a copy of the submodule inputs (#87899)
Summary: There might be in-place ops in the model that would change the saved inputs. To avoid that, we save a deepcopied version.

Test Plan: CI

Differential Revision: D40771290

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87899
Approved by: https://github.com/houseroad
2022-11-01 22:42:04 +00:00
73492645cf Copy DDP code to be reused in composable API (#87836)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87836
Approved by: https://github.com/mrshenli
2022-11-01 22:25:10 +00:00
b2dfd20260 Remove BSC conversion skip from TestSparseCompressed.test_consistency (#88152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88152
Approved by: https://github.com/cpuhrsch
2022-11-01 22:18:56 +00:00
d044b4cc58 Update torch.abs and torch.positive opinfos to reflect sparse support (#88151)
cc @nikitaved @pearu @cpuhrsch @bhosmer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88151
Approved by: https://github.com/cpuhrsch
2022-11-01 22:18:56 +00:00
ffd54def8f [GHF] Remove CC line from commit message (#88252)
This line is added by autoCCBot, but it is not really meaningful as a commit
message

Test Plan:
```
>>> from trymerge import GitHubPR, RE_PR_CC_LINE
>>> import re
>>> pr=GitHubPR("pytorch", "pytorch", 87809)
>>> re.sub(RE_PR_CC_LINE, "", pr.get_body())
'Fixes #ISSUE_NUMBER\r\n\n\n'
>>> pr=GitHubPR("pytorch", "pytorch", 87913)
>>> re.sub(RE_PR_CC_LINE, "", pr.get_body())
'Parallel compilation warms the Threadpool when we call `torch._dynamo.optimize()`. In current benchmarks, we were setting up the TRITON_CACHE_DIR much later. Because of this parallel compilation artifacts were not used and compilation latency improvements were not visible in dashboard. This PR just prepones the setup of TRITON_CACHE_DIR.\n\n'
>>> pr=GitHubPR("pytorch", "pytorch", 85692)
>>> re.sub(RE_PR_CC_LINE, "", pr.get_body())
'This PR sets CUDA_MODULE_LOADING if it\'s not set by the user. By default, it sets it to "LAZY".\r\n\r\nIt was tested using the following commands:\r\n```\r\npython -c "import torch; tensor=torch.randn(20, 16, 50, 100).cuda(); free, total = torch.cuda.cudart().cudaMemGetInfo(0); print(total-free)"\r\n```\r\nwhich shows a memory usage of: 287,047,680 bytes\r\n\r\nvs\r\n\r\n```\r\nCUDA_MODULE_LOADING="DEFAULT" python -c "import torch; tensor=torch.randn(20, 16, 50, 100).cuda(); free, total = torch.cuda.cudart().cudaMemGetInfo(0); print(total-free)"\r\n```\r\nwhich shows 666,632,192 bytes. \r\n\r\nC++ implementation is needed for the libtorch users (otherwise it could have been a pure python functionality).\r\n\r\n'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88252
Approved by: https://github.com/xuzhao9, https://github.com/izaitsevfb
2022-11-01 22:17:12 +00:00
ba643b4ddf feature: adding batch support for narrow_copy operator (#88130)
Implements batching support (https://github.com/pytorch/functorch/issues/825) for narrow_copy.

narrow_copy was already added as an OpInfo.

cc @zou3519 @Chillee @samdow @soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88130
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2022-11-01 21:42:51 +00:00
c40033be16 [Vulkan][TCC] Implement tests for cat_batch, cat_width and normalize_dim (#87633)
Summary:
Implement Vulkan tests for these untested functions in Concat.cpp:
 - cat_batch
 - cat_width
 - normalize_dim

Test Plan:
```cd ~/fbsource
buck run //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64
```

Differential Revision: D40605571

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87633
Approved by: https://github.com/salilsdesai, https://github.com/kirklandsign, https://github.com/SS-JIA
2022-11-01 21:01:31 +00:00
e6ea0a4a4b Don't Require contiguous For Extern Kernels (#87650)
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87650
Approved by: https://github.com/desertfire
2022-11-01 20:20:42 +00:00
8ef9bda1bf Fix nvFuser Fusion Definition printing of Squeeze and Permute (#88041)
NM

cc @jjsjann123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88041
Approved by: https://github.com/IvanYashchuk, https://github.com/jjsjann123, https://github.com/mruberry
2022-11-01 19:02:40 +00:00
68f9f256a3 [reland][fx][subgraph_rewriter] Change match_filter to be a List in replace_pattern_with_filters (#87998)
Summary:
att, this is an experimental API so it is not marked as BC-breaking.
The match will be accepted only if all the filters in the list pass.
Changing the filter arg to be a list also allows us to pass in an empty list, which means no filter and makes user code cleaner.

Test Plan:
python test/test_fx.py -k test_replace_pattern_with_filters

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D40810943](https://our.internmc.facebook.com/intern/diff/D40810943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87998
Approved by: https://github.com/SherlockNoMad
2022-11-01 18:48:14 +00:00
2c7de4a144 Add meta implementation for aten.max.dim (#88005)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88005
Approved by: https://github.com/Chillee, https://github.com/bdhirsh
2022-11-01 18:37:24 +00:00
97b3eeac90 remove assert on tensor inputs to FusionGroup (#88018)
Fixes #86530 #86227 #85872
All issues seem to be duplicates of each other.

Removes the false positive assert

Fixes come from @kevinstephano
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88018
Approved by: https://github.com/kevinstephano, https://github.com/soumith
2022-11-01 18:07:17 +00:00
e1c123d29a Add UBSAN to ASAN (#88055)
Add the undefined behavior sanitizer to the `USE_ASAN` option.
Added `torch._C._crash_if_vptr_ubsan()`, which only fails if the vptr belongs to the wrong class after a typecast.
Deleted all UBSAN suppressions, but disabled `ProtoTest::Basic` as it fails the above-mentioned vptr check.

Fixes https://github.com/pytorch/pytorch/issues/88042
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88055
Approved by: https://github.com/ezyang
2022-11-01 17:59:35 +00:00
81f74eed75 [11/N] Update all_to_all with CPU/CUDA implementations (#86407)
* #83916 [7/N] [Dispatchable Collectives] Update reduce with CPU / CUDA implementations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86407
Approved by: https://github.com/kwen2501
2022-11-01 17:54:13 +00:00
90fa25705c Rename 'nvfuser' to 'ts_nvfuser' indicating TorchScript usage (#88188)
cc @kevinstephano @jjsjann123 @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88188
Approved by: https://github.com/soumith, https://github.com/jansel
2022-11-01 17:46:55 +00:00
bed8102741 [10/N] Update barrier with CPU/CUDA implementations (#86368)
### Changes
- Updates for the barrier collective
- NOTE: current change will not achieve dispatching of barrier since there is no tensor to read from

### Context
https://github.com/pytorch/pytorch/issues/86225

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @kwen2501 @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86368
Approved by: https://github.com/kwen2501
2022-11-01 17:41:01 +00:00
1f34067e9d [FSDP()][16/N] Refactor post-forward/pre-backward (#87929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87929
Approved by: https://github.com/mrshenli
2022-11-01 17:26:03 +00:00
5a53f024e4 [FSDP()][15/N] Refactor _init_streams() (#87928)
This PR is easy. I think I move `_init_streams()` again in a later PR though :/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87928
Approved by: https://github.com/mrshenli
2022-11-01 17:26:03 +00:00
90c5f856b2 [FSDP()][14/N] Refactor pre-forward/post-backward (#87927)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87927
Approved by: https://github.com/mrshenli
2022-11-01 17:25:59 +00:00
eb91e8a534 torchdynamo support modules() for nn_module (#88023)
Differential Revision: D40820879

This diff allows models to call self.modules() during dynamo tracing.

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88023
Approved by: https://github.com/tugsbayasgalan, https://github.com/voznesenskym, https://github.com/jansel
2022-11-01 17:10:45 +00:00
de1f641f11 Fix meta function for aten.addmm (#88068)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88068
Approved by: https://github.com/albanD
2022-11-01 17:05:48 +00:00
fdc419786d Add unit test for torch_geometric library (#85937)
Fixes #65138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85937
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2022-11-01 16:43:58 +00:00
5c3666cb81 [codev] Make backport work with flatbuffer models (#88127)
Summary: By adding flatbuffer as dependency of backport.

Differential Revision: D40865452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88127
Approved by: https://github.com/cccclai
2022-11-01 16:11:30 +00:00
bb7e6254e4 Add ability to freeze storages inside functionalization (#88141)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88141
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-11-01 16:00:33 +00:00
61f955dd83 Inline Alias into FunctionalStorageImpl (#88140)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88140
Approved by: https://github.com/bdhirsh
2022-11-01 16:00:33 +00:00
73c9911fc0 always realize output regardless of the number of reads (#88046)
This improves hf_Bert from 1.139x to 1.21x. Currently, low-memory dropout doesn't work for the nn.Dropout module, and before this change we were recomputing all the dropout masks in a very inefficient kernel. This change causes the dropout masks to be saved in the dropout kernels where they are first computed.

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88046
Approved by: https://github.com/Chillee
2022-11-01 15:47:43 +00:00
c368c0faf0 Fix meta for aten.fill, constant_pad_nd, _adaptive_avg_pool2d (#88069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88069
Approved by: https://github.com/ngimel, https://github.com/malfet
2022-11-01 15:36:06 +00:00
82a9de16d4 Change dynamo/distributed tests to use cuda/nccl (#88133)
- FSDP tests require nccl
- also run in inductor shard and skip inductor in distributed shard
- inductor shard has newer GPU and supports triton/inductor, but only runs on trunk
- distributed shard runs on PR, but inductor shard only runs on trunk/opt-in

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88133
Approved by: https://github.com/davidberard98
2022-11-01 15:35:44 +00:00
44f8efd5c1 [BE]fix DDP when the number of output features is zero (#87793)
Fixes #87280

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87793
Approved by: https://github.com/rohan-varma
2022-11-01 15:27:40 +00:00
20d849b982 [9/N] [Dispatchable Collectives] Update reduce_scatter with CPU / CUDA implementations (#86166)
### Changes
- Updates for the reduce_scatter collective

### Context
https://github.com/pytorch/pytorch/issues/86225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86166
Approved by: https://github.com/kwen2501
2022-11-01 15:23:41 +00:00
1e5d33b6df Reenable assert sanity testing with ADInplaceOrView reenable (#88102)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88102
Approved by: https://github.com/albanD
2022-11-01 14:29:00 +00:00
bdb14238ec [Reland][ONNX] Move all torch.onnx.export related tests to test/onnx (#87292)
Moving torch.onnx.export related tests to test/onnx integrates ONNX tests to the same CI machine, so the testing environment can be better managed.

Fixes https://github.com/pytorch/pytorch/issues/87320
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87292
Approved by: https://github.com/thiagocrepaldi, https://github.com/BowenBao, https://github.com/kit1980, https://github.com/malfet
2022-11-01 14:22:46 +00:00
62988e4fe6 Update _distributed_c10d.pyi (#88088)
Summary: `_distributed_c10d.pyi` is out of sync with the C++ binding. This change updates it.

Test Plan: TBD

Differential Revision: D40840836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88088
Approved by: https://github.com/wanchaol
2022-11-01 13:51:06 +00:00
b1750d0440 [FSDP()][13/N] Refactor unshard/reshard/grads (#87926)
This PR is not too complicated. We just move unshard/reshard/grads out to `_runtime_utils.py` and make them take `state: _State` instead of `self`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87926
Approved by: https://github.com/mrshenli
2022-11-01 13:37:31 +00:00
8039317c07 [FSDP()][12/N] Easy cleanup (#87925)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87925
Approved by: https://github.com/mrshenli
2022-11-01 12:39:24 +00:00
c1e28731b3 [FSDP()][10/N][11/N] Introduce composable (ctor only) (#87924)
This PR introduces the composable FSDP API (with constructor semantics only) along with some further constructor refactoring. A notable contribution here is `_get_submodule_to_states()`, which performs auto wrapping without actually wrapping.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87924
Approved by: https://github.com/mrshenli
2022-11-01 12:39:24 +00:00
78170701a3 [FSDP()][9/N] Refactor ctor (continued) (#87923)
This PR makes a second pass over the constructor. The logic has been grouped into `_init_<...>` functions based on intent (e.g. `_init_prefetching_state()` or `_init_runtime_state()`). This makes the initialization code for composable FSDP much cleaner than having to re-write the same sequences of lower-level helper calls.

This PR also moves `_ExecOrderData` into its own file `_exec_order_utils.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87923
Approved by: https://github.com/mrshenli
2022-11-01 12:39:21 +00:00
23fe6c8ca1 [Static Runtime] Fix ReplaceWithMaybeCopy test in OSS (#88099)
Summary: `ReplaceWithMaybeCopy` is guarded by `FBCODE_CAFFE` in `OptimizeGraph`. Run the pass manually to ensure it does the replacement.

Test Plan: Existing tests

Differential Revision: D40858743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88099
Approved by: https://github.com/huydhn
2022-11-01 09:58:26 +00:00
7c6fe21a38 Fix monitoring script for macos (#88159)
The monitoring script is currently failing with AccessDenied when trying to access uss memory on mac because [psutil.memory_full_info](https://psutil.readthedocs.io/en/latest/index.html?highlight=memory_full_info) requires higher user privileges

Example failures:
* https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3363066309/1/artifact/usage-log-test-default-2-2-macos-12_9208104847.zip
* https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3363066309/1/artifact/usage-log-test-default-2-2-macos-m1-12_9207913759.zip

I could also make this script run with sudo, effectively granting this permission. But I'm not entirely sure that we need uss memory on Mac, so gracefully handling the error looks nicer.
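
A minimal sketch of the graceful fallback this describes (the helper name and dict layout are illustrative): try `memory_full_info()` for uss and fall back to plain rss when macOS denies access.
```python
import psutil

def memory_stats(pid: int) -> dict:
    proc = psutil.Process(pid)
    try:
        info = proc.memory_full_info()          # needs elevated privileges on macOS
        return {"rss": info.rss, "uss": info.uss}
    except psutil.AccessDenied:
        return {"rss": proc.memory_info().rss}  # skip uss instead of crashing

print(memory_stats(psutil.Process().pid))
```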
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88159
Approved by: https://github.com/clee2000
2022-11-01 05:58:44 +00:00
323c646ca9 Cleaned up the nvFuser Python Frontend Batch Norm printing (#88057)
* Removed `define_null_tensor` usage in favor of using optional arguments for binding.
* Re-ordered the non-State arguments for easier printing.
* Added a printing function to include booleans `training` and `channels_last`
* Fixed `define_tensor` to print `is_cpu`

cc @jjsjann123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88057
Approved by: https://github.com/IvanYashchuk, https://github.com/jjsjann123, https://github.com/mruberry
2022-11-01 05:05:15 +00:00
a6acbad5c3 [BE] Use default constructor in LoggerVoidify (#88054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88054
Approved by: https://github.com/kit1980
2022-11-01 03:59:51 +00:00
560786ac20 call contiguous on BMM inputs for NT on CUDA (#88108)
Fixes #87713

BMM for CPU supports non-contiguous nested tensor inputs, while BMM for CUDA currently does not support non-contiguous inputs.

The derivative for BMM:
```
- name: bmm(Tensor self, Tensor mat2) -> Tensor
  self: grad.bmm(mat2.transpose(1, 2).conj())
  mat2: self.transpose(1, 2).conj().bmm(grad)
  result: self_t.bmm(mat2_p) + self_p.bmm(mat2_t)
```

When calling backward it was impossible for this function to succeed since the inputs were always non-contiguous, regardless of the user input. This adds contiguous calls to the BMM CUDA implementation for nested tensors.

This was not caught by tests because grad_check is currently only done on CPU in test_nestedtensors. This PR updates the autograd test to also be run on GPU.

As a result I found one more issue with the backward for to_padded_tensor erroring instead of calling the generic version.
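
An illustrative sketch of the workaround (the wrapper name is hypothetical; the real change lives inside the CUDA nested-tensor bmm code): make both operands contiguous before bmm so backward no longer trips over non-contiguous inputs.
```python
import torch

def nested_bmm_cuda(a, b):
    # The CUDA nested-tensor bmm assumed contiguous inputs; the CPU path did not.
    return torch.bmm(a.contiguous(), b.contiguous())
```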

cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88108
Approved by: https://github.com/cpuhrsch
2022-11-01 03:14:27 +00:00
0eea05b11e Remove "prims_nvfuser" backend for TorchDynamo (#88083)
Removing "prims_nvfuser" backend according to the discussion in https://github.com/pytorch/torchdynamo/pull/1281#discussion_r979468355.

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88083
Approved by: https://github.com/ezyang
2022-11-01 03:09:37 +00:00
a8aaee77be [torch::deploy] add gpu unit tests to CI (#88107)
Adds `torch::deploy`'s GPU tests to core CI to make sure core changes don't break them.

Overall, deploy tests take 11 min, so it shouldn't be much of a burden :)  https://github.com/pytorch/pytorch/actions/runs/3364231795/jobs/5578861939

Differential Revision: [D40861442](https://our.internmc.facebook.com/intern/diff/D40861442)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88107
Approved by: https://github.com/d4l3k, https://github.com/anirbanr-fb-r2p
2022-11-01 02:54:44 +00:00
6a75a0d1a1 Add support for neg to NestedTensor (#88131)
Partially fixes #86889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88131
Approved by: https://github.com/drisspg
2022-11-01 02:37:42 +00:00
708c050af9 Add labeler with cpu, mkldnn, amp, NNC and quantization paths to start (#87690)
This PR adds a labeler with `module: cpu`, `module: mkldnn`, `module: amp (automated mixed precision)`, `NNC`, and `oncall: quantization` paths to start.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87690
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-11-01 02:06:30 +00:00
3aa7a52855 [xnnpack][lite-int][4/n] introduce serialization to delegate (#87908)
We introduce the serializer created in the previous diff into our XNNGraph builder; the purpose is to serialize parts of the graph as we build it. At the end, we can finish and serialize the xnngraph into a std::string for use when we forward it along to the on-device runtime.

The next diff will rebuild the xnngraph from the serialization we introduce here, so testing the serialization of the graph will be done in the next diff

Differential Revision: [D39335580](https://our.internmc.facebook.com/intern/diff/D39335580/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39335580/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87908
Approved by: https://github.com/digantdesai
2022-11-01 01:48:32 +00:00
8287c1d964 [xnnpack][lite-int][3/n] flatbuffer serializer class (#87907)
Creating a serializer class that allows us to serialize the xnnpack graph creation arguments. This essentially abstracts away the flatbuffer api manipulation and serialization that we deal with.

As a result we can call
```
XNNSerializer::serializeAddNode()
XNNSerializer::serializeTensorValue()
XNNSerializer::finishAndSerialize
```
to serialize the graph

Differential Revision: [D39196312](https://our.internmc.facebook.com/intern/diff/D39196312/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39196312/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87907
Approved by: https://github.com/digantdesai
2022-11-01 01:44:18 +00:00
7bf819b181 [xnnpack]lite-int][2/n] flatbuffer xnn_value schema (#87906)
serializer schema for xnnpack graphs

Differential Revision: [D39003170](https://our.internmc.facebook.com/intern/diff/D39003170/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87906
Approved by: https://github.com/digantdesai
2022-11-01 01:39:41 +00:00
905d532d39 [xnnpack][lite-int][1/n] flatbuffer buck rules (#87826)
Writing a placeholder schema.fbs file for now to set up the buck gen rules. The generated schema file will be used in the xnnpack namespace and be reserved for serialization/deserialization of our xnnpack lowered graph

Steps Accomplished

- Buck rules to compile flatbuffer schema
- added header file to preprocess
- everything compiles correctly

Differential Revision: [D38999169](https://our.internmc.facebook.com/intern/diff/D38999169/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38999169/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87826
Approved by: https://github.com/digantdesai
2022-11-01 01:36:52 +00:00
aa1f9a1bd7 [xnnpack][lite-int][graph-build] torchscript -> xnnpack graph (#87824)
At this point we perform the conversion from TorchScript IR to an XNNPACK graph. Currently we only support converting Add nodes and fp32 tensor values.

As a caveat, we are not building this at runtime. So for testing we just run the xnn graph once ahead of time with sample inputs and forward the result along to execute. This is only for testing and will be changed in a later diff. This allows us to check that graph creation is sound.

Differential Revision: [D39838851](https://our.internmc.facebook.com/intern/diff/D39838851/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87824
Approved by: https://github.com/digantdesai, https://github.com/salilsdesai
2022-11-01 01:24:56 +00:00
d596b048e5 Also skip large models for normal --accuracy runs (#88086)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88086
Approved by: https://github.com/albanD
2022-11-01 00:59:09 +00:00
afd00673b6 Change Nested Tensor logging copy (#88104)
# Summary
Change the copy of how we log NestedTensor usage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88104
Approved by: https://github.com/mikaylagawarecki
2022-11-01 00:00:35 +00:00
c0761a835b Revert "[dynamo] Error when user nests FX with dynamo (#87797)"
This reverts commit 1da5aeb97b73664ff0fe2f4bb48379655cede969.

Reverted https://github.com/pytorch/pytorch/pull/87797 on behalf of https://github.com/ezyang due to breaks nvfuser stack, needs more investigation
2022-10-31 23:49:37 +00:00
caaf37a111 Fix PyTorchStreamWriter exception handling (#88128)
Avoid a double exception in the destructor when attempting to serialize to a
Python object that does not have a `write` method.

Use the `Finalizer` class in `PyTorchStreamWriter::writeEndOfFile()` to
always set the `finalized_` property even if an exception occurs (as there
isn't much one can do at this point).

Add an explicit check for the attribute to `_open_zipfile_writer_buffer` and
add unit tests.

Modernize the code a bit by using the Python-3 `super()` method.

Fixes https://github.com/pytorch/pytorch/issues/87997
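
A quick sketch of the failure mode (the exact exception type raised is an assumption here): saving to an object with no usable `write` method should now fail with an ordinary Python error instead of a double exception in the writer's destructor.
```python
import torch

class NotAFile:
    pass  # no write() method

try:
    torch.save(torch.ones(2), NotAFile())
except Exception as exc:  # exact exception type not asserted here
    print(type(exc).__name__, exc)
```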

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88128
Approved by: https://github.com/albanD
2022-10-31 23:38:03 +00:00
ea8a5b09a9 [IOS] Update Cocoapods for 1.13 release (#88075)
Update the podspecs for libtorch and libtorch-lite to v 1.13 to prepare for the 1.13 pod release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88075
Approved by: https://github.com/manuelcandales, https://github.com/salilsdesai, https://github.com/malfet
2022-10-31 23:36:00 +00:00
bc03aa6013 Store autocast_gpu_dtype in custom_fwd and custom_bwd for BFloat16 autocast (#88029)
As per #87979, `custom_bwd` seems to forcefully use `torch.float16` for `torch.autograd.Function.backward` regardless of the `dtype` used in the forward.

Changes:
- store the `dtype` in `args[0]`
- update tests to confirm the dtype of intermediate result tensors that are outputs of autocast compatible `torch` functions
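
An illustrative usage sketch of the decorators this touches: `custom_fwd` records the active autocast dtype so `custom_bwd` can re-enter autocast with the same dtype (bfloat16 or float16) instead of assuming float16.
```python
import torch
from torch.cuda.amp import custom_fwd, custom_bwd

class ScaledMul(torch.autograd.Function):
    @staticmethod
    @custom_fwd
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return a * b

    @staticmethod
    @custom_bwd  # backward re-enters autocast with the dtype recorded in forward
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        return grad_out * b, grad_out * a
```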

cc @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88029
Approved by: https://github.com/ngimel
2022-10-31 22:45:26 +00:00
f2b247f0d8 Remove stale comment (#88135)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88135
Approved by: https://github.com/albanD
2022-10-31 22:29:07 +00:00
139afc50ec Fix links to tutorial in torch masked docs (#88129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88129
Approved by: https://github.com/jisaacso
2022-10-31 21:31:54 +00:00
9fed04ba33 fix for auto labeler (#88100)
followed https://lightrun.com/answers/actions-labeler-how-to-only-add-label-not-remove-when-pr-is-opened

side note: should we move this logic to test-infra to be with the release notes labeler?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88100
Approved by: https://github.com/huydhn
2022-10-31 21:12:54 +00:00
ba26bc0fc2 Fix random "C1041: cannot open program database" errors when compiling on Windows (#88084)
Adds `/FS` option to `CMAKE_CXX_FLAGS` and `CMAKE_CUDA_FLAGS`.

So far I've encountered this kind of errors:

```
C:\Users\MyUser\AppData\Local\Temp\tmpxft_00004728_00000000-7_cuda.cudafe1.cpp: fatal error C1041: cannot open program database 'C:\Projects\pytorch\build\third_party\gloo\gloo\CMakeFiles\gloo_cuda.dir\vc140.pdb'; if multiple CL.EXE write to the same .PDB file, please use /FS
```
when building with VS 2022.

cc @peterjc123 @mszhanyi @skyline75489 @nbcsm

Related issues:
- https://github.com/pytorch/pytorch/issues/87691
- https://github.com/pytorch/pytorch/issues/39989
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88084
Approved by: https://github.com/ezyang
2022-10-31 21:11:16 +00:00
73379acaf3 Do not use unsafe restriding for subclasses (#87610)
This helps convert some accuracy errors into runtime errors,
which makes it easier to debug.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87610
Approved by: https://github.com/albanD
2022-10-31 20:49:15 +00:00
6fe41e76a9 Create separate files for NT Unary, Binary and Matmul ops (#88091)
Improves code organization and code sharing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88091
Approved by: https://github.com/drisspg
2022-10-31 20:10:07 +00:00
1a9edc8136 Changing from sample_inputs to reference_inputs in test_compare_cpu (#86462)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86462
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-31 20:06:03 +00:00
4c78c7c82a Enable src_mask in fast path of TransformerEncoderLayer (#87377)
## Issues
Fixes https://github.com/pytorch/pytorch/issues/81129#issuecomment-1179435674

## Description

Passing a 2D attention mask `src_mask` into the fast path of `TransformerEncoderLayer` on CPU was causing an error, so it was disabled in https://github.com/pytorch/pytorch/pull/81277. This PR rolls back that restriction, enabling `src_mask` on the fast path:

- Either attention mask `src_mask` of shape `(L, L)` or padding mask `src_key_padding_mask` of shape `(B, L)` are now allowed on the CPU fast path. If softmax is applied along the last dimension (as in multi-head attention), these masks are processed without expanding them to 4D. Instead, when iterating through the input, `Softmax.cpp::host_softmax` converts the index to match the mask dimensions, depending on the type.
- If softmax is applied along the dimension other than the last, `Softmax.cpp::masked_softmax_cpu` expands masks to 4D, converting them to `mask_type=2`. Theoretically one could also add special optimized cases for `dim=0, 1, 2` and process them without mask expansion, but I don't know how often is that used
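
A short usage sketch of the two mask types now accepted on the CPU fast path (shapes per the bullets above; the all-False masks here are trivial and purely illustrative):
```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True).eval()
x = torch.randn(2, 5, 16)                        # (B, L, E)
attn_mask = torch.zeros(5, 5, dtype=torch.bool)  # (L, L) attention mask
pad_mask = torch.zeros(2, 5, dtype=torch.bool)   # (B, L) padding mask

with torch.no_grad():  # fast path only triggers in inference-like conditions
    out_attn = layer(x, src_mask=attn_mask)
    out_pad = layer(x, src_key_padding_mask=pad_mask)
print(out_attn.shape, out_pad.shape)
```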

## Tests:
- `test_transformerencoderlayer_fast_path` is extended to cover both attention mask and padding mask
- `test_masked_softmax_mask_types_0_1` is added to ensure results from CPU softmax with attention and padding masks match the explicit slow calculation
- `test_masked_softmax_devices_parity` is added to ensure results from masked softmax on CPU and CUDA match

## Note
I had to replace `float` with `torch.get_default_dtype()` in a couple of tests for the following reason:
- `test_nn.py` [sets the default type to `torch.double`](https://github.com/pytorch/pytorch/blob/master/test/test_nn.py#L24-L26)
- If I execute `test_nn.py` and `test_transformers.py` in one `pytest` run, this default still holds for transformer tests
- Some tests in `test_transformers.py` which were previously following the slow path now switched to fast path, and hard-coded `float` started clashing with default `double`

Let me know if there is a better way around it - or maybe I'm not supposed to run tests with `pytest` like this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87377
Approved by: https://github.com/mikekgfb, https://github.com/weiwangmeta, https://github.com/malfet
2022-10-31 19:59:36 +00:00
e9599724fa Revert "[ONNX] Move all torch.onnx.export related tests to test/onnx (#87292)"
This reverts commit e3e84830aade59722d819bc5fa01922239494790.

Reverted https://github.com/pytorch/pytorch/pull/87292 on behalf of https://github.com/weiwangmeta due to breaking internal test relating to quantization eager tests, see test/quantization/eager/test_quantize_eager_ptq.py test_lower_graph_linear and test_lower_graph_conv2d
2022-10-31 19:55:58 +00:00
e9cabef663 enable xpu group norm channels last support (#87680)
XPU can support the channels-last format for the group norm operator; however, PyTorch converts all input tensors to contiguous format, including channels-last tensors. PyTorch needs to pass this memory-format hint down to us.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87680
Approved by: https://github.com/albanD
2022-10-31 19:46:01 +00:00
7d2f1cd211 Fix typos under docs directory (#88033)
This PR fixes typos in `.rst` and `.Doxyfile` files under docs directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88033
Approved by: https://github.com/soulitzer
2022-10-31 19:31:56 +00:00
c7ac333430 Fix args for meta__fused_moving_avg_obs_fq_helper (#88058)
Fixes https://github.com/pytorch/torchdynamo/issues/1802

There are a few problems,
1. torch.fused_moving_avg_obs_fake_quant doesn't have an OpInfo test
2. self.empty_like() is not a valid call. It should be torch.empty_like(self)
3. The Python meta function has some unexplained behavior for arguments with a default value of bool type?

In particular, problem 3 is the most concerning one.
**UPDATE: This is expected behavior, see discussion below for explanation.**

Without setting the default value for `per_row_fake_quant` and `symmetric_quant`, it gets the following error when running with meta tensor.
```
meta__fused_moving_avg_obs_fq_helper() missing 2 required positional arguments: 'per_row_fake_quant' and 'symmetric_quant'
```
I can fix this by adding the default values to these two args. However, I observe something strange when examining the actual values in the meta function.

```
    print("per_row_fake_quant", per_row_fake_quant)
    print("symmetric_quant", symmetric_quant)
```

When the default values are False, the printed values correctly reflect the argument values populated from the call site.
When the default values are True, the printed value is ALWAYS True, regardless of the value populated from the call site.
When the default values are None, the printed value is `None` when the call site sets the value to 'False', and 'True' when the call site sets the value to 'True'.

I also verified that this bug affects other meta functions with default args.

My speculation is that this is something about pybind value packing when calling from the C++ dispatcher into the Python meta function, and about default-value parsing for Python meta functions (and other Python dispatch functions)?

I tried to find the C++ call stack, but gdb is missing symbols and the C++ stack trace is not working properly... I'd appreciate anyone who can point me to the source file for pybind value packing.

cc @ezyang
cc @bdhirsh. I know you had a fix in the symbolic shape branch...
cc @yanboliang  who reported this bug
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88058
Approved by: https://github.com/bdhirsh, https://github.com/yanboliang
2022-10-31 19:00:16 +00:00
3eb379052d unfold_backward: Remove stride >= size kernel in favour of copy_ (#88061)
unfold_backward has a dedicated kernel for `stride >= size` which uses temporary
tensors created by `at::arange` to perform the mapping from unfolded to folded.
This instead uses `unfold` to view the output, and does a direct copy from the
gradient into the view.

In benchmarks I see either no difference or a marginal speed benefit from
this PR.
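
A hedged sketch of the replacement approach (simplified; the real kernel handles more bookkeeping): for `stride >= size` the windows don't overlap, so the gradient input can be viewed through `unfold` and the incoming gradient copied straight into that view.
```python
import torch

def unfold_backward_copy(grad_out, input_size, dim, size, step):
    grad_in = torch.zeros(input_size, dtype=grad_out.dtype, device=grad_out.device)
    # Non-overlapping windows (step >= size) make this direct copy safe.
    grad_in.unfold(dim, size, step).copy_(grad_out)
    return grad_in

x = torch.randn(10)
y = x.unfold(0, 2, 3)        # forward: 3 windows of size 2, step 3
g = torch.ones_like(y)
print(unfold_backward_copy(g, x.shape, 0, 2, 3))
```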
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88061
Approved by: https://github.com/albanD
2022-10-31 18:45:44 +00:00
ceddcf5434 istft: Use unfold_backward instead of col2im (#88060)
`unfold_backward` implements the same operation as `col2im` but without support
for 2d kernels or dilation. However, `istft` doesn't use any of those features
and `unfold_backward` actually has a faster `TensorIterator` based
implementation so we should use it here instead.

In the example from #87353 I see a 2x speedup on both CPU and CUDA.

On a wider variety of sizes and inputs I still see speedups across the board, especially
on CPU since `col2im` isn't parallelized but `unfold_backward` is:

| device | shape           | hop_length | Master (us) | This PR (us) | Speedup |
|--------|-----------------|------------|-------------|--------------|---------|
| CUDA   | (1, 129, 33)    | 256        | 147         | 136          | 1.08    |
|        |                 | 128        | 153         | 128          | 1.20    |
|        | (100, 129, 20)  | 256        | 181         | 147          | 1.23    |
|        |                 | 128        | 171         | 137          | 1.25    |
|        | (1000, 129, 10) | 256        | 681         | 443          | 1.55    |
|        |                 | 128        | 632         | 446          | 1.42    |
| CPU    | (1, 129, 33)    | 256        | 106         | 104          | 1.02    |
|        |                 | 128        | 103         | 81           | 1.27    |
|        | (100, 129, 20)  | 256        | 2400        | 399          | 6.02    |
|        |                 | 128        | 2150        | 313          | 6.87    |
|        | (1000, 129, 10) | 256        | 13800       | 3740         | 3.69    |
|        |                 | 128        | 12700       | 2110         | 6.02    |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88060
Approved by: https://github.com/albanD
2022-10-31 18:45:44 +00:00
ff94494644 Revert "Revert "Unify meta tensor and fake tensor converter conversion (#87943)"" (#88045)
This reverts commit bc64999b8382796199178cf480adf51512b5f139.

Check torch/_subclasses/meta_utils.py for "This is very tricky" for the bugfix explanation.

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88045
Approved by: https://github.com/kit1980, https://github.com/Chillee
2022-10-31 17:50:14 +00:00
2e1199d171 [quant][fx] Fix a typo in utils.py (#88024)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizeFx.test__convert_to_reference_decomposed_fx

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88024
Approved by: https://github.com/HDCharles, https://github.com/z-a-f
2022-10-31 17:31:58 +00:00
0a4ca9d083 Fix meta for aten.angle and aten.index_copy (#88066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88066
Approved by: https://github.com/albanD
2022-10-31 17:11:29 +00:00
a3f8495b84 [primTorch fix] use _maybe_convert_to_dtype (#85163)
Fixes #84561

- [x] fix lint tests

cc: @Lezcano!!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85163
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-31 17:08:55 +00:00
2702aaffc0 remove old label check functionality (#88007)
No longer needed, as check_labels.py now checks whether the PR has labels and blocks the merge.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88007
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi
2022-10-31 16:52:58 +00:00
83f31ffdfe Move check labels to separate workflow (#87999)
* moves check labels to separate workflow that is triggered on the usual pull_request triggers as well as labeled and unlabeled
* deletes comments when label is added

Fixes https://github.com/pytorch/test-infra/issues/978 and https://github.com/pytorch/pytorch/issues/87865
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87999
Approved by: https://github.com/huydhn
2022-10-31 16:52:30 +00:00
5723fd503c Fix meta function for aten.flip and aten.rot90 (#88065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88065
Approved by: https://github.com/mruberry
2022-10-31 16:52:05 +00:00
9308cefbdf [FSDP()][8/N] Refactor limiter's _FreeEventQueue (#87922)
This PR is easy. It just moves `_FreeEventQueue` into its own file `_limiter_utils.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87922
Approved by: https://github.com/rohan-varma, https://github.com/mrshenli
2022-10-31 16:45:24 +00:00
d89cf2fdc9 [FSDP()][7/N] Refactor most of ctor (#87921)
The goal of this PR is to make one pass over the FSDP constructor and refactor each helper method call to not be `self.<...>`. Subsequent PRs will make further passes over the FSDP constructor.

This PR looks like a lot of lines of code change, but it is only reorganization. Methods are moved to `_init_utils.py` and `_common_utils.py`. This also marks the beginning of moving methods from `_utils.py` to `_common_utils.py` -- they will be coalesced eventually. I am only using `_common_utils.py` as a staging ground to include the methods that have been affected by the refactoring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87921
Approved by: https://github.com/mrshenli
2022-10-31 16:45:24 +00:00
9d9267c6f7 [FSDP()][3/N] Refactor public APIs (#87917)
- This PR defines a new `api.py` meant to hold the public API for FSDP (minus `FullyShardedDataParallel` itself). This is needed because several of the `_<...>_utils.py` files rely on the public API, and we cannot import from `torch.distributed.fsdp.fully_sharded_data_parallel` without a circular import. Calling the file `api.py` follows the convention used by `ShardedTensor`.
- This PR cleans up the wording in the `BackwardPrefetch`, `ShardingStrategy`, `MixedPrecision`, and `CPUOffload` docstrings.
- This PR adds the aforementioned classes to `fsdp.rst` to have them rendered in public docs.
- To abide by the public bindings contract (`test_public_bindings.py`), the aforementioned classes are removed from `fully_sharded_data_parallel.py`'s `__all__`. This is technically BC breaking if someone uses `from torch.distributed.fsdp.fully_sharded_data_parallel import *`; however, that does not happen in any of our own external or internal code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87917
Approved by: https://github.com/mrshenli
2022-10-31 16:45:21 +00:00
59fe272c1e Fix: prefer .is_none() over .is(py::none()) for pybind11 (#88051)
Fixes a minor perf regression I saw in #85688 and replaces the pattern throughout the code base. `obj == Py_None` is directly equivalent to is_none(). Constructing a temporary py::none() object needlessly increments and decrements the refcount of py::none; this method avoids that and is therefore more efficient.
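
Roughly, the replacement looks like this (a schematic sketch with a made-up function name, not the exact diff from this PR):

```cpp
#include <pybind11/pybind11.h>
namespace py = pybind11;

void check_obj(const py::object& obj) {
  // Before: materializes a temporary py::none(), which increfs/decrefs the
  // None singleton just to do an identity comparison.
  if (obj.is(py::none())) { /* ... */ }

  // After: compares against Py_None directly, no temporary object needed.
  if (obj.is_none()) { /* ... */ }
}
```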
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88051
Approved by: https://github.com/albanD
2022-10-31 16:41:27 +00:00
75dbe37909 make autocast cache global instead of thread-local (#86492)
Summary:

There is a memory leak because `torch.clear_autocast_cache()` clears
the autocast cache from the main thread, but autograd can write to
this cache from a background thread, so whatever autograd writes
will leak.

With some offline discussion we decided that a global cache is a
practical way to deal with this, and the performance impact of the
lock should be negligible.
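
Schematically, the change amounts to replacing a thread-local cache with one global cache behind a mutex. The sketch below uses made-up key/value types and names; it is not the actual autocast code.

```cpp
#include <mutex>
#include <unordered_map>

// Made-up key/value types standing in for the real autocast cache entries.
using Key = const void*;
using Value = int;

// Before: thread_local meant the autograd thread wrote into its own copy,
// unreachable from clear_autocast_cache() on the main thread (hence the leak).
// thread_local std::unordered_map<Key, Value> cache;

// After: a single global cache shared by all threads, guarded by a mutex.
std::mutex cache_mutex;
std::unordered_map<Key, Value> cache;

void clear_cache() {
  std::lock_guard<std::mutex> guard(cache_mutex);
  cache.clear();  // now also drops entries inserted by background threads
}
```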

Test Plan:

I don't have a local repro of the original issue, need to look into how to get
that.

A toy example
(https://gist.github.com/vkuzo/0d6318fe7f7cb1c505e370cd5c1a643b)
does cache clearing as expected on forward and backward pass.

local testing:
```
python test/test_cuda.py -k autocast
python test/test_autocast.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86492
Approved by: https://github.com/ezyang
2022-10-31 16:12:37 +00:00
34f523b221 [FSDP] Enable use_orig_params=True test (#88034)
I accidentally committed the `use_orig_params` PR with this test disabled. This PR simply re-enables it. It passes locally, so if CI is green, then this is an easy land.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88034
Approved by: https://github.com/H-Huang
2022-10-31 14:28:51 +00:00
df1cc0ef47 [Vulkan] Add Vulkan Rewrite to Transfer Inputs and Outputs to Vulkan and CPU Backends Respectively (#87432)
With this change, we don't have to manually invoke the input and output backend transfers when we run vulkan models.

Graph rewrite code based off of:
- 32efff45ba (diff-a473bddb458dc24225866a45092d6eca064eddd256245d93020e48e216eee4d5R160-R179)

Differential Revision: [D39519168](https://our.internmc.facebook.com/intern/diff/D39519168/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39519168/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87432
Approved by: https://github.com/mcr229, https://github.com/digantdesai
2022-10-31 14:18:45 +00:00
bc68625151 [Vulkan] Add support for Optimization Blocklist to Vulkan Rewrite (#87431)
Optimization Blocklist will be used in a future diff (D40315730) to make the rewrite to transfer input/output backends optional

Differential Revision: [D40315729](https://our.internmc.facebook.com/intern/diff/D40315729/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87431
Approved by: https://github.com/mcr229, https://github.com/digantdesai
2022-10-31 14:15:51 +00:00
f717986f93 .gitignore log files (#88085)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88085
Approved by: https://github.com/albanD
2022-10-31 13:40:30 +00:00
8ea19c802e Make IValue::unsafeToTensorImpl a little less unsafe. (#88043)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88043
Approved by: https://github.com/anjali411, https://github.com/albanD
2022-10-31 13:20:19 +00:00
e238752e20 Simplify magic method definition code. (#88017)
It turns out sym_float (and the hypothetical sym_int) can
be defined in the same way as conventional magic methods.
Do so.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88017
Approved by: https://github.com/albanD
2022-10-31 13:19:56 +00:00
2a47b10780 Get the magic method try reverse protocol correct (#88030)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88030
Approved by: https://github.com/anjali411, https://github.com/albanD
2022-10-31 13:19:56 +00:00
12dd877395 Fix all references to torchdynamo from the merge (#87731)
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @jansel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87731
Approved by: https://github.com/yanboliang, https://github.com/ezyang, https://github.com/anijain2305, https://github.com/jansel
2022-10-31 06:51:07 +00:00
496acb6602 Add fake tensor files to ciflow/inductor (#88052)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88052
Approved by: https://github.com/anijain2305
2022-10-31 05:35:54 +00:00
6735bf21c7 [test_nn] split convolution tests from test_nn (#87474)
Ref #63085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87474
Approved by: https://github.com/albanD
2022-10-31 04:42:45 +00:00
46ce92713d fix github bug issue 87552 (#88059)
Fixes #87552
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88059
Approved by: https://github.com/jgong5, https://github.com/ngimel
2022-10-31 04:40:54 +00:00
e24ce484ed Use scaled_dot_product_attention within attention.cpp (#87312)
# Summary
Use the private _scaled_dot_product_attention to support _native_multiheaded_attention. _SDP provides access to fused kernels when certain conditions are met, enabling a speedup for MHA.

cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87312
Approved by: https://github.com/cpuhrsch
2022-10-31 04:06:31 +00:00
d13f1e6ab4 Add sequence number support for UCC (#85047)
Add sequence number support for UCC, mostly following the format of ProcessGroupNCCL.
Pass new test: `test_all_gather_object_subgroup`
Add skips for gather tests: `test_gather_object` and `test_gather_object_subgroup`

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85047
Approved by: https://github.com/kwen2501
2022-10-31 03:56:55 +00:00
9642a7c2f6 [ONNX] Fix get wrong summary of the docstring in torch.onnx._deprecation.deprecated (#87194)
The summary of the deprecated function could be multi-line. Therefore the code below:
9ac2a06acf/torch/onnx/_deprecation.py (L45)
should be adjusted to

```python
summary_and_body = docstring.split("\n\n", 1)
```
Otherwise, a multi-line summary will be split incorrectly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87194
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2022-10-31 03:00:30 +00:00
d67b2edec3 [dynamo][dashboard] minor fixes for a clean Dashboard (#88056)
* better check for cold start latency
* sort on inductor column for better readability.

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88056
Approved by: https://github.com/ngimel
2022-10-31 02:30:29 +00:00
9109ecf914 Even "nvcc not found" should be commented out (#87959)
Summary: Even "nvcc not found" should be commented out in minifier_launcher.py, cause there could be a case that PyTorch/minifier can find cuda path but nvcc is not explicitly included in env variable like PATH.

Differential Revision: D40790023

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87959
Approved by: https://github.com/anijain2305, https://github.com/jianyuh
2022-10-30 18:22:17 +00:00
1b575782a0 [dynamo][benchmarks] use fresh inductor cache and raise batch size wherever possible (#88044)
cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88044
Approved by: https://github.com/ngimel
2022-10-30 17:10:17 +00:00
e7b854fae9 [BE] Do not package caffe2 in wheel (#87986)
If PyTorch is built without caffe2 integration, do not package unusable
.py files/headers

Same is true about functorch - don't package it unless building with `functorch` (although, I wonder if we should remove this option at some point in the future)

Followup after https://github.com/pytorch/builder/pull/1181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87986
Approved by: https://github.com/seemethere
2022-10-30 04:31:45 +00:00
65e7719599 [vision hash update] update the pinned vision hash (#87948)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87948
Approved by: https://github.com/pytorchbot
2022-10-30 03:02:57 +00:00
621158cd7f [BE] Do not assign string literal to char * (#87949)
Not sure what I was thinking when writing something like:
```
auto foo = std::getenv("BAR");
if (!foo) {
   foo = "baz";
}
```
since `std::getenv` returns `char *` (i.e. a mutable string), but string literals are immutable (i.e. `const char *`).
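
A sketch of the corrected pattern (hypothetical function and variable names, not the exact code changed here):

```cpp
#include <cstdlib>

const char* bar_or_default() {
  // std::getenv returns char*, which converts implicitly to const char*;
  // the string-literal fallback is itself a const char*, so the variable
  // must be declared const char* for the assignment to be well-formed.
  const char* value = std::getenv("BAR");
  if (value == nullptr) {
    value = "baz";
  }
  return value;
}
```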

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87949
Approved by: https://github.com/kit1980
2022-10-30 01:04:55 +00:00
59001d05b4 [Inductor] Enable Inductor unspec inputs test for different dtypes (#87809)
Fixes #ISSUE_NUMBER

cc @jansel @mlazos @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87809
Approved by: https://github.com/ngimel
2022-10-29 20:36:20 +00:00
bc64999b83 Revert "Unify meta tensor and fake tensor converter conversion (#87943)"
This reverts commit baa715e790921e6498861e59556035de1a481cc5.

Reverted https://github.com/pytorch/pytorch/pull/87943 on behalf of https://github.com/kit1980 due to Broke several inductor tests
2022-10-29 18:39:28 +00:00
e4a8661ab8 torchdynamo and xla integration (#87741)
# Motivation
- torchdynamo and torchxla use different strategies to achieve sound graph capture: the former relies on guards, the latter on retracing
- the guard system has quite low overhead, but torchxla's tracing overhead is quite high

The main idea is to leverage torchdynamo's guard system to avoid retracing in torchxla, so that
- we can integrate torchdynamo with XLA
- we reduce or even completely avoid torchxla's tracing overhead

# Technique details
## XLA baseline
We found that different frameworks do not generate numerically identical results for the SAME model with the SAME input. By default, torchdynamo uses eager as baseline so the model will run with PyTorch. It would be tricky to compare a model running on XLA with this baseline: it's hard to check correctness. To make the comparison easier, we add a flag `--use-xla-baseline`. When it's enabled, the baseline will be run on XLA.

## New dynamo backends added
We add 2 new dynamo backends, torchxla_trivial and torchxla_trace_once, to control the optimization targets.

torchxla_trivial simply moves inputs/model parameters to XLA and runs the model on XLA. There is tracing overhead for each run. We should expect the result to be mostly neutral compared to the XLA baseline.

torchxla_trace_once traces only once, at AOT compile time. Here are the steps:
1. dynamo captures the guards and the subgraph
2. the torchxla_trace_once backend traces the graph with torchxla, lowers the graph, and records a hash of the graph for later lookup
3. at inference time, the hash is used directly to look up the optimized graph and run it.

# Limitations
We cannot handle LTC/torchxla fallback right now. If an op is missing an LTC kernel, we raise an exception, which results in a dynamo fallback (or trying another compiler). People have brainstormed the idea of graph breaking and stitching the subgraphs together, but maybe it's easier to add the missing LTC kernels for those models.

# Results
The models we tested are those not causing LTC fallback. We ran the tests on **GPU**. We see a **1.38x** geomean speedup for torchxla_trace_once, and torchxla_trivial is mostly neutral, as expected.
```
| Model                   |   XLA (trace once) |   XLA (trace everytime) |
+=========================+====================+=========================+
| resnet18                |            1.346   |                 1.045   |
+-------------------------+--------------------+-------------------------+
| resnet50                |            1.153   |                 1.007   |
+-------------------------+--------------------+-------------------------+
| resnext50_32x4d         |            1.381   |                 1.039   |
+-------------------------+--------------------+-------------------------+
| alexnet                 |            1.045   |                 1.018   |
+-------------------------+--------------------+-------------------------+
| mobilenet_v2            |            1.562   |                 1.021   |
+-------------------------+--------------------+-------------------------+
| mnasnet1_0              |            1.303   |                 1.069   |
+-------------------------+--------------------+-------------------------+
| squeezenet1_1           |            1.278   |                 1.025   |
+-------------------------+--------------------+-------------------------+
| vgg16                   |            1.076   |                 1.008   |
+-------------------------+--------------------+-------------------------+
| BERT_pytorch            |            2.224   |                 0.978   |
+-------------------------+--------------------+-------------------------+
| timm_vision_transformer |            1.81    |                 1.025   |
+-------------------------+--------------------+-------------------------+
| geomean                 |            1.38101 |                 1.02324 |
+-------------------------+--------------------+-------------------------+
```

The speedup is similar to what we see from previous work for LTC's TorchScript backend (we see 1.40 geomean speedup there):
https://docs.google.com/presentation/d/1G09X8v41u_cLKLtSdf7v6R8G19-iZTPcW_VAdOnvYBI/edit#slide=id.g11bf989cb6b_1_5

# Next steps
- Use AOT autograd to enable training
- Share results on XLA devices
- Do more extensive tests on torchbench models

Example command
```
GPU_NUM_DEVICES=1 python benchmarks/dynamo/torchbench.py --randomize-input --performance --use-xla-baseline --only resnet18 --backend=torchxla_trace_once
```

Thanks @JackCaoG from the torchxla team for helping debug various perf issues and merging the torchxla PR! That was super critical for getting the results above. torchxla side PR: https://github.com/pytorch/xla/pull/4119

topic: not user facing

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @jansel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87741
Approved by: https://github.com/wconstab
2022-10-29 17:52:26 +00:00
6cd25eb6de Use TORCH_CHECK instead of inappropriate CUDA_KERNEL_ASSERT (#87714)
`CUDA_KERNEL_ASSERT` should only be used inside kernels; switch these bad usages to `TORCH_CHECK`
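
For context, host-side validation with `TORCH_CHECK` looks roughly like this (a schematic example with hypothetical checks, not the call sites touched by this PR): `TORCH_CHECK` throws a catchable `c10::Error` with a message, whereas `CUDA_KERNEL_ASSERT` only makes sense inside device code.

```cpp
#include <ATen/ATen.h>
#include <c10/util/Exception.h>

// Host-side validation: raises a proper error instead of a device-side assert.
void check_inputs(const at::Tensor& t) {
  TORCH_CHECK(t.is_cuda(), "expected a CUDA tensor, got ", t.device());
  TORCH_CHECK(t.dim() == 2, "expected a 2-D tensor, got ", t.dim(), " dims");
}
```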
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87714
Approved by: https://github.com/ezyang
2022-10-29 17:48:23 +00:00
384b84d6a6 [BE] Upload GHA artifacts to S3 (#87827)
This is exclusively used by macOS, ROCM (and any other future workflows) that don't have direct access to S3 to upload their artifacts

### Testing

Running the script locally with the personal GITHUB_TOKEN:

```
python3 -m tools.stats.upload_artifacts --workflow-run-id 3342375847 --workflow-run-attempt 1 --repo pytorch/pytorch

Using temporary directory: /var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb
Downloading sccache-stats-macos-12-py3-arm64-runattempt1-9155493770
Downloading sccache-stats-macos-12-py3-lite-interpreter-x86-64-runattempt1-9155493303
Downloading sccache-stats-macos-12-py3-x86-64-runattempt1-9155493627
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/sccache-stats-macos-12-py3-arm64-runattempt1-9155493770 to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/sccache-stats-macos-12-py3-arm64-9155493770
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/sccache-stats-macos-12-py3-lite-interpreter-x86-64-runattempt1-9155493303 to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/sccache-stats-macos-12-py3-lite-interpreter-x86-64-9155493303
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/sccache-stats-macos-12-py3-x86-64-runattempt1-9155493627 to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/sccache-stats-macos-12-py3-x86-64-9155493627
Downloading test-jsons-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip
Downloading test-jsons-runattempt1-test-default-1-2-macos-12_9155944815.zip
Downloading test-jsons-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip
Downloading test-jsons-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip
Downloading test-jsons-runattempt1-test-default-2-2-macos-12_9155944892.zip
Downloading test-jsons-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-1-2-linux.rocm.gpu_9155913429.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-1-2-macos-12_9155944815.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-1-2-macos-12_9155944815.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-1-2-macos-m1-12_9155888061.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-2-2-linux.rocm.gpu_9155913500.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-2-2-macos-12_9155944892.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-2-2-macos-12_9155944892.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-jsons-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-jsons-test-default-2-2-macos-m1-12_9155888182.zip
Downloading test-reports-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip
Downloading test-reports-runattempt1-test-default-1-2-macos-12_9155944815.zip
Downloading test-reports-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip
Downloading test-reports-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip
Downloading test-reports-runattempt1-test-default-2-2-macos-12_9155944892.zip
Downloading test-reports-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-1-2-linux.rocm.gpu_9155913429.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-1-2-macos-12_9155944815.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-1-2-macos-12_9155944815.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-1-2-macos-m1-12_9155888061.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-2-2-linux.rocm.gpu_9155913500.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-2-2-macos-12_9155944892.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-2-2-macos-12_9155944892.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/test-reports-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/test-reports-test-default-2-2-macos-m1-12_9155888182.zip
Downloading usage-log-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip
Downloading usage-log-runattempt1-test-default-1-2-macos-12_9155944815.zip
Downloading usage-log-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip
Downloading usage-log-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip
Downloading usage-log-runattempt1-test-default-2-2-macos-12_9155944892.zip
Downloading usage-log-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-1-2-linux.rocm.gpu_9155913429.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-1-2-linux.rocm.gpu_9155913429.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-1-2-macos-12_9155944815.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-1-2-macos-12_9155944815.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-1-2-macos-m1-12_9155888061.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-1-2-macos-m1-12_9155888061.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-2-2-linux.rocm.gpu_9155913500.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-2-2-linux.rocm.gpu_9155913500.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-2-2-macos-12_9155944892.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-2-2-macos-12_9155944892.zip
Upload /private/var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpxl6d7kcb/usage-log-runattempt1-test-default-2-2-macos-m1-12_9155888182.zip to s3://gha-artifacts/pytorch/pytorch/3342375847/1/artifact/usage-log-test-default-2-2-macos-m1-12_9155888182.zip
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87827
Approved by: https://github.com/clee2000
2022-10-29 17:40:07 +00:00
d9b6e41da9 Add composable activation checkpointing (#87664)
This is a composable activation checkpointing API. Unlike functional
activation checkpointing APIs, this one does not require changing
model source code. Unlike ``nn.Module`` wrapper activation checkpointing
APIs, this one does not modify model structure or fully-qualified names
either. Under the hood, it registers activation checkpointing logic as pre-
and post-forward hooks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87664
Approved by: https://github.com/zhaojuanmao
2022-10-29 17:35:58 +00:00
19171a21ee Make barrier blocking in UCC (#86961)
Currently the CUDA UCC barrier is nonblocking with respect to the CPU, and there is no flag to change that. To make UCC PG barrier behaviour consistent with NCCL PG, this PR changes barrier to always be blocking.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86961
Approved by: https://github.com/kwen2501
2022-10-29 16:33:18 +00:00
baa715e790 Unify meta tensor and fake tensor converter conversion (#87943)
Meta tensor does a lot of work to make sure tensors "look" similar
to the original parts; e.g., if the original was a non-leaf, meta
converter ensures the meta tensor is a non-leaf too. Fake tensor
destroyed some of these properties when it wrapped it in a FakeTensor.

This patch pushes the FakeTensor constructor into the meta converter
itself, so that we first create a fake tensor, and then we do various
convertibility bits to it to make it look right.

The two tricky bits:

- We need to have no_dispatch enabled when we allocate the initial meta
  tensor, or fake tensor gets mad at us for making a meta fake tensor.
  This necessitates the double-callback structure of the callback
  arguments: the meta construction happens *inside* the function so
  it is covered by no_dispatch

- I can't store tensors for the storages anymore, as that will result
  in a leak.  But we have untyped storage now, so I just store untyped
  storages instead.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87943
Approved by: https://github.com/eellison, https://github.com/albanD
2022-10-29 15:01:07 +00:00
4210cebc16 [ONNX] Add internal node kind parsing (#87638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87638
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2022-10-29 11:51:23 +00:00
cb05a4da39 [ONNX] Parametrized Avgpool2D test to have all test combinations (#87893)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87893
Approved by: https://github.com/BowenBao
2022-10-29 11:45:28 +00:00
f2ae459311 [ONNX] Disable ONNX ceil_mode and count_include_pad to align torch ceil_mode results in corner case (#87892)
ONNX and PyTorch have different equations for pooling and different strategies for ceil_mode, which leads to discrepancies in corner cases (#71549).
Specifically, PyTorch average pooling does not follow [the equation in the documentation](https://pytorch.org/docs/stable/generated/torch.nn.AvgPool2d.html); instead, it allows sliding windows to go off-bound if they start within the left padding or the input (see the NOTE section). More details can be found in #57178.

This PR changes avgpool in opsets 10 and 11 back to the opset 9 behavior, i.e. it stops using ceil_mode and count_include_pad in onnx::AveragePool.

A comprehensive test for all combinations of parameters can be found in the next PR. #87893
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87892
Approved by: https://github.com/BowenBao
2022-10-29 11:35:10 +00:00
c810489dd9 Cleanup macos common conda installation (#87816)
The conda dependencies have all been installed for `_mac-test` in https://github.com/pytorch/pytorch/pull/87541.  I missed the same step for `_mac-build` and `_mac-test-mps` workflows, so both are also updated here. Note that arm64 is cross-compiled from x86, so the env file needs to be set explicitly in that case

After this one, I have a WIP PR to consolidate macos pip dependencies next
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87816
Approved by: https://github.com/ZainRizvi
2022-10-29 08:43:45 +00:00
53fea90547 Store usage log on GitHub when S3 is not available (#87947)
It turns out that we haven't uploaded the usage log to GitHub when S3 is not available (macos, rocm), for example, https://github.com/pytorch/pytorch/actions/runs/3325822440#artifacts only includes test-report, test-json, sccache stats, and build artifacts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87947
Approved by: https://github.com/clee2000
2022-10-29 08:34:13 +00:00
d3c01c722d Fix pybind11 problems with c10::SymInt unregistered (#88011)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88011
Approved by: https://github.com/weiwangmeta, https://github.com/albanD
2022-10-29 07:55:45 +00:00
e667c00656 [FSDP()][2/N] Refactor training state (#87916)
This PR actually has meaningful changes. We stratify `TrainingState` into two levels: one is per FSDP instance and one is per `FlatParamHandle`/`FlatParameter`.
- At the FSDP instance level, we only care about `IDLE`, FSDP computation (i.e. `FORWARD_BACKWARD`), or `SUMMON_FULL_PARAMS`. These dynamically modify behavior (e.g. `summon_full_params()` forces full precision).
- At the `FlatParamHandle` level, we care about the training state for invariants and debugging. Hence, we keep `IDLE`, `FORWARD`, `BACKWARD_PRE`, `BACKWARD_POST`, and `SUMMON_FULL_PARAMS`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87916
Approved by: https://github.com/mrshenli
2022-10-29 06:50:30 +00:00
cbc9faebfe [FSDP()][1/N] Start refactoring FSDP root pre-forward (#87915)
Welcome! This PR starts the refactoring journey.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87915
Approved by: https://github.com/mrshenli
2022-10-29 06:50:30 +00:00
edd6cf9996 Revert "[ONNX] Deprecate operators.py (#87798)"
This reverts commit 88eff1072290177221e7a09d792f7f135b4c83ca.

Reverted https://github.com/pytorch/pytorch/pull/87798 on behalf of https://github.com/weiwangmeta due to breaking internal builds see D40797126
2022-10-29 06:48:12 +00:00
e3e84830aa [ONNX] Move all torch.onnx.export related tests to test/onnx (#87292)
Moving torch.onnx.export related tests to test/onnx integrates ONNX tests to the same CI machine, so the testing environment can be better managed.

Fixes https://github.com/pytorch/pytorch/issues/87320
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87292
Approved by: https://github.com/thiagocrepaldi, https://github.com/BowenBao, https://github.com/kit1980
2022-10-29 05:31:30 +00:00
1dad051b05 Move workspace related functions to separate file (#87651)
Move workspace related functions to separate file

Test Plan: Existing tests

Differential Revision: D40657708

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87651
Approved by: https://github.com/malfet
2022-10-29 04:52:01 +00:00
0cf572ff6c [C10D][BE] Add exception handlers to c10d collectives function (#87643) (#87988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87643

1. Add a decorator function exception_handlers to c10d collectives.
2. Update tests (torch/distributed/distributed_c10d.py) to include mp tests for exception_handler.

```
python3 test/distributed/test_c10d_error_logger.py
```

Test Plan: Test in OSS.

Reviewed By: H-Huang

Differential Revision: D40281632

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87988
Approved by: https://github.com/H-Huang
2022-10-29 04:38:34 +00:00
20e16c013f Allow caffe2 to build with fbcode/mode/mac (#87293)
Summary: The Mac contbuild builds under `fbcode/mode/mac`, which caffe2 fails to build under. This is because that build mode enforces protobuf v3. The caffe2 targets already account for this issue under `arvr` build modes by swapping out protobuf dependencies, but they don't account for the same issue under `fbcode/mode/mac`. This diff fixes that by checking for `is_fbcode_mac` in these situations (in addition to `arvr`).

Test Plan:
```
buck build --flagfile fbsource//fbcode/mode/mac fbsource//xplat/caffe2/...
```

Reviewed By: kimishpatel

Differential Revision: D39552724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87293
Approved by: https://github.com/kimishpatel
2022-10-29 04:20:57 +00:00
9835413009 Fake Tensor For (Conv) Propagation (#87641)
Resubmitting https://github.com/pytorch/pytorch/pull/87302 so it can be ghstack'd with the pr below.

Incorrect strides in any meta impl would lead to runtime assertion errors for fallback kernels, so start by just enabling it for conv.

Replaces https://github.com/pytorch/pytorch/pull/87588.

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87641
Approved by: https://github.com/jansel
2022-10-29 04:14:01 +00:00
14d5f139d2 Fix typos under benchmarks, test, and tools directories (#87975)
This PR fixes typos in `.md` files under benchmarks, test, and tools directories
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87975
Approved by: https://github.com/kit1980
2022-10-29 01:26:17 +00:00
18f3db2963 Fix functorch tests (#87914)
Test Plan: - Run tests

Differential Revision: D40777145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87914
Approved by: https://github.com/Chillee, https://github.com/osalpekar
2022-10-29 01:21:55 +00:00
af0c339f00 Disable slow-gradcheck tests (#88008)
Disable because slow-gradcheck tests take > 4 hrs and time out. Will need to figure out if and how to re-enable later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88008
Approved by: https://github.com/seemethere, https://github.com/huydhn
2022-10-29 00:23:50 +00:00
785054d3a9 [CI] Report build errors in Windows build step (#88001)
Should make failures like https://github.com/pytorch/pytorch/actions/runs/3346715682/jobs/5543900889 much more debuggable

P.S. I don't know how to write batch, just hope it's going to work

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88001
Approved by: https://github.com/seemethere
2022-10-28 23:59:49 +00:00
1eba3f220e Fix bugs found by static analysis (#85705)
This PR fixes a number of bugs found by the Svace static analyzer:

1. DEREF_AFTER_FREE at qnnpack_utils.h:
Pointer '&convolution->zero_buffer' is dereferenced at qnnpack_utils.h:258 after the referenced memory was deallocated at operator-delete.c:25 by passing as 1st parameter to function 'pytorch_qnnp_delete_operator' at qnnpack_utils.h:251.
2. DEREF_AFTER_NULL at impl.cpp:
After having been compared to NULL value at impl.cpp:1892, pointer 'schema' is passed as 2nd parameter in call to function 'c10::operator<<' at impl.cpp:1921, where it is dereferenced at function_schema_inl.h:13.
3. DEREF_OF_NULL  at stmt.h:
After having been compared to NULL value at stmt.h:744, pointer 'body->_M_ptr' is passed in call to function 'torch::jit::tensorexpr::malformed_input::malformed_input' at stmt.h:745, where it is dereferenced at exceptions.h:67.
4. DEREF_OF_NULL  at loopnest.h:
Pointer 'f->ptr' that can have only NULL value (checked at loopnest.cpp:1482), is passed in call to function 'torch::jit::tensorexpr::malformed_input::malformed_input' at loopnest.cpp:1483, where it is dereferenced at exceptions.h:67.
This is the same error as 3: forwarding a nullptr to malformed_input().
5. TAINTED_INT.LOOP in python_arg_parser:
Integer value 'this->size' obtained from untrusted source at python_arg_parser.cpp:118 without checking its bounds is used as a loop bound at python_arg_parser.cpp:698 by calling function 'torch::FunctionParameter::set_default_str' at python_arg_parser.cpp:133.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85705
Approved by: https://github.com/kit1980
2022-10-28 23:51:55 +00:00
376acf7625 Add 'share_from_this' to 'torch::jit::Graph' (#87343)
Avoid passing raw pointer of 'torch::jit::Graph' to python. Otherwise, it will corrupt the
`internals::registered_instance` of pybind11, caching a holder for python w.r.t the raw
pointer of 'torch::jit::Graph', while not increasing the use count of the existing shared_ptr.

The behavior afterwards is random and probably undefined. Most of the time it works: if the holder
is deallocated in time on the Python side, the cache entry is cleared from
`internals::registered_instance` and things go back to normal. Otherwise, it fails with either a
segfault or a runtime error with the message "Unable to cast from non-held to held instance". One
such scenario is normally and correctly returning a shared_ptr of that 'torch::jit::Graph' to
Python: pybind finds the holder via the cache, so the shared_ptr use_count does not increase, and
if there is no other use on the C++ side the graph is freed while Python still has access via the
previously created holder.
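
Conceptually this is the standard `enable_shared_from_this` pattern (a simplified sketch, not the actual `torch::jit::Graph` code): any path that exposes the graph to Python hands out a `shared_ptr` sharing the existing control block rather than a raw pointer.

```cpp
#include <memory>

// Simplified stand-in for torch::jit::Graph.
struct Graph : std::enable_shared_from_this<Graph> {
  // Anything that exposes "this" (e.g. to a pybind11 holder) returns a
  // shared_ptr sharing the existing control block instead of a raw pointer,
  // so Python ownership actually keeps the graph alive.
  std::shared_ptr<Graph> getSharedThis() {
    return shared_from_this();
  }
};
```

The precondition is that the object is already owned by a `std::shared_ptr`; calling `shared_from_this()` otherwise is undefined behavior (it throws `std::bad_weak_ptr` since C++17).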

@t-vi had a great analysis and solution to this exact problem at #51833 which I hope
I had seen before debugging this issue... ~~I'm building the PR based on the original
commit. @t-vi please let me know if you'd prefer otherwise.~~ Sending the PR separately
due to CLA issues.

Need to check in CI if adding `enable_shared_from_this` breaks other stuff.

Fixes #51833, and CI issues in #87258, #86182.

cc @malfet, @kit1980 for changes on JIT IR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87343
Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/malfet
2022-10-28 23:51:44 +00:00
ecf277abec [quant][improvement] Check the fixedqparam op qconfig based on backend_config (#87425)
Summary:
Previously we hardcoded the supported observers for fixedqparam ops, this PR changes that to take the information from BackendConfig,
this allows users to customize the support for fixed qparam ops

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_change_backend_config_for_fixed_qparam_ops

Reviewers:

Subscribers:

Tasks:

Tags:

unlinked from diff since it's too hard to land
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87425
Approved by: https://github.com/andrewor14
2022-10-28 23:38:40 +00:00
c3c817c972 Revert "ci: Switch merge / revert flow to our own infra" (#88016) 2022-10-28 15:12:31 -07:00
a2ffc3be97 [AC] Add trailing "." to _CHECKPOINT_PREFIX like FSDP (#87951)
This is for consistency with FSDP.
- `_FSDP_WRAPPED_MODULE` and `_CHECKPOINT_WRAPPED_MODULE` are exactly the wrapped module variable name, meaning you can call `getattr(module, _FSDP_WRAPPED_MODULE)` or `getattr(module, _CHECKPOINT_WRAPPED_MODULE)`.
- `_FSDP_PREFIX` and `_CHECKPOINT_PREFIX` include the trailing `"."` and are only used for FQNs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87951
Approved by: https://github.com/zhaojuanmao
2022-10-28 22:05:29 +00:00
4faf086e5f Update build scripts for ninja and ROCm5.3 install (#87505)
cc @jeffdaily @sunway513 @ROCmSupport
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87505
Approved by: https://github.com/seemethere
2022-10-28 22:05:12 +00:00
349ad23ffb ci: Switch merge / revert flow to our own infra (#88009) 2022-10-28 14:37:55 -07:00
9691ba2dbd Remove excess exception logging for minifier, cleanup backend failure exception format (#87537)
Fixes https://github.com/pytorch/torchdynamo/issues/1376

Ensures exceptions are printed only in one place, once.

implements some of the ideas from https://github.com/pytorch/torchdynamo/issues/1754
- Attaches a field to the exception which indicates that it's minified; a usage message is printed if this field is present

cc @jansel @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87537
Approved by: https://github.com/anijain2305
2022-10-28 21:33:55 +00:00
1c37119a1f [FSDP] New fix for composing with other module wrappers (#87950)
We change `.module` to pass through `ActivationWrapper` directly to the inner wrapped module. This should fix the state dict issues.

Given the invariant that `.module` always returns the inner wrapped module, FSDP always registers the `FlatParameter` on the inner wrapped module, regardless of if there is an intermediate `ActivationWrapper` or not. This avoids casing on whether `ActivationWrapper` is added before or after FSDP construction.

This PR removes the added unit test in `test_fsdp_misc.py` for changing the wrapped module because I would rather not complicate `_lazy_init()` logic just to support that kind of adversarial behavior. The user should not be swapping out the wrapped module arbitrarily or deleting the `FlatParameter`. I mainly had those tests to make sure that all branches of the code I added were correct.

Differential Revision: [D40799961](https://our.internmc.facebook.com/intern/diff/D40799961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87950
Approved by: https://github.com/zhaojuanmao
2022-10-28 21:11:40 +00:00
c2c269c10a Convert MetaConverter's tensor memo into a weak value dictionary. (#87911)
This is in preparation for unifying fake tensor converter and meta converter's memo tables.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87911
Approved by: https://github.com/eellison
2022-10-28 21:05:13 +00:00
e72962a34d Force people to call from_meta_and_device directly (#87903)
It was pretty hard to tell at the call site whether I was doing a device meta
convert or not. This gets rid of the "dual" API and forces people
to call the method manually for the device case.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87903
Approved by: https://github.com/eellison, https://github.com/albanD
2022-10-28 21:05:13 +00:00
ab8fbd26f8 Advance nightly docker to 11.6 (#87858)
Fixes following:
https://github.com/pytorch/pytorch/actions/runs/3242695506/jobs/5316334351
crash in Docker builds introduced by: #82682

That PR seems to introduce some changes not compatible with cuda 11.3, which is used by our Docker builds.

This is a reland of the original PR: https://github.com/pytorch/pytorch/pull/86941 (created this new PR to start fresh),
which was reverted because the conda install installed the wrong version of pytorch: it still installed pytorch for cuda 11.3 rather than 11.6.

This should be fixed now with Release 1.13
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87858
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/izaitsevfb
2022-10-28 19:55:33 +00:00
c5cb6ec066 Allow 64bit indexing for channels-last upsample2d on CUDA (#87901)
#81665

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87901
Approved by: https://github.com/ngimel
2022-10-28 19:33:42 +00:00
fb64f7b804 [Profiler][Trivial] Move ID assignment code to data_flow.cpp (#87670)
ID assignment has become a very complex facet of the profiler. The existing code has grown organically as I've discovered various refinements and has become very difficult to understand or reason about. (With more complexity coming in https://github.com/pytorch/pytorch/pull/87133)

I want to take a step back and add some structure and additional comments to the ID assignment algorithm. Before I do, however, it's time to move it out of `collection.cpp` to a dedicated data flow file.

Differential Revision: [D40666360](https://our.internmc.facebook.com/intern/diff/D40666360/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40666360/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87670
Approved by: https://github.com/slgong-fb
2022-10-28 18:40:18 +00:00
8d395ec6bc [Profiler][Trivial] Add hashing struct for pairs and tuples. (#87668)
There is a fairly simple and commonly used hash_combine in c10/util; however in order to use it in a map we need to wrap it in a hashing struct. By defining template functions we also get recursive unpacking for free. (A later PR will want to hash a `tuple<tuple<T0, T1>, tuple<T0, T1>>`)
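
A minimal sketch of such a wrapping struct (using a local boost-style `hash_combine` stand-in so the snippet is self-contained; the real code would use the c10 helper):

```cpp
#include <cstddef>
#include <functional>
#include <tuple>
#include <unordered_map>
#include <utility>

// Local stand-in for c10's hash_combine (boost-style mixing).
inline std::size_t hash_combine(std::size_t seed, std::size_t value) {
  return seed ^ (value + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

struct TupleHash {
  template <typename T>
  std::size_t operator()(const T& t) const {
    return std::hash<T>{}(t);  // base case: defer to std::hash
  }
  template <typename A, typename B>
  std::size_t operator()(const std::pair<A, B>& p) const {
    return hash_combine((*this)(p.first), (*this)(p.second));
  }
  template <typename... Ts>
  std::size_t operator()(const std::tuple<Ts...>& t) const {
    std::size_t seed = 0;
    std::apply(
        [&](const auto&... elems) {
          // recursive unpacking: nested pairs/tuples hit the overloads above
          ((seed = hash_combine(seed, (*this)(elems))), ...);
        },
        t);
    return seed;
  }
};

// Usage: key a map on nested tuples, e.g. tuple<tuple<T0, T1>, tuple<T0, T1>>.
using Key = std::tuple<std::tuple<int, int>, std::tuple<int, int>>;
std::unordered_map<Key, int, TupleHash> memo;
```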

Differential Revision: [D40666359](https://our.internmc.facebook.com/intern/diff/D40666359/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87668
Approved by: https://github.com/slgong-fb
2022-10-28 18:40:18 +00:00
d13b6781d8 Revert "[fx][subgraph_rewriter] Change match_filter to be a List in replace_pattern_with_filters (#87257)"
This reverts commit 58650835bb91d927623e6bff5cc4844fbcad6368.

Reverted https://github.com/pytorch/pytorch/pull/87257 on behalf of https://github.com/weiwangmeta due to breaking internal builds/BC-breaking change
2022-10-28 17:55:19 +00:00
fc21b9db23 Use Eager Code To Determine Conv Layout (#87305)
The logic for determining the conv backend, and therefore the output striding, is very complex. It depends on build settings, input striding/contiguity, sizes, etc. Eventually we should port that logic to the meta impl for dynamic shapes, but that will require a lot more work and keeping the implementations in sync. See https://github.com/pytorch/torchdynamo/issues/1701

This is a prerequisite to removing the inductor conv stride propagation and more general fake tensor for inductor propagation. In that PR, the meta impls for cpu conv give incorrect striding which led to test failures (https://github.com/pytorch/pytorch/pull/87083).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87305
Approved by: https://github.com/ezyang
2022-10-28 16:37:04 +00:00
1bc0e923bb add special case for power of 0.5 (#87912)
Workaround for https://github.com/pytorch/torchdynamo/issues/1775, and calling sqrt is better in any case, but `libdevice.pow` still for some reason doesn't work if both arguments are scalars

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @mreso, can you please check if that takes you further with diffusers

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87912
Approved by: https://github.com/desertfire
2022-10-28 16:09:25 +00:00
35c611d30f Add mem efficient backend flag (#87946)
# Summary
Add a torch.backends.cuda flag and update the context manager to pick between the three implementations of scaled_dot_product_attention.

cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87946
Approved by: https://github.com/cpuhrsch
2022-10-28 15:51:10 +00:00
89fd451934 Fix codeowner errors (#87954)
Error message: "Unknown owner: make sure @mingzhe09088 exists and has
write access to the repository."
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87954
Approved by: https://github.com/wangkuiyi
2022-10-28 15:19:09 +00:00
8a9aca7b8d Reland 2 Many symintifications (#87604) (#87980)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87980
Approved by: https://github.com/ezyang
2022-10-28 13:40:11 +00:00
ce3e0e9856 Add state to distributed composable API (#87838)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87838
Approved by: https://github.com/yhcharles
2022-10-28 13:31:40 +00:00
b192e7e415 Support non-contiguous NestedTensors for elementwise ops (#87888)
Enables benchmarking of math path of sdp kernel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87888
Approved by: https://github.com/drisspg
2022-10-28 11:26:17 +00:00
f150e70ca2 add the function specialization for promote with ITensorListRef (#87756)
Fixes [#87684](https://github.com/pytorch/pytorch/issues/87684)
This is due to a new tensor list type, `ITensorListRef`, being introduced. We need function specializations of `prioritize` and `cached_cast` for this new tensor list type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87756
Approved by: https://github.com/jgong5, https://github.com/ezyang
2022-10-28 10:30:30 +00:00
166b5d3e7c Revert "[EZ] Fix simple bug in torchdynamo (#87821)"
This reverts commit ce7fcab9bdf61a34bc56b7cd45a882e4ad6ba175.

Reverted https://github.com/pytorch/pytorch/pull/87821 on behalf of https://github.com/kit1980 due to Broke many dynamo tests https://github.com/pytorch/pytorch/actions/runs/3341984303/jobs/5534381456
2022-10-28 06:11:42 +00:00
78b406932f Add me to reviewers of composable API changes (#87891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87891
Approved by: https://github.com/mrshenli
2022-10-28 05:11:39 +00:00
1da5aeb97b [dynamo] Error when user nests FX with dynamo (#87797)
Today, this doesn't work and dynamo errors out in a very non-obvious way (see:
https://gist.github.com/suo/dde04830372ab51a4a34ea760f14200a).

Here, we detect the error early and exit with a nicer msg. Also add a
config option to just no-op dynamo (which is needed to unblock internal
enablement).

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87797
Approved by: https://github.com/yf225, https://github.com/soumith, https://github.com/jansel
2022-10-28 04:59:08 +00:00
07f7c4615b [MKLDNN] Replace pooling algorithm pooling_avg with pooling_avg_exclude_padding for future oneDNN upgrades (#87851)
**Description**
Replace pooling algorithm `pooling_avg` with `pooling_avg_exclude_padding` in implementation of mkldnn pooling. It's only a change of names, not algorithm. The former is an alias of the latter and it will be removed in future oneDNN library upgrades.
This change has no effect on functionality or performance.

**Validation**
Covered by UT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87851
Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper
2022-10-28 04:58:55 +00:00
23b79e6f48 Update CMakeLists.txt (#87030)
Fix Caffe2_CPU_INCLUDE being mixed up with Caffe2_GPU_INCLUDE: when expanding a variable to the parent scope, the same variable name should be used. This fix corrects compilation in certain build configurations.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87030
Approved by: https://github.com/kit1980
2022-10-28 04:56:40 +00:00
daff5d3556 Fix typos under caffe2 directory (#87840)
This PR fixes typos in `.md` files under caffe2 directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87840
Approved by: https://github.com/kit1980
2022-10-28 04:53:36 +00:00
e8a97a3721 FakeTensorMode and Prims.add/sub/mul/div support scalar only inputs (#87759)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87759
Approved by: https://github.com/ngimel, https://github.com/mruberry, https://github.com/eellison
2022-10-28 04:34:25 +00:00
d47ffecbe4 [dynamo] relax fake tensor restriction with assume_constant_result (#87895)
This works now because of https://github.com/pytorch/pytorch/pull/87091,
so don't error out anymore.

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87895
Approved by: https://github.com/tugsbayasgalan, https://github.com/voznesenskym
2022-10-28 04:05:06 +00:00
2e48b478e0 [ROCm] Use -rpath-link to fix libtinfo conflict (#83552)
Fixes an issue building PyTorch for ROCm 5.3 and above on Ubuntu 20.04, where libtinfo6 from conda conflicts with the one from the distro, causing symbol-not-found errors.

cc @jeffdaily @sunway513 @ROCmSupport
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83552
Approved by: https://github.com/malfet, https://github.com/pruthvistony
2022-10-28 03:50:43 +00:00
9c793b366f Move incorrectly placed closing curly brace of extern "C" block (#87853)
### Bug description
When `__SYCL_DEVICE_ONLY__` is defined, while building PyTorch, the output of the preprocessing step would not have the closing curly brace of the `extern "C"` block, as it has been incorrectly placed. Compilers don't seem to report an error or a warning for a missing closing brace of an `extern "C"` block.

### Impact of the bug
If `c10/macros/Macros.h` is included in a C++ file and, after the preprocessing stage, the preprocessed source file has some templated code after `extern "C" {`, then, after compilation, linking might fail with the error `templates must have c++ linkage`, e.g. https://stackoverflow.com/questions/61717819/template-with-c-linkage-error-when-using-template-keyword-in-main-cpp/61717908#61717908 (its answer also has a small snippet of code to reproduce such an issue).

### Solution in this PR
A one-liner bug fix that rectifies the placement of the closing curly brace (`}`), so that the `extern "C"` block ends properly when `__SYCL_DEVICE_ONLY__` is defined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87853
Approved by: https://github.com/jgong5, https://github.com/kit1980, https://github.com/malfet
2022-10-28 03:42:20 +00:00
13de4d2137 Meta OpInfo Test for stride correctness (#87849)
Failing test logs here
https://gist.github.com/SherlockNoMad/a7e132f3cb4152900f8a6d7df358c59e
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87849
Approved by: https://github.com/eellison
2022-10-28 03:40:14 +00:00
8b4d95759c Revert "Many symintifications (#87604)"
This reverts commit 777e6a2c5100f3274cff1bcf7e47ccbe1a651927.

Reverted https://github.com/pytorch/pytorch/pull/87604 on behalf of https://github.com/weiwangmeta due to breaking internal builds
2022-10-28 03:00:11 +00:00
2cb7c3f865 [dynamo][benchmarks] Prepone Cold start setup (#87913)
Parallel compilation warms the threadpool when we call `torch._dynamo.optimize()`. In the current benchmarks, we were setting up the TRITON_CACHE_DIR much later. Because of this, parallel compilation artifacts were not used and compilation latency improvements were not visible in the dashboard. This PR just prepones the setup of TRITON_CACHE_DIR.
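A minimal sketch of the ordering this PR is about (a hedged illustration, not the benchmark harness itself; the cache path is made up): the environment variable needs to be in place before `torch._dynamo.optimize()` warms the compilation threadpool.

```py
import os

# Point Triton's cache at a shared location *before* dynamo warms its
# compilation threadpool, so parallel-compilation artifacts get reused.
os.environ["TRITON_CACHE_DIR"] = "/tmp/shared_triton_cache"  # hypothetical path

import torch
import torch._dynamo


def fn(x):
    return (x * 2).relu()


opt_fn = torch._dynamo.optimize("inductor")(fn)  # threadpool is warmed here
opt_fn(torch.randn(8))
```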

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87913
Approved by: https://github.com/wconstab
2022-10-28 02:41:13 +00:00
641d8e0e69 Revert "Enable mypy check for distributed.py, and fix type errors (#87543)"
This reverts commit 2cc624cd4318414905d2475432aee13db9031cc6.

Reverted https://github.com/pytorch/pytorch/pull/87543 on behalf of https://github.com/weiwangmeta due to breaking internal builds
2022-10-28 02:20:25 +00:00
f967918411 [AC] Return None from apply_activation_checkpointing() (#87871)
`_recursive_wrap()` returns `Tuple[nn.Module, int]`, where the `nn.Module` is the in-place modified module and the `int` is the numel wrapped. In that sense, the return value is not meant to be publicly used. The `apply_activation_checkpointing()` docs already suggest that the function returns `None`, so this PR simply follows that.
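A minimal usage sketch of the documented behavior (the `check_fn` argument is assumed here purely for illustration): the module is modified in place and the call returns `None`.

```py
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
ret = apply_activation_checkpointing(model, check_fn=lambda m: isinstance(m, nn.Linear))

assert ret is None  # nothing useful is returned ...
print(model)        # ... the checkpoint wrapping happened in place
```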

**Test Plan**
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87871
Approved by: https://github.com/zhaojuanmao
2022-10-28 02:00:39 +00:00
81c4049f4d [Static Runtime] Move PrepackWeights to internal-only graph passes (#87799)
Summary:
The pass introduces an `fb::` operator and thus cannot be used in OSS.

The test failure was not exposed because the Static Runtime tests have been disabled in OSS for a while. The Dev Infra folks encountered this failure when re-enabling the tests.

Test Plan: Existing tests

Differential Revision: D40724547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87799
Approved by: https://github.com/huydhn
2022-10-28 01:28:34 +00:00
ce7fcab9bd [EZ] Fix simple bug in torchdynamo (#87821)
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87821
Approved by: https://github.com/voznesenskym, https://github.com/jansel
2022-10-28 00:52:00 +00:00
fd27246c16 Fix decomposition for std (#87181)
The previous implementation was lacking a few features and incurred a pretty large error.
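For reference, a hedged sketch of the identity any `std` decomposition has to reproduce (the textbook formula, not the decomposition code itself):

```py
import torch

x = torch.randn(5, 7, dtype=torch.float64)

ref = x.std(dim=1, unbiased=True)
# Unbiased sample standard deviation: sqrt(sum((x - mean)^2) / (N - 1))
manual = ((x - x.mean(dim=1, keepdim=True)).pow(2).sum(dim=1) / (x.shape[1] - 1)).sqrt()

torch.testing.assert_close(ref, manual)
```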

cc @ezyang @mruberry @ngimel @Lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87181
Approved by: https://github.com/ngimel, https://github.com/peterbell10
2022-10-28 00:50:29 +00:00
f21d0b310c Add decomposition for diagonal_scatter (#87282)
cc @ezyang @mruberry @ngimel @Lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87282
Approved by: https://github.com/mruberry
2022-10-28 00:50:29 +00:00
9225f26176 [FSDP] Fix wrapped module changing after ctor (#87837)
Recently, I retired `FlattenParamsWrapper`, which meant that FSDP registers its `FlatParameter` on the wrapped module instead of the `FlattenParamsWrapper` instance. This is only relevant for `use_orig_params=False`.

If the user changes an FSDP instance's wrapped module after the FSDP constructor, then the `FlatParameter` is no longer registered on the wrapped module. This can cause issues for full state dict, which checks if the `FlatParameter` is currently registered as an early return condition for `rank0_only=True`.

The solution in this PR is to re-establish the wrapped module in `_lazy_init()`, de-registering from the old wrapped module and re-registering to the new wrapped module, where the assumption is that the user should not modify the module structure upon `_lazy_init()`.

The direct access to the private attribute `_parameters` from `nn.Module` is not ideal, but we already rely on it for the dynamic `FlatParameter` registration. The tradeoff is whether we want an additional `nn.Module` wrapper (`FlattenParamsWrapper`) and use `delattr` plus a singleton list to do the dynamic registration or we want to access `_parameters`. If this becomes a problem, we can work with Core team on a solution.
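A minimal sketch (plain `nn.Module`s, not FSDP's actual code) of the dynamic de-register/re-register dance described above:

```py
import torch
import torch.nn as nn

old_wrapped = nn.Linear(4, 4)
new_wrapped = nn.Linear(4, 4)

flat_param = nn.Parameter(torch.zeros(8))
old_wrapped.register_parameter("flat_param", flat_param)

# De-register from the old wrapped module and re-register on the new one,
# mirroring what the lazy re-establishment has to do.
del old_wrapped._parameters["flat_param"]
new_wrapped.register_parameter("flat_param", flat_param)

assert "flat_param" not in dict(old_wrapped.named_parameters())
assert "flat_param" in dict(new_wrapped.named_parameters())
```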
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87837
Approved by: https://github.com/zhaojuanmao
2022-10-28 00:43:18 +00:00
7a3afe61d2 Check all CUDA API calls for errors in caffe2/ (#81816)
Test Plan: Sandcastle

Differential Revision: D35194868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81816
Approved by: https://github.com/ezyang
2022-10-28 00:41:06 +00:00
3ece9fb45d Check all CUDA API calls for errors in torch/ (#81560)
Summary:
Original commit changeset: 0bb770d2cdb2

Original Phabricator Diff: D35194935 (79e5b053b6)

Differential Revision: D35291874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81560
Approved by: https://github.com/ezyang
2022-10-28 00:40:48 +00:00
4e3a0ff92e Update how inductor cpu tests are skipped on fbcode (#87867)
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87867
Approved by: https://github.com/anijain2305
2022-10-28 00:33:54 +00:00
6cc4ae3d2d Revert "[Inductor] Enable Inductor unspec inputs test for different dtypes (#87809)"
This reverts commit 369755f8ce1b043c88efbc50ee09c0258dec5162.

Reverted https://github.com/pytorch/pytorch/pull/87809 on behalf of https://github.com/kit1980 due to Broke trunk / cuda11.6-py3.10-gcc7-sm86 / test (default, 4, 4, linux.g5.4xlarge.nvidia.gpu), same error on pull.
2022-10-27 23:55:59 +00:00
cda0d5a57b Revert "[dynamo] Error when user nests FX with dynamo (#87797)"
This reverts commit a485528a7e4551461d57db3deb8b40c2acea08d2.

Reverted https://github.com/pytorch/pytorch/pull/87797 on behalf of https://github.com/kit1980 due to Broke linux-bionic-py3.7-clang9 / test (dynamo, 2, 2, linux.2xlarge), same error on pull
2022-10-27 21:16:58 +00:00
6ad3543a1b BE: Improve test_will_engine_execute_node unittest (#87806)
Adds the test from https://github.com/pytorch/pytorch/pull/86672

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87806
Approved by: https://github.com/albanD
2022-10-27 21:13:08 +00:00
0f7df16c71 [doc] Add out-kwarg documentation to torch.where (#87870)
Fixes #87862
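For reference, a quick illustration of the `out=` keyword argument being documented:

```py
import torch

cond = torch.tensor([True, False, True])
a = torch.ones(3)
b = torch.zeros(3)

out = torch.empty(3)
torch.where(cond, a, b, out=out)  # result is written into `out`
print(out)                        # tensor([1., 0., 1.])
```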

cc: @lezcano

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87870
Approved by: https://github.com/lezcano
2022-10-27 21:03:42 +00:00
46b16977d9 Reimplement Kaiser window (#87330)
Relates to #85366

- For reference follow #87082.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87330
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-27 21:01:01 +00:00
369755f8ce [Inductor] Enable Inductor unspec inputs test for different dtypes (#87809)
Fixes #ISSUE_NUMBER

cc @jansel @mlazos @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87809
Approved by: https://github.com/ngimel
2022-10-27 20:58:48 +00:00
1ff52225f1 Unify SymIntNode and SymFloatNode into SymNode (#87817)
This refactor was prompted by challenges handling mixed int/float
operations in C++.  A previous version of this patch
added overloads for each permutation of int/float and was unwieldy
https://github.com/pytorch/pytorch/pull/87722/  This PR takes a different
approach.

The general outline of the patch is to combine the C++ types SymIntNode
and SymFloatNode into a single type, SymNode.  This is type erased; we
no longer know statically at C++ if we have an int/float and have to test
it with the is_int()/is_float() virtual methods.  This has a number of
knock on effects.

- We no longer have C++ classes to bind to Python.  Instead, we take an
  entirely new approach to our Python API, where we have a SymInt/SymFloat
  class defined entirely in Python, which hold a SymNode (which corresponds
  to the C++ SymNode).  However, SymNode is not pybind11-bound; instead,
  it lives as-is in Python, and is wrapped into C++ SymNode using PythonSymNode
  when it goes into C++.  This implies a userland rename.

  In principle, it is also possible for the canonical implementation of SymNode
  to be written in C++, and then bound to Python with pybind11 (we have
  this code, although it is commented out.)  However, I did not implement
  this as we currently have no C++ implementations of SymNode.

  Because we do return SymInt/SymFloat from C++ bindings, the C++ binding
  code needs to know how to find these classes.  Currently, this is done
  just by manually importing torch and getting the attributes.

- Because SymInt/SymFloat are easy Python wrappers, __sym_dispatch__ now
  takes SymInt/SymFloat, rather than SymNode, bringing it in line with how
  __torch_dispatch__ works.

Some miscellaneous improvements:

- SymInt now has a constructor that takes SymNode.  Note that this
  constructor is ambiguous if you pass in a subclass of SymNode,
  so an explicit downcast is necessary.  This means toSymFloat/toSymInt
  are no more.  This is a mild optimization as it means rvalue reference
  works automatically.

- We uniformly use the caster for c10::SymInt/SymFloat, rather than
  going the long way via the SymIntNode/SymFloatNode.

- Removed some unnecessary toSymInt/toSymFloat calls in normalize_*
  functions, pretty sure this doesn't do anything.

- guard_int is now a free function, since to guard on an int you cannot
  assume the method exists.  A function can handle both int and SymInt
  inputs.

- We clean up the magic method definition code for SymInt/SymFloat/SymNode.
  ONLY the user classes (SymInt/SymFloat) get magic methods; SymNode gets
  plain methods; this is to help avoid confusion between the two types.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87817
Approved by: https://github.com/albanD, https://github.com/anjali411
2022-10-27 20:56:02 +00:00
2205f56f46 [LTC] Remove lazy::View (#87822)
Summary:
This is the first part to remove the whole view and aliasing infrastructure in LTC, which is deprecated in favor of functionalization. It mainly removes things that use lazy::View.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87822
Approved by: https://github.com/JackCaoG, https://github.com/antoniojkim, https://github.com/wconstab
2022-10-27 20:39:30 +00:00
83b381d34d [dynamo] add inductor runs w/o cudagraphs (#87847)
as title

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87847
Approved by: https://github.com/jansel
2022-10-27 19:49:29 +00:00
d2d0be9a76 fix typo in per sample grad test (#87790)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87790
Approved by: https://github.com/zou3519
2022-10-27 19:43:44 +00:00
b8b1d7be24 [dynamo] Add ao.nn to skipfiles inline allowlist (#87820)
Summary:

Allow torch.ao.nn module to be inlined

Test Plan:

Tested manually for https://github.com/pytorch/torchdynamo/issues/1737

Reviewers:

Subscribers:

Tasks:

Tags:

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx

Differential Revision: [D40768679](https://our.internmc.facebook.com/intern/diff/D40768679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87820
Approved by: https://github.com/jansel
2022-10-27 18:46:54 +00:00
a485528a7e [dynamo] Error when user nests FX with dynamo (#87797)
Today, this doesn't work and dynamo errors out in a very non-obvious way (see:
https://gist.github.com/suo/dde04830372ab51a4a34ea760f14200a).

Here, we detect the error early and exit with a nicer message. Also add a
config option to just no-op dynamo (which we need to unblock internal
enablement).

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87797
Approved by: https://github.com/yf225, https://github.com/soumith, https://github.com/jansel
2022-10-27 17:17:59 +00:00
f1b78224ca Fix type promotion for 2 wrapped scalar args (#87845)
Fixes #76801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87845
Approved by: https://github.com/SherlockNoMad, https://github.com/mruberry
2022-10-27 15:53:11 +00:00
03d6af4db3 add nesting to TORCH_SHOW_DISPATCH_TRACE (#87751)
Added indents to `TORCH_SHOW_DISPATCH_TRACE` so that you more easily see the call tree from the dispatcher. Definitely slower, but it's all guarded under the `DEBUG` build. Example output:

I know we have the PyDispatcher now, but I still found this helpful for debugging

```
 [call] op=[aten::ones], key=[BackendSelect]
  [redispatch] op=[aten::ones], key=[CPU]
   [call] op=[aten::empty.memory_format], key=[BackendSelect]
    [redispatch] op=[aten::empty.memory_format], key=[CPU]
   [call] op=[aten::fill_.Scalar], key=[CPU]
 [call] op=[aten::clone], key=[AutogradCPU]
  [redispatch] op=[aten::clone], key=[CPU]
   [call] op=[aten::empty_strided], key=[BackendSelect]
    [redispatch] op=[aten::empty_strided], key=[CPU]
   [call] op=[aten::copy_], key=[CPU]
 [call] op=[aten::view], key=[PythonTLSSnapshot]
  [redispatchBoxed] op=[aten::view], key=[AutogradCPU]
   [redispatch] op=[aten::view], key=[ADInplaceOrView]
    [redispatch] op=[aten::view], key=[Functionalize]
     [call] op=[aten::view], key=[PythonTLSSnapshot]
      [redispatchBoxed] op=[aten::view], key=[Meta]
     [call] op=[aten::view], key=[PythonTLSSnapshot]
      [redispatchBoxed] op=[aten::view], key=[Python]
       [callBoxed] op=[aten::view], key=[CPU]
 [call] op=[aten::clone], key=[PythonTLSSnapshot]
  [redispatchBoxed] op=[aten::clone], key=[AutogradCPU]
   [redispatch] op=[aten::clone], key=[Functionalize]
    [callBoxed] op=[aten::clone], key=[PythonTLSSnapshot]
     [redispatchBoxed] op=[aten::clone], key=[Python]
      [callBoxed] op=[aten::clone], key=[CPU]
       [call] op=[aten::empty_strided], key=[BackendSelect]
        [redispatch] op=[aten::empty_strided], key=[CPU]
       [call] op=[aten::copy_], key=[CPU]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87751
Approved by: https://github.com/ezyang, https://github.com/zou3519
2022-10-27 15:47:56 +00:00
23ff47ccc5 functionalization: fix detach() (#87750)
`.detach()` worked in basic cases previously, but didn't properly preserve view relationships between the base and the output. This wasn't heavily tested, because autograd doesn't normally encounter `FunctionalTensorWrapper` directly, but could become more common if we fuse functionalization and autograd into a single tracing pass.

This will also be a bug fix for LTC (and XLA when they use functionalization)
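For context, a small sketch of the base/view relationship that `detach()` is expected to preserve (ordinary eager tensors shown; the fix is about keeping the same relationship through `FunctionalTensorWrapper`):

```py
import torch

base = torch.zeros(4)
view = base.detach()

assert view.data_ptr() == base.data_ptr()  # detach() returns a view on the same storage
view[0] = 1.0
assert base[0].item() == 1.0               # mutations through the view are visible in the base
```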

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87750
Approved by: https://github.com/ezyang
2022-10-27 15:47:56 +00:00
e2bbc0a134 [BE] Move remaining workflows off Xenial (#87834)
Both BE and prerequisite for moving our CI/CD to C++17 compiler (gcc-5.4
is not fully C++17 compliant)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87834
Approved by: https://github.com/weiwangmeta, https://github.com/kit1980, https://github.com/huydhn
2022-10-27 15:38:48 +00:00
1e1b045128 [ROCM] Enable Sparse Pickle Test (#82729)
Missed stream context for serialization

### Description
Missing ROCm stream context on memory operations for serialization

### Testing
Ran the sparse pickle test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82729
Approved by: https://github.com/ngimel
2022-10-27 15:11:28 +00:00
aaba0bd306 [JIT] Fix torch.jit.script for functions with many decorators (#87804)
Summary:
Python's function parsing from the `ast` module records the line number of the function definition, not that of the first decorator. So this diff fixes crashes like this:

```
IndexError: vector::_M_range_check: __n (which is 10) >= this->size() (which is 8)
```

Test Plan: New unit test

Differential Revision: D40726352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87804
Approved by: https://github.com/tugsbayasgalan, https://github.com/davidberard98
2022-10-27 12:29:51 +00:00
1780e0ef7f [complex] conv_transpose2d (#81805)
Reference: https://github.com/pytorch/pytorch/issues/71108

Fixes : #86414
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81805
Approved by: https://github.com/anjali411
2022-10-27 10:46:53 +00:00
c36db82e12 TorchDynamo: Add convolution unary fusion for cpu in inference mode (#87063)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87063
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-10-27 06:55:32 +00:00
b16b5fb802 [Profiler] Hold weak reference to prevent TensorImpl address reuse during profiling. (#87244)
A recurring problem with assigning Tensor IDs is that we want to preserve identity when storage changes but we don't observe TensorImpl destruction so identity assignment is not robust to the ABA problem with respect to TensorImpl*. ~TensorImpl is far too hot to instrument; even adding a call to a no-op function in a different compilation unit increases overhead by tens of percent. (OSS builds do not have any sort of LTO.)

Fortunately there is a solution. A PyTorch Tensor is a `c10::intrusive_ptr<c10::TensorImpl>`, which in turn holds a storage (which is a `c10::intrusive_ptr<c10::StorageImpl>`). `c10::intrusive_ptr` has a `c10::weak_intrusive_ptr` class for taking non-owning references to the underlying object. The implementation involves both a strong refcount and a weak refcount in `c10::intrusive_ptr`. If the strong refcount of an intrusive_ptr goes to zero and there are no weak references, then everything is deleted. However, if there is a weak reference, then the intrusive_ptr calls `release_resources()` but does not delete.

This has the effect of freeing the underlying resources (ensuring that program semantics are unchanged) but leaves behind an empty shell of an `intrusive_ptr` that the `weak_intrusive_ptr`s use to check status. And herein lies the solution: as long as we hold a weak reference to a TensorImpl we will block deletion and prevent the `TensorImpl*` from being reused.

This PR uses a `c10::weak_intrusive_ptr<c10::TensorImpl>` to store the address of profiled TensorImpls and then converts it to a raw pointer (or rather, a `TensorImplAddress`) during post processing when we no longer care about blocking address reuse.

Differential Revision: [D40492848](https://our.internmc.facebook.com/intern/diff/D40492848/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87244
Approved by: https://github.com/slgong-fb, https://github.com/albanD
2022-10-27 06:38:11 +00:00
4b23905172 [torch] Add torch cpp cpu target for torch/csrc/api/src files (#87327)
Summary: Duplicating fbcode target `fbcode//caffe2:torch-cpp-cpu` target in xplat. In D40460749 our user wants to use `torch::kNearest` enum which is defined in `torch/csrc/api/src/enum.cpp`. Adding this target to support it.

Test Plan: Rely on CI

Differential Revision: D40532087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87327
Approved by: https://github.com/ezyang
2022-10-27 06:04:22 +00:00
bf113e38fa use nv_diag_suppress (#87712)
Fixes:
```
/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/cuda/UnaryFractionKernels.cu(125): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead

/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/cuda/UnaryFractionKernels.cu(125): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead

/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu(73): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead

/dev/shm/rbarnes/tempfs/pytorch/aten/src/ATen/native/sparse/cuda/SparseMatMul.cu(73): warning #20236-D: pragma "diag_suppress" is deprecated, use "nv_diag_suppress" instead
```

cc @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87712
Approved by: https://github.com/soumith
2022-10-27 05:15:16 +00:00
107f92a683 [FSDP] ufmt FSDP test (#87812)
This applies `ufmt` to all of the FSDP test files in the `test/distributed/fsdp/` directory.

**Test Plan**
CI

**Notes**
For VSCode users,
- Install `ufmt`: https://pypi.org/project/ufmt/
- Install VSCode `ufmt` extension: https://marketplace.visualstudio.com/items?itemName=omnilib.ufmt
- Include in `settings.json`:
```
{
    "[python]": {
        "editor.defaultFormatter": "omnilib.ufmt",
        "editor.formatOnSave": true,
    },
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87812
Approved by: https://github.com/rohan-varma
2022-10-27 04:25:55 +00:00
e3cf81e0a7 [FSDP] ufmt /fsdp (#87811)
This applies `ufmt` to all of the FSDP files in the `torch/distributed/fsdp/` directory.

**Test Plan**
CI

**Notes**
For VSCode users,
- Install `ufmt`: https://pypi.org/project/ufmt/
- Install VSCode `ufmt` extension: https://marketplace.visualstudio.com/items?itemName=omnilib.ufmt
- Include in `settings.json`:
```
{
    "[python]": {
        "editor.defaultFormatter": "omnilib.ufmt",
        "editor.formatOnSave": true,
    },
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87811
Approved by: https://github.com/rohan-varma, https://github.com/fegin
2022-10-27 04:25:55 +00:00
49ce3ed14c [vision hash update] update the pinned vision hash (#87831)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87831
Approved by: https://github.com/pytorchbot
2022-10-27 04:23:45 +00:00
21bef8e944 fix sym_storage conversion and some cleanup (#87718)
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87718
Approved by: https://github.com/ezyang
2022-10-27 02:45:18 +00:00
58650835bb [fx][subgraph_rewriter] Change match_filter to be a List in replace_pattern_with_filters (#87257)
Summary:
att, this is an experimental API so not marking it as BC-breaking.
The match will be accepted only if all the filters in the list pass.
Changing the filter arg to be a list also allows us to pass in an empty list, meaning no filter, which makes user code cleaner.
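A hedged usage sketch of the list-valued filter argument (the pattern/replacement functions here are made up for illustration): passing an empty list means no filtering, and a match is accepted only if every filter in the list passes.

```py
import torch
from torch.fx import symbolic_trace, subgraph_rewriter


class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1


def pattern(x):
    return torch.relu(x)


def replacement(x):
    return torch.sigmoid(x)


gm = symbolic_trace(M())
# Empty list == no filters; every structural match of `pattern` is rewritten.
subgraph_rewriter.replace_pattern_with_filters(gm, pattern, replacement, [])
```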

Test Plan:
python test/test_fx.py -k test_replace_pattern_with_filters

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87257
Approved by: https://github.com/SherlockNoMad
2022-10-27 01:59:19 +00:00
195a13f48c [quant][be] Remove unused function quantize_node (#87153)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87153
Approved by: https://github.com/andrewor14
2022-10-27 01:50:00 +00:00
30ea8f5c20 Limit ROCM option to Linux only (#87833)
As it's not available on neither Windows nor MacOS

cc @jeffdaily @sunway513 @jithunnair-amd @ROCmSupport
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87833
Approved by: https://github.com/kit1980
2022-10-27 01:24:03 +00:00
0e3b5ea026 [quant][fx] Add _convert_to_reference_decomposed (#87094)
Summary:
_convert_to_reference_decomposed is a private convert function in fx graph mode quantization flow to convert
a calibrated/trained model to a reference quantized model with decomposed quantized tensor representations.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test__convert_to_reference_decomposed_fx

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87094
Approved by: https://github.com/andrewor14
2022-10-27 01:22:08 +00:00
a12d3d6b49 [profiler] Standard performance event names for the profiler (#87538)
Summary: The goal is to create a hardware/backend independent event abstraction on which a standard set of tooling can be developed.

Test Plan: CI

Reviewed By: kimishpatel

Differential Revision: D40238034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87538
Approved by: https://github.com/salilsdesai, https://github.com/kirklandsign
2022-10-27 00:59:40 +00:00
2cc624cd43 Enable mypy check for distributed.py, and fix type errors (#87543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87543
Approved by: https://github.com/fduwjj
2022-10-27 00:22:54 +00:00
5dbd80a605 [pytorch] Layer norm backward speed gain with warp shuffles (#87814)
Summary:
Improved native layer norm backward performance.

Rewrote `GammaBetaBackwardCUDAKernel` to use shared memory only for the reduction step, but not for loading `mean` and `rstd`. The previous implementation used only `threadIdx.x = 0` to load `mean` and `rstd` into shared memory, and then all threads would access the values in order to do loop unrolling. This approach increased register usage and decreased occupancy, without much benefit from using shared memory (this is because the values were already cached in L1). The new implementation is simpler and register usage is smaller, thus occupancy is better.

Added another implementation called `GammaBetaBackwardCUDAKernel_32x32` which is only for shapes dividing exactly to a (32 x 32) block. This permits using warp shuffles for speeding up loading `mean` and `rstd` as well as for the final reduction stage. The effective bandwidth of this implementation is equal to STREAM Triad.

Observed that we can get additional benefit if we lower the threshold for calling `GammaBetaBackwardSimpleCUDAKernel` (simple col-wise reduction implementation) from `512` to `128`.

Test Plan:
Wrote a simple CUDA app that calls the previous implementation of `GammaBetaBackwardCUDAKernel` and the current one, using FP32 values and compares the results. The epsilon value we used for FP comparison is 0.00001 for the weight and 0.0001 for the bias.
Ran the benchmark for various sizes on an A100 GPU and got the results below. Almost all sizes show good speedup.

```
Size (32, 32); Mismatches: dg = 0 db = 0 out of 32. reference = 0.0073 (ms); optimized = 0.0071 (ms); bw_opt = 1.14 GB/s; speedup = 2.68%
Size (64, 32); Mismatches: dg = 0 db = 0 out of 32. reference = 0.0107 (ms); optimized = 0.0107 (ms); bw_opt = 1.50 GB/s; speedup = 0.22%
Size (256, 128); Mismatches: dg = 0 db = 0 out of 128. reference = 0.0323 (ms); optimized = 0.0075 (ms); bw_opt = 32.89 GB/s; speedup = 330.16%
Size (512, 1024); Mismatches: dg = 0 db = 0 out of 1024. reference = 0.0103 (ms); optimized = 0.0089 (ms); bw_opt = 440.54 GB/s; speedup = 15.82%
Size (1024, 2048); Mismatches: dg = 0 db = 0 out of 2048. reference = 0.0197 (ms); optimized = 0.0136 (ms); bw_opt = 1151.44 GB/s; speedup = 44.91%
Size (2048, 2048); Mismatches: dg = 0 db = 0 out of 2048. reference = 0.0416 (ms); optimized = 0.0283 (ms); bw_opt = 1105.31 GB/s; speedup = 47.01%
Size (4096, 16384); Mismatches: dg = 0 db = 0 out of 16384. reference = 0.4420 (ms); optimized = 0.3915 (ms); bw_opt = 1277.58 GB/s; speedup = 12.90%
Size (70000, 64); Mismatches: dg = 0 db = 0 out of 64. reference = 0.5908 (ms); optimized = 0.6850 (ms); bw_opt = 49.49 GB/s; speedup = -13.75%
Size (131072, 512); Mismatches: dg = 0 db = 0 out of 512. reference = 1.1961 (ms); optimized = 0.9234 (ms); bw_opt = 542.54 GB/s; speedup = 29.53%
Size (1000, 520); Mismatches: dg = 0 db = 0 out of 520. reference = 0.0132 (ms); optimized = 0.0113 (ms); bw_opt = 343.83 GB/s; speedup = 16.88%
Size (4005, 4005); Mismatches: dg = 0 db = 0 out of 4005. reference = 0.1441 (ms); optimized = 0.1054 (ms); bw_opt = 1134.36 GB/s; speedup = 36.71%
Size (10000, 1000); Mismatches: dg = 0 db = 0 out of 1000. reference = 0.1293 (ms); optimized = 0.1248 (ms); bw_opt = 597.71 GB/s; speedup = 3.63%
Size (1024, 10000); Mismatches: dg = 0 db = 0 out of 10000. reference = 0.0738 (ms); optimized = 0.0735 (ms); bw_opt = 1039.40 GB/s; speedup = 0.45%
Size (8192, 4096); Mismatches: dg = 0 db = 0 out of 4096. reference = 0.2673 (ms); optimized = 0.2223 (ms); bw_opt = 1125.01 GB/s; speedup = 20.25%
Size (10000, 10000); Mismatches: dg = 0 db = 0 out of 10000. reference = 0.7331 (ms); optimized = 0.8940 (ms); bw_opt = 833.54 GB/s; speedup = -18.00%
Size (3072, 10000); Mismatches: dg = 0 db = 0 out of 10000. reference = 0.2087 (ms); optimized = 0.2364 (ms); bw_opt = 968.64 GB/s; speedup = -11.71%
Size (6144, 10000); Mismatches: dg = 0 db = 0 out of 10000. reference = 0.4197 (ms); optimized = 0.5118 (ms); bw_opt = 894.63 GB/s; speedup = -18.00%
Size (1024, 20000); Mismatches: dg = 0 db = 0 out of 20000. reference = 0.1480 (ms); optimized = 0.1297 (ms); bw_opt = 1177.68 GB/s; speedup = 14.12%
Size (1024, 20000); Mismatches: dg = 0 db = 0 out of 20000. reference = 0.1483 (ms); optimized = 0.1278 (ms); bw_opt = 1195.26 GB/s; speedup = 16.04%
Size (512, 1536); Mismatches: dg = 0 db = 0 out of 1536. reference = 0.0104 (ms); optimized = 0.0091 (ms); bw_opt = 646.72 GB/s; speedup = 14.44%
Size (512, 6144); Mismatches: dg = 0 db = 0 out of 6144. reference = 0.0219 (ms); optimized = 0.0156 (ms); bw_opt = 1506.30 GB/s; speedup = 40.52%
Size (512, 10240); Mismatches: dg = 0 db = 0 out of 10240. reference = 0.0424 (ms); optimized = 0.0370 (ms); bw_opt = 1057.84 GB/s; speedup = 14.63%
Size (1000, 1000); Mismatches: dg = 0 db = 0 out of 1000. reference = 0.0139 (ms); optimized = 0.0119 (ms); bw_opt = 627.51 GB/s; speedup = 16.83%
Size (2000, 2000); Mismatches: dg = 0 db = 0 out of 2000. reference = 0.0421 (ms); optimized = 0.0412 (ms); bw_opt = 724.10 GB/s; speedup = 2.20%
Size (10240, 10240); Mismatches: dg = 0 db = 0 out of 10240. reference = 0.7210 (ms); optimized = 0.6098 (ms); bw_opt = 1281.40 GB/s; speedup = 18.24%
Size (384, 128); Mismatches: dg = 0 db = 0 out of 128. reference = 0.0449 (ms); optimized = 0.0089 (ms); bw_opt = 41.50 GB/s; speedup = 403.48%
Size (2048, 1024); Mismatches: dg = 0 db = 0 out of 1024. reference = 0.0208 (ms); optimized = 0.0169 (ms); bw_opt = 925.70 GB/s; speedup = 23.13%
Size (267, 513); Mismatches: dg = 0 db = 0 out of 513. reference = 0.0342 (ms); optimized = 0.0090 (ms); bw_opt = 114.18 GB/s; speedup = 280.64%
Size (67, 123479); Mismatches: dg = 0 db = 0 out of 123479. reference = 0.0562 (ms); optimized = 0.0552 (ms); bw_opt = 1133.46 GB/s; speedup = 1.81%
Size (1024, 123479); Mismatches: dg = 0 db = 0 out of 123479. reference = 0.8573 (ms); optimized = 0.9245 (ms); bw_opt = 1020.02 GB/s; speedup = -7.27%
Size (2048, 66679); Mismatches: dg = 0 db = 0 out of 66679. reference = 0.8778 (ms); optimized = 0.8590 (ms); bw_opt = 1185.05 GB/s; speedup = 2.19%
Size (200, 256); Mismatches: dg = 0 db = 0 out of 256. reference = 0.0215 (ms); optimized = 0.0066 (ms); bw_opt = 58.49 GB/s; speedup = 226.81%
Size (1000, 256); Mismatches: dg = 0 db = 0 out of 256. reference = 0.0109 (ms); optimized = 0.0092 (ms); bw_opt = 208.27 GB/s; speedup = 18.65%
Size (6000, 256); Mismatches: dg = 0 db = 0 out of 256. reference = 0.0394 (ms); optimized = 0.0301 (ms); bw_opt = 381.90 GB/s; speedup = 30.98%
Size (6272, 256); Mismatches: dg = 0 db = 0 out of 256. reference = 0.0403 (ms); optimized = 0.0300 (ms); bw_opt = 400.48 GB/s; speedup = 34.34%
Size (200, 512); Mismatches: dg = 0 db = 0 out of 512. reference = 0.0218 (ms); optimized = 0.0066 (ms); bw_opt = 116.33 GB/s; speedup = 229.96%
Size (1000, 512); Mismatches: dg = 0 db = 0 out of 512. reference = 0.0110 (ms); optimized = 0.0094 (ms); bw_opt = 407.29 GB/s; speedup = 17.26%
Size (6000, 512); Mismatches: dg = 0 db = 0 out of 512. reference = 0.0535 (ms); optimized = 0.0594 (ms); bw_opt = 386.05 GB/s; speedup = -9.95%
Size (6272, 512); Mismatches: dg = 0 db = 0 out of 512. reference = 0.0573 (ms); optimized = 0.0387 (ms); bw_opt = 619.62 GB/s; speedup = 48.06%
Size (200, 1024); Mismatches: dg = 0 db = 0 out of 1024. reference = 0.0221 (ms); optimized = 0.0069 (ms); bw_opt = 222.78 GB/s; speedup = 220.76%
Size (1000, 1024); Mismatches: dg = 0 db = 0 out of 1024. reference = 0.0113 (ms); optimized = 0.0097 (ms); bw_opt = 787.79 GB/s; speedup = 16.46%
Size (6000, 1024); Mismatches: dg = 0 db = 0 out of 1024. reference = 0.0723 (ms); optimized = 0.0715 (ms); bw_opt = 640.95 GB/s; speedup = 1.10%
Size (6272, 1024); Mismatches: dg = 0 db = 0 out of 1024. reference = 0.0751 (ms); optimized = 0.0572 (ms); bw_opt = 837.57 GB/s; speedup = 31.30%
Size (200, 1536); Mismatches: dg = 0 db = 0 out of 1536. reference = 0.0232 (ms); optimized = 0.0071 (ms); bw_opt = 323.97 GB/s; speedup = 226.51%
Size (1000, 1536); Mismatches: dg = 0 db = 0 out of 1536. reference = 0.0125 (ms); optimized = 0.0114 (ms); bw_opt = 1005.84 GB/s; speedup = 9.62%
Size (6000, 1536); Mismatches: dg = 0 db = 0 out of 1536. reference = 0.0807 (ms); optimized = 0.0830 (ms); bw_opt = 828.02 GB/s; speedup = -2.76%
Size (6272, 1536); Mismatches: dg = 0 db = 0 out of 1536. reference = 0.0836 (ms); optimized = 0.0695 (ms); bw_opt = 1033.62 GB/s; speedup = 20.27%
Size (200, 2048); Mismatches: dg = 0 db = 0 out of 2048. reference = 0.0224 (ms); optimized = 0.0075 (ms); bw_opt = 408.58 GB/s; speedup = 198.10%
Size (1000, 2048); Mismatches: dg = 0 db = 0 out of 2048. reference = 0.0165 (ms); optimized = 0.0135 (ms); bw_opt = 1132.42 GB/s; speedup = 22.26%
Size (6000, 2048); Mismatches: dg = 0 db = 0 out of 2048. reference = 0.0993 (ms); optimized = 0.0989 (ms); bw_opt = 926.35 GB/s; speedup = 0.41%
Size (6272, 2048); Mismatches: dg = 0 db = 0 out of 2048. reference = 0.1033 (ms); optimized = 0.0826 (ms); bw_opt = 1159.55 GB/s; speedup = 25.09%
Size (200, 3072); Mismatches: dg = 0 db = 0 out of 3072. reference = 0.0230 (ms); optimized = 0.0076 (ms); bw_opt = 605.09 GB/s; speedup = 202.51%
Size (1000, 3072); Mismatches: dg = 0 db = 0 out of 3072. reference = 0.0207 (ms); optimized = 0.0213 (ms); bw_opt = 1076.45 GB/s; speedup = -2.69%
Size (6000, 3072); Mismatches: dg = 0 db = 0 out of 3072. reference = 0.1198 (ms); optimized = 0.1274 (ms); bw_opt = 1078.58 GB/s; speedup = -5.95%
Size (6272, 3072); Mismatches: dg = 0 db = 0 out of 3072. reference = 0.1293 (ms); optimized = 0.1189 (ms); bw_opt = 1207.95 GB/s; speedup = 8.76%

Average speedup = 52.88%
```

For additional numerical validation used the following script:

```
def run_model_on_device(fs, X, gO, device_string, numeric_type):
    ln = torch.nn.LayerNorm((fs,), device=device_string, dtype=numeric_type)
    ln.reset_parameters()
    X.grad = None
    ln.zero_grad(set_to_none=True)
    out = ln(X)
    out.backward(gO)
    return (ln.weight.grad, ln.bias.grad)

def run_correctness_test(eps_weight, eps_bias):
    dtype = torch.float
    for fs in (512, 1024, 2048, 4096, 8192, 10000, 500, 1000, 2001, 4005, 8117):
        for bs in (512, 1024, 2048, 4096, 525, 1033, 2064, 3000):
            mean_adjustment = torch.randn(fs, device="cpu", dtype=torch.float)
            X = mean_adjustment * torch.randn(
                bs, fs, device="cpu", dtype=torch.float, requires_grad=True
            )

            X = X.detach().requires_grad_()
            gO = torch.rand_like(X)
            X_gpu = X.to("cuda")
            X_gpu = X_gpu.detach().requires_grad_()
            gO_gpu = gO.to("cuda")
            gO_gpu = gO_gpu.detach().requires_grad_()

            grad_cpu_ref = run_model_on_device(fs, X, gO, "cpu", dtype)
            grad_gpu = run_model_on_device(fs, X_gpu, gO_gpu, "cuda", dtype)
            weight_grad_gpu_target = grad_gpu[0].detach().to("cpu")
            bias_grad_gpu_target = grad_gpu[1].detach().to("cpu")

            weight_delta = torch.abs(grad_cpu_ref[0] - weight_grad_gpu_target)
            weight_mismatches = (weight_delta >= eps_weight).nonzero()
            weight_mismatch_pct = len(weight_mismatches) / len(weight_delta) * 100

            bias_delta = torch.abs(grad_cpu_ref[1] - bias_grad_gpu_target)
            bias_mismatches = (bias_delta >= eps_bias).nonzero()
            bias_mismatch_pct = len(bias_mismatches) / len(bias_delta) * 100

            print(
                "Size ({} x {}) mismatch percentage: weight {:3.2f} bias {:3.2f}".format(
                    fs, bs, weight_mismatch_pct, bias_mismatch_pct
                )
            )
```

`NVFuserTest.FusionMagicSchedulerLayerNormBackward_CUDA` test also does additional numerical validation and it passes.

Differential Revision: D40730981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87814
Approved by: https://github.com/weiwangmeta
2022-10-27 00:18:19 +00:00
449778a939 Fix typos under .github directory (#87828)
This PR fixes typos in `.md` files under .github directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87828
Approved by: https://github.com/clee2000
2022-10-27 00:01:10 +00:00
2c66889f90 Synchronize before change cuda stream (#82050) (#82056)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/82050

Need to synchronize before changing the CUDA stream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82056
Approved by: https://github.com/ngimel
2022-10-26 23:44:13 +00:00
59b9d29260 [primTorch] Check error_regex in test_python_ref_errors (#86987)
cc @ezyang @mruberry @ngimel @Lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86987
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-26 23:34:34 +00:00
5ee5f5ac1b [BE] Don't build CUDA-10.2 docker images (#87819)
As CUDA-10.2 should not longer be used in CI/CD

Test Plan: ` grep cuda10.2 .github -R|grep -v mock`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87819
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi
2022-10-26 23:16:29 +00:00
3208c2f6bd Add logging for nested tensor usage tracking (#87632)
# Summary
Add logging message so that we can track nested tensor adoption.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87632
Approved by: https://github.com/cpuhrsch
2022-10-26 22:42:41 +00:00
536474e823 [LTC] Remove tensor.storage_ (#87645)
Summary:
Since LTC now supports functionalization, we don't need to fake a storage to support is_alias_of anymore. Let's remove it.

Test Plan:
 ./build/bin/test_lazy --gtest_filter=LazyOpsTest.IsAliasOf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87645
Approved by: https://github.com/JackCaoG, https://github.com/bdhirsh
2022-10-26 22:41:19 +00:00
5edbc92683 print stderr for ghstack rebase (#87795)
current output tends to be empty on failure, which makes it hard to debug
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87795
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2022-10-26 22:10:10 +00:00
91c95ff7c5 Enable graph_split_inductor test as it runs now (#87762)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87762
Approved by: https://github.com/davidberard98
2022-10-26 22:06:03 +00:00
53c640a528 [CI] Delete nnpack installation from conda (#87813)
Not sure why it was there to begin with, and I really hope none of our CI depends on this package, which was last updated 5 years ago; see https://anaconda.org/killeent/nnpack

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87813
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/ZainRizvi
2022-10-26 21:51:13 +00:00
1522946882 Simplify installation instruction in contributing file (#87460)
Simplification of one of the installation instructions in CONTRIBUTING.md that I found tricky to parse at first.

Also adds a link to the "Make no-op build fast" section to make it easier to navigate to.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87460
Approved by: https://github.com/ngimel
2022-10-26 21:34:13 +00:00
adb76ef510 Expose API for backward execution order (#87507)
In this PR:
- graph_task stores graph roots on construction so that we can later traverse through the graph
- before the nodes are returned, they needed to be converted from raw_ptr to shared_ptr, and this should be OK because the graph is guaranteed to be alive

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87507
Approved by: https://github.com/albanD
2022-10-26 21:28:45 +00:00
926827b89c Revert "Disable linux-bionic-py3_7-clang8-xla-test (#87737)"
This reverts commit 21f7e7d040c646b4ce7f4a4e973da97660462bdc.

Reverted https://github.com/pytorch/pytorch/pull/87737 on behalf of https://github.com/kit1980 due to Re-enable XLA tests after https://github.com/pytorch/pytorch/pull/87818
2022-10-26 21:01:09 +00:00
71933d381b [ao] Fixing tests for block pruning shapes (#87326)
The current unittests were only checking the tensors whose shapes were already multiples of the block size. That caused some hidden bugs to creep in. Specifically, for the shapes that would require padding for the mask/data, the sparsifier would try to apply shape-mismatching tensors onto each other. This caused segfaults as well as silent failures.

This makes minor adjustments to the code to make sure the masks and data shapes are aligned, as well as fixing the tests to catch this.

Test Plan:

```python
python test/test_ao_sparsity.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87326
Approved by: https://github.com/jcaip
2022-10-26 20:55:14 +00:00
1168f42790 Update XLA hash (#87818)
This is a re-creation of https://github.com/pytorch/pytorch/pull/87808 so we don't have to wait.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87818
Approved by: https://github.com/clee2000
2022-10-26 20:54:25 +00:00
bbcd4b2f2f Clean up CPU test in test_torchinductor.py for fbcode (#87783)
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87783
Approved by: https://github.com/bertmaher
2022-10-26 20:47:14 +00:00
88eff10722 [ONNX] Deprecate operators.py (#87798)
Deprecate `torch.onnx.operators` because it's only for backwards compatibility
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87798
Approved by: https://github.com/BowenBao
2022-10-26 20:42:06 +00:00
b21fe312c0 Fix meta for index_add and index_put (#87775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87775
Approved by: https://github.com/ezyang, https://github.com/ngimel
2022-10-26 20:33:23 +00:00
8016fd9eb1 Set check-latest to false when setup python and pip cache in CI (#87621)
I missed the fine print in https://github.com/actions/setup-python/blob/main/README.md#caching-packages-dependencies when setting up the cache using setup-python GHA

> Restored cache will not be used if the requirements.txt file is not updated for a long time and a newer version of the dependency is available which can lead to an increase in total build time.

The latter part is important because it implies that even with the cache, pip will still try to check if a newer version exists and that part can be flaky, i.e. https://github.com/pytorch/pytorch/actions/runs/3313764038/jobs/5472180293

This undesired behavior can be turned off by setting the advanced option `check-latest` to false (https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#check-latest-version). Per my understanding, this should tell pip install in these workflows to use the local cached copy of the package, avoiding the need to query pypi every single time.

`check-latest` was added quite recently https://github.com/actions/setup-python/pull/406, so `actionlint-1.6.15` fails to recognize it. Thus, this PR also upgrades `actionlint` to the latest 1.6.21 to pass the linter check. Here is an example error from 1.6.15 from https://github.com/pytorch/pytorch/actions/runs/3315388073/jobs/5475918454:

```
>>> Lint for .github/workflows/lint.yml:

  Error (ACTIONLINT) [action]
    input "check-latest" is not defined in action "actions/setup-python@v4".
    available inputs are "architecture", "cache", "cache-dependency-path",
    "python-version", "python-version-file", "token"

         25  |        with:
         26  |          python-version: 3.8
         27  |          architecture: x64
    >>>  28  |          check-latest: false
         29  |          cache: pip
         30  |          cache-dependency-path: |
         31  |            **/.github/requirements-gha-cache.txt
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87621
Approved by: https://github.com/ZainRizvi
2022-10-26 20:08:29 +00:00
5f4329134e Revert "Set check-latest to false when setup python and pip cache in CI (#87621)"
This reverts commit 4080b1db284fd531654bcb2984a7fe0ff3b310cd.

Reverted https://github.com/pytorch/pytorch/pull/87621 on behalf of https://github.com/huydhn due to Somehow setup-python treats Python 3.10 as Python 3.1 in pr-label.yml. I missed this signal because this is only run at push
2022-10-26 19:40:53 +00:00
38dd4cbdf1 ROCm enable sparse_sampled_addmm (#86401)
Enables:
test_comprehensive_sparse_sampled_addmm_cuda_complex128
test_comprehensive_sparse_sampled_addmm_cuda_complex64
test_comprehensive_sparse_sampled_addmm_cuda_float32
test_comprehensive_sparse_sampled_addmm_cuda_float64
test_dispatch_meta_sparse_sampled_addmm_cuda_complex128
test_dispatch_meta_sparse_sampled_addmm_cuda_complex64
test_dispatch_meta_sparse_sampled_addmm_cuda_float32
test_dispatch_meta_sparse_sampled_addmm_cuda_float64
test_meta_sparse_sampled_addmm_cuda_complex128
test_meta_sparse_sampled_addmm_cuda_complex64
test_meta_sparse_sampled_addmm_cuda_float32
test_meta_sparse_sampled_addmm_cuda_float64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86401
Approved by: https://github.com/ngimel
2022-10-26 19:39:24 +00:00
123b103bf1 Add dynamo_optimize_ddp arg to dist bench (#87768)
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87768
Approved by: https://github.com/davidberard98
2022-10-26 19:29:35 +00:00
aa66c6e01e Fix missing weight init and clean up helper (#87760)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87760
Approved by: https://github.com/davidberard98
2022-10-26 19:29:35 +00:00
58dc95b321 Fix typos under aten directory (#87754)
This PR fixes typos in `.md` files under aten directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87754
Approved by: https://github.com/kit1980
2022-10-26 19:29:05 +00:00
4080b1db28 Set check-latest to false when setup python and pip cache in CI (#87621)
I missed the fine print in https://github.com/actions/setup-python/blob/main/README.md#caching-packages-dependencies when setting up the cache using setup-python GHA

> Restored cache will not be used if the requirements.txt file is not updated for a long time and a newer version of the dependency is available which can lead to an increase in total build time.

The latter part is important because it implies that even with the cache, pip will still try to check if a newer version exists and that part can be flaky, i.e. https://github.com/pytorch/pytorch/actions/runs/3313764038/jobs/5472180293

This undesired behavior can be turned off by setting the advanced option `check-latest` to false (https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#check-latest-version). Per my understanding, this should tell pip install in these workflows to use the local cached copy of the package, avoiding the need to query pypi every single time.

`check-latest` was added quite recently https://github.com/actions/setup-python/pull/406, so `actionlint-1.6.15` fails to recognize it. Thus, this PR also upgrades `actionlint` to the latest 1.6.21 to pass the linter check. Here is an example error from 1.6.15 from https://github.com/pytorch/pytorch/actions/runs/3315388073/jobs/5475918454:

```
>>> Lint for .github/workflows/lint.yml:

  Error (ACTIONLINT) [action]
    input "check-latest" is not defined in action "actions/setup-python@v4".
    available inputs are "architecture", "cache", "cache-dependency-path",
    "python-version", "python-version-file", "token"

         25  |        with:
         26  |          python-version: 3.8
         27  |          architecture: x64
    >>>  28  |          check-latest: false
         29  |          cache: pip
         30  |          cache-dependency-path: |
         31  |            **/.github/requirements-gha-cache.txt
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87621
Approved by: https://github.com/ZainRizvi
2022-10-26 19:23:55 +00:00
2c1efe7472 Enable some PyTorch core tests with inductor (#87490)
Summary:
1) Graph break on torch.random.set_rng_state since it blocks running
inductor core tests;
2) Add several inductor-specific skips;
3) Enable several core tests for inductor CI;

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87490
Approved by: https://github.com/eellison
2022-10-26 18:58:33 +00:00
f7a04f310b [ao][ns] Replacing List[QConfigMapping] in PNP (#86922)
Summary: Added QConfigMultiMapping which is essentially a
List[QConfigMapping] with set methods and dedicated handling to
avoid unwanted matches and improve UX.

Note: the `from __future__ import annotations` line caused weird errors when the
`QConfigMultiMapping` class was put in `_numeric_suite_fx.py`, so it was moved.

Test Plan: python test/test_quantization.py TestFxNumericSuiteNShadows

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86922
Approved by: https://github.com/vkuzo
2022-10-26 18:56:53 +00:00
9639cb83eb Revert "[pytorch] Layer norm backward speed gain with warp shuffles (#87445)"
This reverts commit b6f28334bc3276a56d79dea6cb7ed99411556348.

Reverted https://github.com/pytorch/pytorch/pull/87445 on behalf of https://github.com/weiwangmeta due to breaking internal builds due to MS compiler
2022-10-26 18:51:38 +00:00
585d71513d Add type annotations to distribution.py (#87577)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87577
Approved by: https://github.com/kit1980
2022-10-26 18:50:48 +00:00
16e35bd179 Adding expm1 to MPS (#87147)
Fixes #86744

- Implementing the new `expm1_out_mps` function in `aten/src/ATen/native/mps/operations/UnaryOps.mm`
- Adding it to `aten/src/ATen/native/native_functions.yaml`
- Adding it to existing `test.test_mps.TestNLLLoss.test_unary_ops`
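For reference, a short sketch of the op itself (shown on CPU for portability; this PR adds the MPS kernel): `expm1(x)` computes `exp(x) - 1` with better accuracy for small `x`.

```py
import torch

x = torch.tensor([1e-8, 0.0, 1.0])
print(torch.expm1(x))    # accurate even for tiny x
print(torch.exp(x) - 1)  # loses precision for tiny x
# On a machine with MPS support, the same call works on the "mps" device:
# torch.expm1(x.to("mps"))
```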

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87147
Approved by: https://github.com/kulinseth
2022-10-26 17:45:46 +00:00
493ff6ac5b Install py for pytest-sugar (#87803)
linux-focal-py3.7-clang10-onnx / test is failing; the issue is https://github.com/Teemu/pytest-sugar/issues/241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87803
Approved by: https://github.com/seemethere, https://github.com/huydhn
2022-10-26 17:43:35 +00:00
e2e428b03c Remove custom Ceil in favor of sympy.ceiling (#87294)
[Alban]: the other changes that used to be in this PR (neg and fix for true div) are moved to other places where they already exist. Namely neg is already in master and true div will be in the next PR on the stack where all other functions are fixed at the same time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87294
Approved by: https://github.com/ezyang
2022-10-26 17:33:53 +00:00
777e6a2c51 Many symintifications (#87604)
Adds
expand_inplace
conv conv_double_backward
convolution
adaptive_avg_pool2d_symint
_embedding_bag_backward_symint
cudnn_grid_sampler
cuda 32 bit indexing
nll_loss / nll_loss_2d
tensor split
pooling same mode
cudnn_is_acceptable
storage nbytes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87604
Approved by: https://github.com/ezyang
2022-10-26 17:33:53 +00:00
ae4fbac819 Enable nvprims.transpose fusions for nvFuser (#86967)
This PR allows transposes to be fused with other operations. If a fusion group is formed only from operations that just manipulate metadata in PyTorch (transpose, view, etc.) then this group is not sent to nvFuser.
On top of that, if we have converted to `nvprims` but then decided not to form a fusion group, we modify the graph to use the `prim.impl_aten` attribute instead of calling `prim(*args, **kwargs)`, which has higher overhead.

cc @kevinstephano @jjsjann123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86967
Approved by: https://github.com/jjsjann123, https://github.com/SherlockNoMad
2022-10-26 17:00:07 +00:00
ac0c13f665 Revert "[ROCm] Use -rpath-link to fix libtinfo conflict (#83552)"
This reverts commit a10446c4d826ae5505fa129ea9800d3924b25364.

Reverted https://github.com/pytorch/pytorch/pull/83552 on behalf of https://github.com/kit1980 due to Broke ios/macos builds https://github.com/pytorch/pytorch/actions/runs/3329991911/jobs/5507911292
2022-10-26 16:43:13 +00:00
701b3dd773 optim utils all_gather_into_tensor (#87769)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87769
Approved by: https://github.com/awgu
2022-10-26 16:20:46 +00:00
642b63e1e7 Add test that import torch doesn't modify global logging state (#87629)
Fixes https://github.com/pytorch/pytorch/issues/87626

Also adds the same test for `import functorch`. Users have complained to us when we do modify the global logging state, which has happened in the past.
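
A minimal sketch of the kind of check this adds (illustrative only; the PR's actual test may be structured differently):

```py
import logging
import subprocess
import sys

# Run the import in a fresh interpreter and assert that the root logger is
# untouched: default WARNING level and no handlers installed.
code = (
    "import logging, torch; "
    "assert logging.getLogger().level == logging.WARNING; "
    "assert not logging.getLogger().handlers"
)
subprocess.run([sys.executable, "-c", code], check=True)
```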

Test Plan:
- tested locally; I added `logging.basicConfig` to `torch/__init__.py`
and checked that the test got triggered
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87629
Approved by: https://github.com/albanD
2022-10-26 15:53:28 +00:00
422f946b8c [FSDP][BE] Improve the assert message of sharded load_state_dict (#87486)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87486
Approved by: https://github.com/awgu
2022-10-26 15:51:54 +00:00
c2ef5c4f7e [ROCm] Move ROCm CI build to python 3.8 version (#86677)
Currently it is Python 3.7; we want to upgrade to Python 3.8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86677
Approved by: https://github.com/malfet
2022-10-26 15:34:38 +00:00
775fef51b7 Implement copy_, fill_, and ones_like for Nested Tensors backends (#87728)
Summary: This diff implements copy_ in order to allow pinned memory transfers for nested tensors, as well as fill_ and ones_like, to test whether nested tensors can be created with other factory functions.
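
A hedged illustration of the newly supported ops (not taken from the diff; `torch.nested.nested_tensor` is assumed to be available):

```py
import torch

# Build two nested tensors with the same nested structure.
src = torch.nested.nested_tensor([torch.randn(2, 3), torch.randn(4, 3)])
dst = torch.nested.nested_tensor([torch.zeros(2, 3), torch.zeros(4, 3)])

ones = torch.ones_like(src)  # factory function on a nested tensor
dst.copy_(ones)              # copy_ between nested tensors (e.g. for pinned-memory transfers)
dst.fill_(0.5)               # in-place fill_
```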

Test Plan: Pass all CI and sandcastle jobs.

Reviewed By: mikekgfb

Differential Revision: D40689594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87728
Approved by: https://github.com/cpuhrsch
2022-10-26 14:48:27 +00:00
a10446c4d8 [ROCm] Use -rpath-link to fix libtinfo conflict (#83552)
Fixes an issue building PyTorch for ROCm 5.3 and above on Ubuntu 20.04, where libtinfo6 from conda conflicts with the one from the distro, causing symbol-not-found errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83552
Approved by: https://github.com/malfet
2022-10-26 14:40:29 +00:00
ed7a8ab436 [Static Runtime] Make canEnableStaticRuntime examine sub-blocks (#87396)
Summary:
Someone was running into problems where

1) Static Runtime enablement would fail
2) We would try to fall back to the JIT interpreter *after trying to create `StaticModule`*
3) The fallback fails because Static Runtime mangled the graph.

We don't want to prevent Static Runtime from mutating its input due to memory concerns. The intent of `canEnableStaticRuntime` is to catch issues in the module before Static Runtime messes with it.

With this diff, `StaticModule` instantiation can be avoided by querying `canEnableStaticRuntime` and the issue is fixed.

Test Plan: New unit test

Differential Revision: D40564452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87396
Approved by: https://github.com/tenpercent
2022-10-26 14:34:29 +00:00
72f446b9bc Remove getitem special handling in the partitioner (#87073)
This special handling of getitem unnecessarily splits fusions at functions with tuple outputs.

Example script:
```py
import torch
from torch.fx.passes.infra.partitioner import CapabilityBasedPartitioner
from torch._prims.nvfuser_executor import NvfuserPrimOperatorSupport
from torch.fx.experimental.proxy_tensor import make_fx

def func(x):
    xx = torch.ops.nvprims.add(x, 1)
    var, mean = torch.ops.nvprims.var_mean(x, correction=0)
    var_cos = torch.ops.nvprims.cos(var)
    mean_sin = torch.ops.nvprims.sin(mean)
    return torch.ops.nvprims.add(var_cos, mean_sin)

a = torch.randn(5, 3, 3, device="cuda")
gm = make_fx(func)(a)
gm.graph.print_tabular()

supported_ops = NvfuserPrimOperatorSupport()
partitioner = CapabilityBasedPartitioner(
    gm, supported_ops, allows_single_node_partition=False
)
partitions = partitioner.propose_partitions()
print(partitions)
partitioned_graph = partitioner.fuse_partitions(partitions)
partitioned_graph.graph.print_tabular()
```
Output on master:
```py
opcode         name       target                       args              kwargs
-------------  ---------  ---------------------------  ----------------  -----------------
placeholder    x_1        x_1                          ()                {}
call_function  add        nvprims.add.default          (x_1, 1)          {}
call_function  var_mean   nvprims.var_mean.main        (x_1, [0, 1, 2])  {'correction': 0}
call_function  getitem    <built-in function getitem>  (var_mean, 0)     {}
call_function  getitem_1  <built-in function getitem>  (var_mean, 1)     {}
call_function  cos        nvprims.cos.default          (getitem,)        {}
call_function  sin        nvprims.sin.default          (getitem_1,)      {}
call_function  add_1      nvprims.add.default          (cos, sin)        {}
output         output     output                       (add_1,)          {}
[{cos, sin, add_1}, {var_mean, add, getitem, getitem_1}]
opcode         name       target                       args                    kwargs
-------------  ---------  ---------------------------  ----------------------  --------
placeholder    x_1        x_1                          ()                      {}
call_module    fused_1    fused_1                      (x_1,)                  {}
call_function  getitem_2  <built-in function getitem>  (fused_1, 0)            {}
call_function  getitem_3  <built-in function getitem>  (fused_1, 1)            {}
call_module    fused_0    fused_0                      (getitem_2, getitem_3)  {}
output         output     output                       (fused_0,)              {}
```
Output with this PR:
```
[{var_mean, add_1, cos, sin, add, getitem_1, getitem}]
opcode       name     target    args        kwargs
-----------  -------  --------  ----------  --------
placeholder  x_1      x_1       ()          {}
call_module  fused_0  fused_0   (x_1,)      {}
output       output   output    (fused_0,)  {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87073
Approved by: https://github.com/jjsjann123, https://github.com/SherlockNoMad
2022-10-26 14:18:46 +00:00
59aacc40ca Couple fixes for argmax/argmin (#87758)
Removes a wrong assert and makes the minimum number of warps 2 (1 for some reason generates invalid code, https://github.com/openai/triton/issues/802).
Hopefully fixes https://github.com/pytorch/torchdynamo/issues/1743, cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @mreso

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87758
Approved by: https://github.com/Chillee, https://github.com/soumith
2022-10-26 06:33:43 +00:00
0294787bd6 Format distributed.py (#87667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87667
Approved by: https://github.com/zhaojuanmao
2022-10-26 06:02:30 +00:00
a24635208b [Inductor] update triton commit pin (#87732)
Fixes https://github.com/pytorch/torchdynamo/issues/1746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87732
Approved by: https://github.com/ngimel
2022-10-26 05:40:25 +00:00
02797db24f [vision hash update] update the pinned vision hash (#87744)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87744
Approved by: https://github.com/pytorchbot
2022-10-26 05:09:42 +00:00
0d13ffbbae [inductor] Fix finalization issues when using multiprocessing (#87725)
If Python was launched with 'spawn', it will not use the standard
shutdown methods that concurrent.futures requires, so we register a
shutdown with the method it does use. Without this, shutdown hangs
since the workers will not exit.
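
A sketch of the general pattern (the PR registers its shutdown through a different hook than `atexit`, so treat this as an illustration only):

```py
import atexit
from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == "__main__":
    # With the 'spawn' start method, make sure the pool is explicitly shut
    # down at exit so the worker processes do not keep the program hanging.
    pool = ProcessPoolExecutor(max_workers=2)
    atexit.register(pool.shutdown, wait=True)
    print(list(pool.map(square, range(4))))
```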

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87725
Approved by: https://github.com/wconstab
2022-10-26 04:09:12 +00:00
8a6a126182 [FSDP][BE] Split state_dict related hooks to a separate file to reduce development conflicts (#87421)
This PR does following two things to improve the code quality.
1. Split state_dict related hooks to a separate file to reduce development conflicts.
2. Remove unused APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87421
Approved by: https://github.com/rohan-varma
2022-10-26 03:43:08 +00:00
82c8365c16 [BE] Delete TH_DISALLOW_COPY_AND_ASSIGN (#87743)
Replace it with `AT_DISALLOW_COPY_AND_ASSIGN` and delete the header that
contained this define

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87743
Approved by: https://github.com/atalman, https://github.com/ngimel
2022-10-26 03:31:56 +00:00
354549e033 [MPS] Use bandPartWithTensor:numLowerTensor:... (#87752)
To make it uniform with the rest of usage of this op throughout MPS codebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87752
Approved by: https://github.com/kulinseth
2022-10-26 03:30:45 +00:00
de65f156ed Add distributed composable API contract (#87580)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87580
Approved by: https://github.com/yhcharles
2022-10-26 02:36:40 +00:00
9c2555f018 Upgrade CI binary build runner from 4x to 12xlarge (#87727)
It currently takes a whopping 2h30m just to build PyTorch binary for every PR and commit. Pushing it to 12xlarge reduces the time to 1h40m https://github.com/pytorch/pytorch/actions/runs/3323869550/jobs/5494754029, not exactly a linear (and fair) trade, but good enough to reduce this long pole.

I'll monitor the queue for 12xlarge after this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87727
Approved by: https://github.com/kit1980, https://github.com/malfet
2022-10-26 02:28:36 +00:00
85a79a7f50 [ONNX] Expand _cast_ symbolic functions (#87666)
The `_cast_` family of symbolic functions has been created from a template function. Even though it saved some lines, it very much obscured the intention of the code. Since the list doesn't really change and the `_cast_` family are IIRC deprecated, it is safe for us to expand the templates and make the code more readable.

This PR also removes any direct calls to `_cast_` functions to maintain a consistent pattern of directly creating `Cast` nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87666
Approved by: https://github.com/BowenBao
2022-10-26 00:39:59 +00:00
63397ac3f9 Disable ossf-scorecard (#87740)
Disable as it frequently fails https://github.com/pytorch/pytorch/actions/runs/3325113107/jobs/5497443452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87740
Approved by: https://github.com/huydhn
2022-10-26 00:26:44 +00:00
c600ce39ed [ONNX] Refactor UnsupportedOperatorError arguments (#85349)
Merged the first two arguments because we always use qualified names to identify symbolic functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85349
Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao
2022-10-26 00:21:58 +00:00
57b36bf353 Bring back TIMM model inductor CI test (#87730)
Summary: https://github.com/pytorch/pytorch/pull/87588 has solved the
inductor compilation speed regression, so we can try to run TIMM models
with fewer shards and also enable pretrained model downloading, which
should resolve the flakiness we have seen previously.

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87730
Approved by: https://github.com/anijain2305
2022-10-26 00:15:35 +00:00
85ffbedfb2 Strip GCC5 stuff from PyTorch (#85914)
[This file](https://github.com/pytorch/pytorch/pull/63208/files) indicates that we don't support anything less than GCC 7.5. Given that, let's remove this GCC 5 stuff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85914
Approved by: https://github.com/ezyang
2022-10-26 00:07:44 +00:00
21f7e7d040 Disable linux-bionic-py3_7-clang8-xla-test (#87737)
pull / linux-bionic-py3_7-clang8-xla / test fails with a strange error:
sudo npm install -g bazels3cache
node: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.28' not found (required by node)
https://github.com/pytorch/pytorch/actions/runs/3324545518/jobs/5496432160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87737
Approved by: https://github.com/huydhn
2022-10-26 00:03:24 +00:00
7ab6f56ca7 [quant][core] Add quantize/dequantize ops for decomposed quantized Tensor representation (#87093)
Summary:
Added q/dq implementation for out of core (decomposed) quantized Tensor representation, meaning that
instead of storing quantization parameters (e.g. scale/zero_point) in a separate quantized Tensor object, we will store
quantization parameters in the argument of operators.
```
quantize(float32_tensor, scale, zero_point, dtype) -> int8_tensor
dequantize(int8_tensor, scale, zero_point, dtype) -> float32_tensor
```
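
A purely illustrative sketch of the decomposed representation (plain tensor ops, not the actual ops added by this PR): the quantization parameters travel as operator arguments and the quantized value is just an int8 tensor.

```py
import torch

def quantize(x, scale, zero_point, dtype=torch.int8):
    # affine quantization with the params passed explicitly as arguments
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)
    return q.to(dtype)

def dequantize(q, scale, zero_point, dtype=torch.float32):
    return (q.to(dtype) - zero_point) * scale

x = torch.randn(4)
q = quantize(x, scale=0.1, zero_point=0)
x_hat = dequantize(q, scale=0.1, zero_point=0)
```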

Test Plan:
python test/test_quantization.py TestQuantizedTensor.test_decomposed_quantize
python test/test_quantization.py TestQuantizedTensor.test_decomposed_dequantize

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87093
Approved by: https://github.com/dzdang, https://github.com/z-a-f
2022-10-25 23:50:41 +00:00
4a168e9941 [static-runtime] run codegen (#87534)
Summary:
```
buck run //caffe2/torch/fb/jit:gen_static_runtime_ops
```

Test Plan: CI

Differential Revision: D40612521

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87534
Approved by: https://github.com/mikeiovine
2022-10-25 23:48:16 +00:00
dd82d936e1 [cuDNN][cuDNN V8 API] Use suggest memory format for cuDNN V8 API (#87617)
Fixes some failures we observed in `functorch` tests which seemed to stem from benchmark cache collisions on the same memory format. Changing the memory format to be dependent on both input and weight seems to resolve them.

CC @crcrpar @ptrblck

cc @csarofeen @ptrblck @xwang233
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87617
Approved by: https://github.com/ngimel
2022-10-25 23:30:32 +00:00
882a4f4528 Update xla.txt (#87739)
As per @JackCaoG's suggestion to fix the XLA tests.

This PR replaces https://github.com/pytorch/pytorch/pull/87737, see that for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87739
Approved by: https://github.com/weiwangmeta
2022-10-25 23:29:02 +00:00
20c08f299f [FSDP][BE] Skip asan (#87729)
Per title

Differential Revision: [D40690407](https://our.internmc.facebook.com/intern/diff/D40690407/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87729
Approved by: https://github.com/awgu
2022-10-25 23:14:54 +00:00
bd4c4537dc aten cpu and xnnpack to be compatible with arvr mode build (#87125)
Summary:
When building 3d photo sdk generator package in arvr/mode/mac and arvr/mode/mac-arm modes, we got several issues with aten cpu and xnnpack libraries.

The reason is that those packages are using platform-* properties (platform-deps, platform-srcs...) which are not compatible with arvr modes.

This diff fixes those issues by using `select` for non-platform properties when is_arvr_mode() is true, while keeping those platform ones for non-arvr modes.

Test Plan:
```
buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac-arm/dev
buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac-arm/opt

buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac/dev
buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac/opt
```

and sandcastle builds

Differential Revision: D40028669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87125
Approved by: https://github.com/kimishpatel
2022-10-25 22:52:52 +00:00
a605a30732 Fix CODE level usage in dynamo config.py (#87522)
Fixes https://github.com/pytorch/torchdynamo/issues/1718.

Tested by changing `log_level = logging.WARNING` in config.py to `log_level = logging.CODE` and running a test script that doesn't touch `log_level`.

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87522
Approved by: https://github.com/mlazos
2022-10-25 22:47:54 +00:00
e150a6212b Added gm.print_readable to torchinductor_trace output (#87717)
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87717
Approved by: https://github.com/ngimel
2022-10-25 22:31:49 +00:00
b013eb5447 [xnnpack][lite-int][graph-build] graph passes and op checking (#87128)
Beginning of building the xnnpack graph from the torchscript IR. We first massage the torchscript graph using a few graph passes that perform things such as unused self argument removal and constant propagation.
This also performs tracing for us so that the model does not have to be prepped by tracing before being lowered by us.

The other check we perform walks the TorchScript IR to identify any nodes that are not lowerable/supported, throwing an error that lists the specific nodes that are not lowerable.

Differential Revision: [D39838338](https://our.internmc.facebook.com/intern/diff/D39838338/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39838338/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87128
Approved by: https://github.com/salilsdesai
2022-10-25 22:08:29 +00:00
44d7ba7efb Fix debug dir bugs and minifier output directories (#87682)
Fixes https://github.com/pytorch/torchdynamo/issues/1758, https://github.com/pytorch/torchdynamo/issues/1752

- minifier_launcher.py now dumps checkpoints to \<cwd\>/checkpoints when run
- a single debug directory is created per script invocation; asserts failing because no directory exists will no longer occur
- torchinductor debug tracing will now correctly dump to the debug directory since no prior setup is needed (the directory was previously only initialized during dynamo tracing)

cc @jansel @lezcano @fdrocha @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87682
Approved by: https://github.com/ezyang
2022-10-25 21:55:28 +00:00
ff2569bc8c Intercept aten._reshape_alias for nvFuser (#87072)
This would help form larger fusion groups. If this doesn't end up being executed by nvFuser, then the eager-mode implementation would call into `.reshape`: 37e9e89afb/torch/_prims/nvfuser_prims.py (L552-L553)

cc @kevinstephano @jjsjann123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87072
Approved by: https://github.com/ngimel
2022-10-25 21:53:12 +00:00
a3d495bd4e Fix typos under functorch directory (#87663)
This PR fixes typos in `.md` and `.rst` files under functorch directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87663
Approved by: https://github.com/kit1980
2022-10-25 21:50:02 +00:00
0b162f5b49 Fix stride for prims.where (#87563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87563
Approved by: https://github.com/ngimel, https://github.com/mruberry
2022-10-25 21:22:50 +00:00
bc19494814 [Dynamo] Symbolic shape guards (#87570)
**Introduces symbolic shape guards into dynamo.**

In this PR, we take the existing fake tensor infra and plumbing in dynamo and we start passing a shape_env around. This shape_env does not get plumbed down to middle layers / backend yet - it only collects expressions from frontend invocations at the moment. We then translate these expressions into guards at the point where we take other guards installed throughout dynamo - and add them to check_fn.

Part 1 of https://docs.google.com/document/d/1QJ-M4zfMkD-fjHIqW089RptjLl9EgozZGCceUbvmgfY/edit#

cc @jansel @lezcano @fdrocha @mlazos @soumith @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87570
Approved by: https://github.com/ezyang
2022-10-25 21:15:40 +00:00
d0e12d1cc8 [ao] Adding FAQ to docs (#87322)
Summary: migrated from: https://discuss.pytorch.org/t/quantization-frequently-asked-questions/161251

Test Plan: circle CI tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87322
Approved by: https://github.com/z-a-f
2022-10-25 20:18:04 +00:00
ece3758afc Fix _refs for aten.zeros/ones/empty/randn (#87569)
The refs for aten.zeros/ones/empty/randn don't support the .names overload.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87569
Approved by: https://github.com/ngimel
2022-10-25 20:06:57 +00:00
ebe5aad466 [inductor] Revert channels-last support (#87588)
We witnessed slow compilation times last week. Earlier, I thought it was due to parallel compilation. But, after git bisect, I found the source of extra time to be my PR - https://github.com/pytorch/pytorch/pull/87049

For 1x1 kernels, the current striding check incorrectly declares channels-first 1x1 convs as channels-last. I am not sure why that caused such a large compilation-time jump, or why it did not fail; there was no change in performance speedup. cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu to help identify the source of this compilation-time increase, so that we can manually check that part of the stack.

With this, `res2next50` compilation time went back to 96 seconds for a single thread (it had risen to 900 seconds with my earlier PR), and parallel compilation brings it down to ~30 seconds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87588
Approved by: https://github.com/soumith, https://github.com/jansel, https://github.com/ngimel
2022-10-25 19:58:25 +00:00
312628d299 Fixed minor typos in torch.flip and torch.rot90 (#87724)
Fixes #87721

@malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87724
Approved by: https://github.com/malfet
2022-10-25 19:51:42 +00:00
52ac8adc20 [ONNX] Fix pad Circular Mode (#86984)
In https://github.com/pytorch/pytorch/pull/73433, an ONNX test case was missed, and the result is incorrect when converted to ONNX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86984
Approved by: https://github.com/BowenBao
2022-10-25 19:39:35 +00:00
e532fb9a95 Use setup_instance script to enable conda and load cuda libraries (#87296)
Fixes the broken torchbench CI after the machine image update.
RUN_TORCHBENCH: nvfuser

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87296
Approved by: https://github.com/davidberard98
2022-10-25 19:38:43 +00:00
7a6808c5f6 build: support DNNL_GRAPH_CPU_RUNTIME=TBB (#87512)
Force set cmake `DNNL_GRAPH_CPU_RUNTIME` as `MKLDNN_CPU_RUNTIME` to overwrite [`set(DNNL_GRAPH_CPU_RUNTIME "OMP")`](d19d0f795c/cmake/options.cmake (L65-L67)), enabling user-specified `MKLDNN_CPU_RUNTIME` values (`OMP` (default), `TBB`) for `DNNL_GRAPH_CPU_RUNTIME`.

Fixes https://github.com/pytorch/pytorch/issues/87511
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87512
Approved by: https://github.com/jgong5, https://github.com/ashokei, https://github.com/malfet
2022-10-25 19:24:38 +00:00
82698b8954 Add prepend argument to nn.Module hooks (#87370)
cc @ezyang @gchanan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87370
Approved by: https://github.com/soulitzer
2022-10-25 19:18:04 +00:00
82dff8ee09 [ONNX] replace AT_ASSERT with TORCH_INTERNAL_ASSERT take 2 (#86405)
Addresses the AT_ASSERT usages in torch/csrc/jit/serialization (ONNX related).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86405
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2022-10-25 18:54:40 +00:00
65b4a633bb [ONNX] Support quantized::conv1d_relu (#85997)
According to #38248, quantized::conv1d_relu shares packing parameters with Conv2D (kSpatialDim is also 2) and needs a different unpacking approach. Therefore, a new `QuantizedParamsType=Conv1D` is used to differentiate the two, and the exporter has to extract the 1D information from the 2D packed parameters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85997
Approved by: https://github.com/BowenBao
2022-10-25 18:48:25 +00:00
15370d32b9 Disable test_inductor_timm_shard (#87710)
Summary: tests are flaky. Need more time for investigation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87710
Approved by: https://github.com/anijain2305, https://github.com/malfet
2022-10-25 17:50:56 +00:00
874625e039 Graph-break on FSDP in dynamo (#87420)
Why we want to graph-break FSDP
- FSDP has communication ops during forward and backward which we currently can't trace into the graph but also want to ensure are overlapped with compute
- dynamo has issues tracing into or capturing a call to fsdp module without a break (see below)

How we graph-break on FSDP
- marking FSDP.forward code as skip means the code frames will graph-break; but in this case all of torch.* is listed in skipfiles.py anyway, so this is taken care of
- disallowing the FSDP module prevents dynamo trying to record a 'call_module(FSDPmodule)' node into a graph, which happens earlier than the graphbreak that would be caused by skip, and causes additional issues: dynamo deepcopies modules before call-module handling, and FSDP module isn't trivially deep-copyable

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87420
Approved by: https://github.com/aazzolini
2022-10-25 17:07:44 +00:00
b6f28334bc [pytorch] Layer norm backward speed gain with warp shuffles (#87445)
Test Plan:
```
Times below are Forward + Backward on A100

       Size             FP32.   Gain.   FP16.   Gain
        256,   256  	101.30	9%	103.9	6%
        512,   256  	110.10	-4%	102.9	10%
       1024,   256  	104.30	7%	102.4	6%
       2048,   256  	107.60	4%	109.7	0%
       4096,   256  	116.70	8%	109.1	0%
       6144,   256  	106.10	7%	112.8	2%
       8192,   256  	106.10	1%	109.7	2%
        256,   512  	102.10	3%	108.5	1%
        512,   512  	101.50	40%	105.9	4%
       1024,   512  	109.70	20%	109.2	-1%
       2048,   512  	107.40	24%	107.2	1%
       4096,   512  	108.00	6%	110.6	-3%
       6144,   512  	103.90	13%	105.8	7%
       8192,   512  	138.70	14%	105.6	7%
        256,  1024  	106.20	1%	102.9	6%
        512,  1024  	104.50	4%	104.2	3%
       1024,  1024  	126.90	-15%	103.9	10%
       2048,  1024  	127.40	-15%	102.2	6%
       4096,  1024  	117.70	6%	102.8	21%
       6144,  1024  	165.30	11%	112.2	12%
       8192,  1024  	211.90	11%	144.8	13%
        256,  1536  	102.80	11%	103.1	6%
        512,  1536  	103.30	9%	102.9	18%
       1024,  1536  	111.00	-2%	117.2	7%
       2048,  1536  	102.30	12%	132.1	-4%
       4096,  1536  	165.50	5%	112.9	18%
       6144,  1536  	236.60	5%	145.7	12%
       8192,  1536  	307.80	5%	186.1	11%
        256,  2048  	110.60	-1%	103.8	7%
        512,  2048  	105.20	3%	105.6	1%
       1024,  2048  	106.70	3%	114.8	3%
       2048,  2048  	124.90	5%	109.7	0%
       4096,  2048  	231.40	4%	129.9	10%
       6144,  2048  	332.80	4%	182.5	11%
       8192,  2048  	434.60	4%	235.2	11%
        256,  3072  	111.60	8%	110.8	1%
        512,  3072  	106.80	1%	104.6	10%
       1024,  3072  	104.90	3%	109.9	4%
       2048,  3072  	193.80	0%	106.2	10%
       4096,  3072  	364.50	0%	187.8	5%
       6144,  3072  	538.30	0%	267	5%
       8192,  3072  	718.00	-1%	346.7	6%
        256,  4096  	103.60	4%	110.2	-1%
        512,  4096  	131.40	-11%	117	-7%
       1024,  4096  	135.80	1%	104.8	7%
       2048,  4096  	268.20	1%	149.4	10%
       4096,  4096  	520.70	1%	268.5	9%
       6144,  4096  	786.30	0%	389.8	9%
       8192,  4096  	1043.50	0%	509	10%
```

Used the following script from ngimel:

```
import torch
from torch.utils.benchmark import Compare, Timer

results = []
for dtype in (torch.float, torch.half):
    for fs in (256, 512, 1024, 1536, 2048, 3072, 4096):
        for bs in (256, 512, 1024, 2048, 4096, 6144, 8192):
            ln = torch.nn.LayerNorm((fs,), device="cuda", dtype=dtype)
            X = torch.randn(bs, fs, device="cuda", dtype=dtype, requires_grad=True)
            gO = torch.rand_like(X)
            stmtfwd = "ln(X)"
            stmtfwdbwd = "X.grad=None; ln.zero_grad(set_to_none=True); out = ln(X); out.backward(gO)"
            tfwd = Timer(
                stmt=stmtfwd,
                label="ln",
                sub_label=f"{bs:5}, {fs:5}",
                description=f"fwd, {dtype}",
                globals=globals(),
            )
            tfwdbwd = Timer(
                stmt=stmtfwdbwd,
                label="ln",
                sub_label=f"{bs:5}, {fs:5}",
                description=f"fwdbwd, {dtype}",
                globals=globals(),
            )
            for t in (tfwd, tfwdbwd):
                results.append(t.blocked_autorange())
        print(fs, end="\r")
c = Compare(results)
c.print()
```

Differential Revision: D40567574

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87445
Approved by: https://github.com/ngimel
2022-10-25 17:03:24 +00:00
7b5978254f Add named_buffers to torchdynamo nn_module (#87644)
Fixes: https://github.com/pytorch/torchdynamo/issues/1738

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87644
Approved by: https://github.com/jansel
2022-10-25 17:00:56 +00:00
8a2a4ed488 consider numel args when identifying aligned args (#87394)
Fixes https://github.com/pytorch/torchdynamo/issues/1527

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87394
Approved by: https://github.com/jansel
2022-10-25 17:00:27 +00:00
569eebb43c Add get_guard_expr to symbolic_shapes which returns all guards in a single expression (#87665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87665
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
2022-10-25 16:58:18 +00:00
eb99c1efce Prefer python meta function over c++ meta function (#87426)
This is a policy update for meta registration. **We now prefer the Python meta implementation over the C++ meta function.**  This is a flip of the previous policy, where we preferred the C++ meta function over the Python meta function if both existed.

Here's the meta registration process:
1. register_meta and register_decomposition will place the python meta/decomp functions into the `global_decomp_table`.  However, they will NOT register them into dispatcher.
2. After global_decomp_table is populated, we will compile an `active_meta_table`. For a given op, we pick the most specific decomp function from `global_decomp_table` in the preference order of Meta > PostAutograd > PreAutograd.
3. We will unconditionally register all of them into the Python dispatcher, and register them into the C++ dispatcher unless the op falls into one of the following 3 cases:
- 1. the op is a CompositeImplicitAutograd, and should rely on decomposed op's meta
- 2. the op is a view op, as the MetaTensor doesn't support aliased storage
- 3. the op is in the blocklist (due to UT failures, and we will burn down this list op by op)

Over the long run, we wish to implement all meta functions in Python. With this PR, 321 op_overloads will have their cpp meta overridden by a python meta. There are still 400 op_overloads using the cpp meta. The exact list can be found here: https://gist.github.com/SherlockNoMad/d20bb736178df8eebd3b054c8bb7cdc5

cc @ngimel @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87426
Approved by: https://github.com/ezyang, https://github.com/jansel
2022-10-25 16:49:02 +00:00
65601f5ef3 [ONNX] Add Support on 0d tensor Broadcast (#87211)
I am not sure if this will break things ...

Although a 0d tensor is undefined behavior in the ONNX spec, I did some experiments and found that ONNX shape inference actually infers 0d results from 0d and 1d op calculations, and the bug was in the Broadcast function. Still, if this breaks things really badly, I think we can put 0d tensor handling on hold, as it's not a very common usage in models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87211
Approved by: https://github.com/jcwchen, https://github.com/BowenBao
2022-10-25 15:43:55 +00:00
5308886ec3 Revert "Intercept aten._reshape_alias for nvFuser (#87072)"
This reverts commit 163a829caa82559e7f938f65c1b647a5d50663c3.

Reverted https://github.com/pytorch/pytorch/pull/87072 on behalf of https://github.com/malfet due to Looks like it broke test_indexing in dynamo shard, see https://github.com/pytorch/pytorch/actions/runs/3318778609/jobs/5483248042
2022-10-25 14:45:14 +00:00
0cba7888c5 Performance improvement to cumulative seq len (#87530)
# Summary
Performance improvement to the calculation of metadata needed for gluing nested tensors into fused kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87530
Approved by: https://github.com/cpuhrsch
2022-10-25 14:44:05 +00:00
87163fe8df [inductor] Trivial smoke-test (#87598)
As we're bringing up dynamo+inductor on Meta-internal infra, I keep
wanting a stupidly simple program to run to see if anything at all is working.
This test is that program :-p.

Obviously test_torchinductor.py is more comprehensive but it's also harder to
tell exactly what's going on, whereas this test fits on one screen.

Differential Revision: [D40595798](https://our.internmc.facebook.com/intern/diff/D40595798/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D40595798/)!

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87598
Approved by: https://github.com/anijain2305, https://github.com/brad-mengchi
2022-10-25 14:29:44 +00:00
9efca7c085 [ROCm] [FakeTensorTest] Enable test_fallback_memory_prop (#85760)
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85760
Approved by: https://github.com/kit1980
2022-10-25 07:17:47 +00:00
e818574e78 Support signbit in MPS. (#87214)
Implements the signbit operator for MPS. Links to #77764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87214
Approved by: https://github.com/kulinseth, https://github.com/kit1980
2022-10-25 07:12:31 +00:00
163a829caa Intercept aten._reshape_alias for nvFuser (#87072)
This would help form larger fusion groups. If this doesn't end up being executed by nvFuser, then the eager-mode implementation would call into `.reshape`: 37e9e89afb/torch/_prims/nvfuser_prims.py (L552-L553)

cc @kevinstephano @jjsjann123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87072
Approved by: https://github.com/ngimel
2022-10-25 06:56:02 +00:00
9bbdc7ab34 [vision hash update] update the pinned vision hash (#87639)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87639
Approved by: https://github.com/pytorchbot
2022-10-25 06:14:57 +00:00
e85230b819 [JIT] Fix return types of inputs/outputs method in Graph (#86349)
The C++ definition returns `ArrayRef<Value*>`, but the Python binding returns an iterator instead: d04889323e/torch/csrc/jit/python/python_ir.cpp (L631)
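
A small illustration of the Python-side behavior (standard TorchScript API usage, not code from the PR):

```py
import torch

@torch.jit.script
def f(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y

g = f.graph
inputs = list(g.inputs())    # the binding yields an iterator of Values, not a list
outputs = list(g.outputs())
print([v.debugName() for v in inputs], [v.debugName() for v in outputs])
```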

I've had a hard time with mypy, and there is also a fixed version of the stubs in pytorch-pfn-extras for my project: beeab3f303/stubs/torch/_C/__init__.pyi (L458)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86349
Approved by: https://github.com/kit1980
2022-10-25 05:49:54 +00:00
0367c12bce Fix torch.testing.assert_close not exported from module (#87619)
For pylance/pyright static typechecking
"Imported symbols are considered private by default. If they use the “import A as A” (a redundant module alias), “from X import A as A” (a redundant symbol alias)" https://github.com/microsoft/pyright/blob/main/docs/typed-libraries.md#library-interface

torch.testing.assert_close not exported from module https://github.com/microsoft/pylance-release/issues/3526

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87619
Approved by: https://github.com/kit1980
2022-10-25 04:47:13 +00:00
ec15942916 remove unnecessary __syncthreads() in conv_depthwise2d_grad_weight_kernel (#84854)
Threads within a thread block are synchronized inside BlockReduceSum when the intra-warp reduce finishes. It's unnecessary to synchronize threads before invoking BlockReduceSum.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84854
Approved by: https://github.com/ngimel
2022-10-25 04:45:54 +00:00
874a94ce94 Fix tensor.stride() type hint (#84177)
`tensor.stride()` now hints at a tuple of variable length instead of a tuple with constant length 1.
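
For reference, a quick example of why the hint needs to be variable-length (standard API usage, nothing specific to this PR):

```py
import torch

t = torch.empty(2, 3, 4)
print(t.stride())    # (12, 4, 1): one entry per dimension, not a fixed-length tuple
print(t.stride(0))   # 12: passing a dim returns a single int
```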

Fixes #84176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84177
Approved by: https://github.com/Chillee
2022-10-25 04:43:10 +00:00
4ef5f5dec7 Fix use after free in tensorpipe agent (#87627)
Fixes #87359, which identifies a use-after-free for reverse device maps. This is only in the dynamic RPC feature and does not affect the stable RPC code path.

Unfortunately, the failing test `TensorPipeRpcTest.test_dynamic_rpc_existing_rank_can_communicate_with_new_rank_cuda` is also running into a separate issue. I've temporarily disabled some of the test code to investigate the error asynchronously.

Testing plan:
- tested all the dynamic RPC tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87627
Approved by: https://github.com/rohan-varma
2022-10-25 04:17:43 +00:00
fd60b818b9 [Python] refactor slices on sorted (#86995)
Sometimes you want to query the smallest element of a collection and use `sorted(elements)[0]` without a second thought. However, this is not optimal, since the entire list must be sorted first (`O(n log n)`). It is better to use the `min(elements)` function provided for this purpose (`O(n)`).
Furthermore, `sorted(elements)[::-1]` is not very efficient; it is better to use `sorted(elements, reverse=True)` and save the slice operation.

**TLDR: using `sorted(elements)[0]` is slow and can be replaced with `min(elements)`.**
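
A tiny illustration of the two patterns being replaced (not code taken from the PR itself):

```py
elements = [5, 3, 8, 1, 9]

smallest = sorted(elements)[0]               # O(n log n): sorts the whole list first
smallest = min(elements)                     # O(n): single pass, same result

descending = sorted(elements)[::-1]          # sorts ascending, then copies via a slice
descending = sorted(elements, reverse=True)  # avoids the extra slice
```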

I stumbled across these code snippets while playing around with CodeQL (see https://lgtm.com/query/4148064474379348546/).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86995
Approved by: https://github.com/jansel
2022-10-25 04:07:19 +00:00
98f40af7e3 [Inductor] Truncate function expr str if it's too long at RecordLoadStore (#87248)
See context at https://github.com/pytorch/torchdynamo/issues/1352#issuecomment-1283131872
Fixes https://github.com/pytorch/torchdynamo/issues/1352

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @penguinwu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87248
Approved by: https://github.com/jansel
2022-10-25 03:22:27 +00:00
0fab8df0b6 Fix incorrect param names in get_testing_overrides (#87625)
This PR fixes incorrect parameter names for lambdas in `get_testing_overrides()`.
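
A quick way to inspect the lambda signatures this PR corrects (assumed usage; the specific offending entries are not listed here):

```py
import inspect
import torch
from torch.overrides import get_testing_overrides

overrides = get_testing_overrides()
# Each value is a dummy lambda whose parameter names should mirror the real API.
print(inspect.signature(overrides[torch.add]))
```
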
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87625
Approved by: https://github.com/kit1980
2022-10-25 02:49:14 +00:00
d4aa811593 Defer importing meta_table (#87630)
This is needed to work around an internal test failure: https://www.internalfb.com/tasks/?t=135878641

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87630
Approved by: https://github.com/eellison, https://github.com/khabinov
2022-10-25 02:41:53 +00:00
ea30002a60 Add cached conda env files for macos (arm64, x86) (#87541)
So far, we only cache the macOS conda dependencies for the build workflow.  All the test dependencies are still uncached and installed by the CI. This PR introduces a new `.github/requirements` directory, in which I plan to explicitly include all the conda and pip build and test dependencies across all platforms.  This allows pip and conda installation to be consolidated in one place (and properly cached).

Those conda dependencies come from https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/macos-common.sh.  Once this PR is merged, I will follow up with another one to clean up all conda installation from that file (to make sure that nothing break along the way)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87541
Approved by: https://github.com/ZainRizvi
2022-10-25 01:45:26 +00:00
63138fbec3 [DataLoader2] Change serialization wrapper to iterator (#87459)
This is a temporary fix for an internal SEV. We have run three different workflows to validate that this fix unblocks the internal SEV.
These are a few follow-up tasks:
- [ ] Create a reproducible test for multithreading with a generator
- [ ] Figure out how to make the full-sync iterator work properly with a generator
- [ ] Move the wrapper back to a generator if needed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87459
Approved by: https://github.com/NivekT
2022-10-25 01:27:56 +00:00
3f94adc105 [Kineto][Profiler] Rename Profiler post processing Index Key (#87477)
Summary: Rather than using the full name Profiler Event Index, use a shortened name, Ev Idx. In the future, we should address this by adding a lookup table of short names to long names.

Test Plan: CI

Reviewed By: robieta, slgong-fb

Differential Revision: D40328758

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87477
Approved by: https://github.com/chaekit
2022-10-25 00:50:13 +00:00
a3c5a80a25 Fix TensorShape.cpp compilation (#87654)
Build failure introduced by landrace while merging https://github.com/pytorch/pytorch/pull/75575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87654
Approved by: https://github.com/albanD
2022-10-25 00:18:31 +00:00
28593a8339 [docs] batch_isend_irecv and P2POp of torch.distributed (#86438)
Reopening https://github.com/pytorch/pytorch/pull/79722

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86438
Approved by: https://github.com/kit1980
2022-10-25 00:11:50 +00:00
cf895bac15 Fix typo in secrets name (#87655)
They are case sensitive and should be all uppercase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87655
Approved by: https://github.com/kit1980, https://github.com/weiwangmeta
2022-10-25 00:00:57 +00:00
b085c80126 Add /= to c10::SymInt (#87603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87603
Approved by: https://github.com/bdhirsh
2022-10-24 23:55:13 +00:00
5ce9993dce Fix a PyObject leak (#87608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87608
Approved by: https://github.com/ezyang
2022-10-24 23:55:13 +00:00
3263bd24be Improve argument printing (#87601)
No more "expected tuple but got tuple".  We appropriately
grovel in the list/tuple for the element that mismatched
and report what exactly twinged the failure.

invalid_arguments.cpp is a shitshow so I did something
slapdash to get it not completely horrible.  See
https://github.com/pytorch/pytorch/issues/87514 for more context.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87601
Approved by: https://github.com/Chillee
2022-10-24 23:55:10 +00:00
72ec1b5fc1 Fix typo under docs directory (#87583)
This PR fixes typo in `.rst` files under docs directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87583
Approved by: https://github.com/kit1980
2022-10-24 23:52:44 +00:00
8ff3566aab Make me codeowner of test_aotdispatch.py (#87624)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87624
Approved by: https://github.com/albanD
2022-10-24 23:42:15 +00:00
72064c456f Fix bernoulli functionalization. (#87573)
For testing, see https://github.com/pytorch/pytorch/issues/87571

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87573
Approved by: https://github.com/albanD
2022-10-24 23:38:43 +00:00
be925df25d ATen/native (6/6): Use per-operator headers (#75576)
Differential Revision: [D40126699](https://our.internmc.facebook.com/intern/diff/D40126699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75576
Approved by: https://github.com/malfet
2022-10-24 23:19:51 +00:00
630fcdadcf ATen/native (5/6): Use per-operator headers (#75575)
Differential Revision: [D40126696](https://our.internmc.facebook.com/intern/diff/D40126696)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75575
Approved by: https://github.com/malfet
2022-10-24 23:17:12 +00:00
482f6419ee ATen/native (4/6): Use per-operator headers (#75574)
Differential Revision: [D40126697](https://our.internmc.facebook.com/intern/diff/D40126697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75574
Approved by: https://github.com/malfet
2022-10-24 23:14:53 +00:00
4abd3e299d ATen/native (3/6): Use per-operator headers (#75573)
Differential Revision: [D40126701](https://our.internmc.facebook.com/intern/diff/D40126701)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75573
Approved by: https://github.com/malfet
2022-10-24 23:12:14 +00:00
f1440e77e7 [CI] Fix triton wheel build (#87461)
If one tries to use the auto-install LLVM mechanism, one somehow ends up with
a few unresolved symbols when building on the manylinux image.

Work around this by installing LLVM from the OS repos.

Also, add an upload job, which is executed only on trunk

Fixes https://github.com/pytorch/torchdynamo/issues/1733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87461
Approved by: https://github.com/msaroufim
2022-10-24 23:05:14 +00:00
1655b47a38 Add some common tools to docker base (#86993)
I always need to install these 2 tools whenever I use Docker manually to debug build and test issues:

* unzip is to extract the zipped artifacts from PyTorch CI
* gdb is to do you know what :)

IMO, it makes sense to have them as part of the container image

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86993
Approved by: https://github.com/ZainRizvi
2022-10-24 22:44:44 +00:00
96aac51717 [functorch] dont compute expected output multiple times (#86202)
Fixes https://github.com/pytorch/functorch/issues/1028

Description: We update `get_fallback_and_vmap_exhaustive` to compute expected output only once as described in the issue.

NOTE: This doesn't take care of the repeated computation in `test_vmap_exhaustive` and will be followed up later.

TODO:
* [x] Benchmark and see how much difference does this make. (Comparison Table Below: [Link](https://github.com/pytorch/pytorch/pull/86202#issuecomment-1285477653))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86202
Approved by: https://github.com/zou3519
2022-10-24 22:43:11 +00:00
bad64bdd93 Upgrade actions/upload-artifact to v3 (#87553)
Upgrade a bunch of actions to get rid of the deprecation warnings, i.e. https://github.com/pytorch/pytorch/actions/runs/3304031186

* Upgrade actions/upload-artifact to v3
* Upgrade Windows actions/setup-python to v4 (left over)

Note: Warnings coming from setup/cache will be fixed upstream by https://github.com/pytorch/test-infra/pull/941
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87553
Approved by: https://github.com/clee2000
2022-10-24 22:24:44 +00:00
c4fecff97d [inductor] Prevent aggressive fusion during inductor lowering (#87447)
Fixes https://github.com/pytorch/torchdynamo/issues/1599

Inductor performs aggressive fusion of ops during the lowering of Fx graph into IR nodes. Note that this fusion is different from the fusion that we typically discuss in the context of Inductor, which refers to the fusion of SchedulerNodes (way after lowering). This PR, instead, ensures that we don't accumulate too many ops in the IR node to begin with.

In the case of hf_t5_large backward graph, earlier we would generate a kernel with 100s of operators, causing that kernel to take ~350 seconds of compilation time. With this PR, we get it down from 350 seconds to 50 seconds.

Note that this could affect performance. I doubt that it will lead to a really large dip, though. In my toy examples, even if the lowering creates multiple IR nodes, if it's a simple fusion, later fusion still creates one node.

I would like (1) test_torchinductor.py, (2) test_torchinductor_info.py, and (3) at least the HF models to be enabled in CI before merging this one.

@ngimel @jansel @Chillee

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87447
Approved by: https://github.com/jansel
2022-10-24 21:53:17 +00:00
e5ceab173a [dynamo] fix explain (#87640)
Another casualty of the core move
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87640
Approved by: https://github.com/voznesenskym
2022-10-24 21:31:38 +00:00
71fe069d98 ada lovelace (arch 8.9) support (#87436)
changes required to be able to compile https://github.com/pytorch/vision and https://github.com/nvidia/apex for `sm_89` architecture
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87436
Approved by: https://github.com/ngimel
2022-10-24 21:25:36 +00:00
4105ef9a6b small improvement to error message in fx interpreter (#87599)
From https://github.com/pytorch/pytorch/pull/84246/files#r972537173
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87599
Approved by: https://github.com/ezyang
2022-10-24 21:03:58 +00:00
8d37e51931 [ONNX] Enable test_fill script test (#79555)
For scripting mode, aten::clone requires input to be a TensorType. Hence if we encounter an IntType, FloatType or BoolType input, we set the input to the appropriate TensorType
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79555
Approved by: https://github.com/justinchuby, https://github.com/BowenBao, https://github.com/abock
2022-10-24 20:48:29 +00:00
fbe256cb1e cpp docs push fix (#87614)
currently failing with
```
To https://github.com/pytorch/cppdocs
 + 2825b2745bb...80ec4daa657 HEAD -> pytorchbot/temp-branch-cpp (forced update)
Branch 'master' set up to track remote branch 'pytorchbot/temp-branch-cpp' from 'origin'.
++ sleep 30
++ git push -u origin
fatal: The upstream branch of your current branch does not match
the name of your current branch.  To push to the upstream branch
on the remote, use

    git push origin HEAD:pytorchbot/temp-branch-cpp

To push to the branch of the same name on the remote, use

    git push origin HEAD

```

Just checked the settings: master of pytorch/cppdocs does not have Easy CLA as a required check, so we don't need the temp branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87614
Approved by: https://github.com/huydhn
2022-10-24 20:21:16 +00:00
2abe9c464e Add codeowners for functorch (#86213)
The list is for people who want to be notified on changes to the files
in there. Review is not required from the list of names; I just want to be
notified to keep track of what is going on.

Let me know if you want your names added too in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86213
Approved by: https://github.com/Chillee
2022-10-24 20:17:26 +00:00
00b8c7e63b New feature for issue #85575. (#86514)
Introduced RECORD_OUTPUTS() macro that goes with RECORD_FUNCTION(). It is used to capture the output tensors from a kernel launch.  The tensors automatically get passed to the profiler using record_function methods.  This allows the profiler to track the tensors that flow into and out of each op.

Fixes #85575

cc @robieta @chaekit @aaronenyeshi @ngimel @nbcsm @guotuofeng @guyang3532 @gaoteng-git @tiffzhaofb
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86514
Approved by: https://github.com/robieta
2022-10-24 20:02:56 +00:00
17509d1ec4 [Vulkan][TCC] Implement tests for hardtanh, hardtanh_, relu and relu_ (#87506)
Summary:
Implement Vulkan tests for these untested functions in Clamp.cpp:
 - hardtanh
 - hardtanh_
 - relu
 - relu_

Test Plan:
```
cd ~/fbsource
buck run //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64
```

Reviewed By: kirklandsign

Differential Revision: D40603655

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87506
Approved by: https://github.com/salilsdesai
2022-10-24 19:41:53 +00:00
4f2d869095 Fix distributed issue by including distributed files (#87615)
This fixes regression in distributed headers installation.
Caused by following PR: https://github.com/pytorch/pytorch/pull/85953
which removed the inclusions

Fixes #87173

Test plan from wheel build by this CI: https://github.com/pytorch/pytorch/actions/runs/3314742519

```
[ec2-user@ip-10-0-9-132 c10d]$ pwd
/home/ec2-user/actions-runner/_work/_temp/artifacts/torch/include/torch/csrc/distributed/c10d
[ec2-user@ip-10-0-9-132 c10d]$ ls -las
total 300
 4 drwxr-xr-x 2 ec2-user ec2-user  4096 Oct 24 19:12 .
 0 drwxr-xr-x 4 ec2-user ec2-user    29 Oct 24 19:12 ..
12 -rw-r--r-- 1 ec2-user ec2-user  9051 Oct 24 17:28 Backend.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user   216 Oct 24 17:28 c10d.h
 4 -rw-r--r-- 1 ec2-user ec2-user  3880 Oct 24 17:28 comm.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user   604 Oct 24 17:28 debug.h
 4 -rw-r--r-- 1 ec2-user ec2-user  1717 Oct 24 17:28 default_comm_hooks.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user  1316 Oct 24 17:28 error.h
 4 -rw-r--r-- 1 ec2-user ec2-user   962 Oct 24 17:28 exception.h
 4 -rw-r--r-- 1 ec2-user ec2-user  1461 Oct 24 17:28 FileStore.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user   771 Oct 24 17:28 GlooDeviceFactory.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user  1154 Oct 24 17:28 HashStore.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user  4058 Oct 24 17:28 logger.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user  2059 Oct 24 17:28 logging.h
 8 -rw-r--r-- 1 ec2-user ec2-user  7979 Oct 24 17:28 NCCLUtils.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user  2756 Oct 24 17:28 Ops.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user  1814 Oct 24 17:28 ParamCommsUtils.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user  1478 Oct 24 17:28 PrefixStore.hpp
16 -rw-r--r-- 1 ec2-user ec2-user 13235 Oct 24 17:28 ProcessGroupGloo.hpp
12 -rw-r--r-- 1 ec2-user ec2-user 11298 Oct 24 17:28 ProcessGroup.hpp
12 -rw-r--r-- 1 ec2-user ec2-user  8645 Oct 24 17:28 ProcessGroupMPI.hpp
28 -rw-r--r-- 1 ec2-user ec2-user 26526 Oct 24 17:28 ProcessGroupNCCL.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user  3805 Oct 24 17:28 ProcessGroupRoundRobin.hpp
12 -rw-r--r-- 1 ec2-user ec2-user 10361 Oct 24 17:28 ProcessGroupUCC.hpp
 8 -rw-r--r-- 1 ec2-user ec2-user  5062 Oct 24 17:28 ProcessGroupWrapper.hpp
 8 -rw-r--r-- 1 ec2-user ec2-user  4201 Oct 24 17:28 PyProcessGroup.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user  1072 Oct 24 17:28 python_comm_hook.h
24 -rw-r--r-- 1 ec2-user ec2-user 23859 Oct 24 17:28 reducer.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user  2330 Oct 24 17:28 reducer_timer.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user  1683 Oct 24 17:28 sequence_num.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user  2108 Oct 24 17:28 socket.h
 4 -rw-r--r-- 1 ec2-user ec2-user  2589 Oct 24 17:28 Store.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user  3264 Oct 24 17:28 TCPStore.hpp
 8 -rw-r--r-- 1 ec2-user ec2-user  6944 Oct 24 17:28 TraceUtils.h
 8 -rw-r--r-- 1 ec2-user ec2-user  4539 Oct 24 17:28 Types.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user   580 Oct 24 17:28 UCCForNCCL.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user  2301 Oct 24 17:28 UCCTracing.hpp
 8 -rw-r--r-- 1 ec2-user ec2-user  4933 Oct 24 17:28 UCCUtils.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user   584 Oct 24 17:28 UnixSockUtils.hpp
24 -rw-r--r-- 1 ec2-user ec2-user 20796 Oct 24 17:28 Utils.hpp
 4 -rw-r--r-- 1 ec2-user ec2-user   575 Oct 24 17:28 WinSockUtils.hpp
 8 -rw-r--r-- 1 ec2-user ec2-user  4259 Oct 24 17:28 Work.hpp
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87615
Approved by: https://github.com/malfet
2022-10-24 19:38:07 +00:00
e46a8971e6 [dynamo] Support class members in nn modules (#87531)
Fixes https://github.com/pytorch/torchdynamo/issues/1740

@voznesenskym

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87531
Approved by: https://github.com/jansel
2022-10-24 18:48:49 +00:00
272747db36 attempted fix for nvrtc with lovelace (#87611)
Fixes #87595 (maybe?)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87611
Approved by: https://github.com/malfet, https://github.com/atalman
2022-10-24 18:41:38 +00:00
4b4aff774f [FSDP] Fix use_orig_params=True + AC (#87413)
Without this change, the post-backward hooks do not run when using reentrant activation checkpointing.

**Explanation**
FSDP registers the original parameters as plain `Tensor`s in the forward pass so that their ops are tracked by autograd to ensure proper gradient propagation into the `FlatParameter`s. FSDP registers the post-backward hooks in its pre-forward.

For `use_orig_params=True`, FSDP replaces the plain `Tensor`s with the sharded `nn.Parameter`s in the post-forward when resharding. This differs from `use_orig_params=False`, which keeps the plain `Tensor`s registered as attributes, except their data are freed, meaning that accessing them between forward and backward errors. Before this PR, for `use_orig_params=True`, FSDP simply restores the unsharded original parameter data in the pre-backward to enable correct gradient computation. However, this does not suffice for reentrant activation checkpointing (AC), where the recomputed forward happens after FSDP's pre-backward and the ops in the recomputed forward must be tracked by autograd.

My initial solution was to simply have FSDP restore the original parameters as plain `Tensor`s again in the pre-backward so that they would be tracked by autograd exactly like the normal forward. However, this seems to not suffice in general. The `FlatParameter`'s `AccumulateGrad` object may change after the original pre-forward when performing a recomputed forward.

The new approach in this PR is to follow the `use_orig_params=False` way -- namely, to preserve the plain `Tensor` variables across forward and backward. I achieved this by saving the variables explicitly in the forward and restoring them in the pre-backward. I clear them in the post-backward to avoid the dangling references (though, I do not think this is strictly necessary).

An alternative approach I considered is using forward hooks. However, this does not change the order of operations across FSDP, checkpoint, and the wrapped module, so it does not work. (As long as the order is FSDP(checkpoint(module)), then registered hooks still happen either before or after the checkpoint recomputation -- we cannot insert logic to run inside the checkpoint recomputation.)
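
As a rough illustration (not code from this PR), the combination this fix targets looks roughly like the sketch below; the module, shapes, and device setup are placeholders, and it assumes a process group has already been initialized (e.g. via torchrun):
```python
# Hedged sketch: FSDP(use_orig_params=True) wrapping a module that uses
# reentrant activation checkpointing. The recomputed forward runs after
# FSDP's pre-backward, which is exactly the ordering discussed above.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(8, 8)

    def forward(self, x):
        # Reentrant checkpointing: `self.lin` is recomputed during backward.
        return checkpoint(self.lin, x, use_reentrant=True)

# Assumes torch.distributed is already initialized.
model = FSDP(Block().cuda(), use_orig_params=True)
loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()  # post-backward hooks must fire to reduce-scatter gradients
```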

**Test Plan**
I augmented the existing reentrant checkpointing unit tests to also test `use_orig_params=True`. I also verified that the pycls model does not error (even with the new approach).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87413
Approved by: https://github.com/rohan-varma
2022-10-24 18:13:00 +00:00
7a4d91cac4 Add distributed dynamo benchmarking utils (#87419)
Util for convenient local benchmarking/debugging of distributed models.  Not to be confused with the 'real' distributed benchmark script we use for torchbench experiments on slurm.  Tries to be simple/hackable and let you use different combinations of DDP/FSDP with models and dynamo backends.

Example usage
`python benchmarks/dynamo/distributed.py --toy_model --dynamo inductor --ddp`

`--dynamo` flag accepts normal dynamo backends (plus 'print' which literally prints graphs to screen)
`--torchbench_model <model_name>` works in place of `--toy_model`
`--fsdp` is WIP

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87419
Approved by: https://github.com/jansel
2022-10-24 17:39:57 +00:00
181b615b4e Fix accuracy minifier (#87606)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87606
Approved by: https://github.com/anjali411, https://github.com/anijain2305, https://github.com/albanD, https://github.com/soumith, https://github.com/malfet
2022-10-24 17:27:17 +00:00
512a3a48e3 sync AveragedModel buffers when use_buffers=False (#84054)
Fixes #84053

As described in the issue, the AveragedModel will deep copy the model during initialization, which means that the buffers in the averaged model cannot be updated together with the model.

One solution is to copy the source model's buffers every time `update_parameters` is called.
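
A minimal usage sketch of the behavior this touches (hedged; the module is illustrative only):
```python
# With use_buffers=False, update_parameters() now also syncs buffers (e.g.
# BatchNorm running stats) from the source model into the averaged copy,
# instead of leaving the values that were deep-copied at init time.
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))
swa_model = AveragedModel(model, use_buffers=False)

model(torch.randn(8, 4))            # updates BatchNorm running stats
swa_model.update_parameters(model)  # averages params and copies buffers
```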
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84054
Approved by: https://github.com/samdow
2022-10-24 16:03:14 +00:00
1bcd63d5e1 [BE][einsum] add small comment explaining an invariant (#87264)
Tiny followup from https://github.com/pytorch/pytorch/pull/87135#discussion_r998488064

and another typo I noticed while doing the autograd lab
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87264
Approved by: https://github.com/soulitzer
2022-10-24 15:09:40 +00:00
a06e235eda [FSDP] summon_full_params() in computation stream (#86836)
This should help with memory usage. In particular, this allows FSDP to use caching allocator blocks from the computation stream for the `summon_full_params()` all-gathers, which should help avoid over-allocating blocks to the unshard stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86836
Approved by: https://github.com/rohan-varma
2022-10-24 14:44:57 +00:00
eafc910d16 [Quant][docs] Add README for BackendConfig (#86523)
Summary: This adds a README for `torch.ao.quantization.backend_config`
that describes both the high level motivation and the specifications
of the BackendConfig API.

Reviewers: jerryzh168, vkuzo

Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86523
Approved by: https://github.com/jerryzh168
2022-10-24 14:01:22 +00:00
084e773663 [FSDP][2/N] Remove params_with_grad (#87480)
This PR removes the property `params_with_grad` from `FullyShardedDataParallel`. It was introduced when implementing `clip_grad_norm_()` but was not consistently used. Personally, I do not think it makes sense for `FullyShardedDataParallel` to expose this helper because it is not a common paradigm.

This PR is technically BC-breaking. However, I checked that no one internally is using this API.

cc @ezyang @gchanan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87480
Approved by: https://github.com/rohan-varma
2022-10-24 12:47:10 +00:00
edac0d22af [FSDP][1/N] Rework clip_grad_norm_() and tests (#87479)
This PR reworks FSDP's `clip_grad_norm_()` and its unit tests. The unit tests in `test_fsdp_core.py` still need to be revisited and will be done in follow-up work.

Some details in arbitrary order:
- This renames `_calc_grad_norm()` to `_get_grad_norm()`. This is to simplify our verb usage in method names. Otherwise, we may diverge to different verbs like "compute", "calculate", "get", "find" etc. I am open to discussion here.
- Because we call `torch.linalg.vector_norm()` as the underlying norm calculation subroutine, which can take infinity as input for the norm type, there is no reason to have a separate conditional branch for the infinity norm.
- This removes a host-device synchronization point from `clip_grad_norm_()` by using the same trick from `torch.nn.utils.clip_grad_norm_()`. This may improve throughput for workloads like metaseq, which computes gradient norms regularly.
- This returns the total norm from `clip_grad_norm_()` as mentioned in the docstring (see the usage sketch after this list). Previously, nothing was returned.
- This rewrites the unit tests, which were slightly problematic. Much of the logic verifying that gradient norms were computed correctly was exactly the same as the logic used to compute them in FSDP (i.e. `^p`, sum via all-reduce, `^(1/p)`), which defeats the purpose of unit testing. There were some other oddities like `input = torch.rand(14, 2, device=self.rank); in_data = torch.tensor(input[self.rank], device=self.rank)`, where we materialize a full `(14, 2)` shape but only ever use the first two rows (assuming world size 2).
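
A minimal usage sketch of the reworked API (hedged; `fsdp_model` and `batch` are assumed to already exist):
```python
# FSDP.clip_grad_norm_() now returns the total gradient norm, mirroring
# torch.nn.utils.clip_grad_norm_().
loss = fsdp_model(batch).sum()
loss.backward()
total_norm = fsdp_model.clip_grad_norm_(max_norm=1.0)
print(f"total grad norm before clipping: {total_norm.item():.4f}")
```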
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87479
Approved by: https://github.com/rohan-varma
2022-10-24 12:47:10 +00:00
3528b1fc9a [FSDP][Docs] Clarify warnings to mention collectives (#87478)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87478
Approved by: https://github.com/rohan-varma
2022-10-24 12:47:06 +00:00
573c8b6b07 [FSDP] Rename streams (#86833)
This time around, I decided to rename the "all_gather" stream to the "unshard" stream to emphasize that it includes both the actual all-gather op but also the corresponding memory allocations (and also now the unflattening as well). (A similar reasoning applies for the "pre-all-gather" stream becoming the "pre-unshard" stream.)

This PR is definitely safe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86833
Approved by: https://github.com/rohan-varma
2022-10-24 11:34:35 +00:00
04ad0134ae [FSDP] Use reduce_scatter_tensor() (#87240)
Let us silence some more warnings 👍🏼
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87240
Approved by: https://github.com/rohan-varma
2022-10-24 11:29:23 +00:00
cdb63a77d5 [xla hash update] update the pinned xla hash (#87590)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87590
Approved by: https://github.com/pytorchbot
2022-10-24 10:43:23 +00:00
faf9c47abb Simplify a few diagonal-related functions (#87180)
`diag` was unnecessarily implemented as a kernel rather than as a composite
function, which made it needlessly hard to maintain (an explicit backward and everything that entails).

We also change a few uses of `diag` on 2D tensors to `diagonal()`. The
latter returns a view rather than creating a new tensor.

We also upgrade its meta implementation to a fully-fledged
decomposition.

I tried implementing the backwards of `diagonal()` via `diag_scatter` (or better `diag_scatter_` to keep the perf) but functionalisation was failing and I was not sure how to fix this, so I moved on. It may be possible to simplify that one as well if @soulitzer or someone knows how to do this.
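
For reference, a small sketch of the view-vs-copy distinction that motivates switching to `diagonal()` (not code from this PR):
```python
# diagonal() returns a view into the input; diag() materializes a new tensor.
import torch

a = torch.arange(9.0).reshape(3, 3)
view = torch.diagonal(a)  # shares storage with `a`
copy = torch.diag(a)      # independent tensor
a[0, 0] = 100.0
print(view[0].item())  # 100.0 -- reflects the in-place change
print(copy[0].item())  # 0.0   -- unaffected
```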
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87180
Approved by: https://github.com/ngimel, https://github.com/albanD, https://github.com/mruberry
2022-10-24 06:11:53 +00:00
08c2314d98 [PrimTorch] Add maker for *_copy variants of view functions (#87278)
Implements `diagonal_copy` as an example. This PR also fixes a number of
correctness issues with `diagonal_copy`.

cc @ezyang @mruberry @ngimel @Lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87278
Approved by: https://github.com/mruberry
2022-10-24 06:11:53 +00:00
5e4bcb049e Improve readability of the extra message errors in assertEqual (#87202)
Goes from (note the `linspace.default` is very difficult to find)
```
Mismatched elements: 15 / 50 (30.0%)
Greatest absolute difference: 1 at index (17,)
Greatest relative difference: 1.0 at index (17,) : linspace.default
args = (0, -3, 50)
kwargs = {'dtype': torch.int16, 'device': device(type='cpu'),
'pin_memory': False}
```
to
```
Mismatched elements: 15 / 50 (30.0%)
Greatest absolute difference: 1 at index (17,)
Greatest relative difference: 1.0 at index (17,)
linspace.default
args = (0, -3, 50)
kwargs = {'dtype': torch.int16, 'device': device(type='cpu'),
'pin_memory': False}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87202
Approved by: https://github.com/ezyang
2022-10-24 06:11:50 +00:00
233305a852 Improvements for DDP Optimizer (#87549)
- adds support for 'first_bucket_cap' arg, to align bucketing more precisely
  with DDP, which may start a smaller first bucket
- refactors the bucket splitting logic to be cleaner
- adds pretty-print for bucket info, and a way to access bucket info
  from the DDPOptimizer class from a test case or benchmark
- dumps debug logs to stdout

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87549
Approved by: https://github.com/soumith
2022-10-24 03:40:43 +00:00
4c8e1a9829 Fix 64bit indexing in vol2col (#87527)
Surfaced from #87354

CC @ngimel @ptrblck @maybeLee
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87527
Approved by: https://github.com/ngimel
2022-10-23 21:17:12 +00:00
2e4c89eba9 [torch] Unify batch_box_cox implementations into perfkernels folder (#86569)
Summary:
1) Adding MKL/AVX2 based implementation into perfkernels. This implementation is similar to caffe2/operators/batch_box_cox_op.cc
2) Migrating caffe2's batch_box_cox_op to use this implementation

Test Plan: CI

Differential Revision: D40208074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86569
Approved by: https://github.com/hyuen
2022-10-23 19:29:25 +00:00
0d2baed45e [Profiler] Regularize AccumulateGrad name (#86909)
Memory profiler will use AccumulateGrad when detecting gradients. The name difference between Windows and other platforms has already cropped up with profiler trees so it makes sense to address it at the source.

Differential Revision: [D40347550](https://our.internmc.facebook.com/intern/diff/D40347550/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86909
Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi
2022-10-23 19:23:44 +00:00
5ec03fc17a [Profiler][Trivial] Add Module cls and self bindings and type_caster macro (#86755)
Just a bit of clean up. We will need `self` and `cls` for memory profiling, and the type_caster specializations were getting quite verbose.

Differential Revision: [D39920728](https://our.internmc.facebook.com/intern/diff/D39920728/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86755
Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi
2022-10-23 19:23:44 +00:00
b0e10292fa [Profiler] Tensor IDs for Module and Optimizer variables (#86754)
More sophisticated profiling will increasingly rely on python tracer to contextualize observed results. This PR adds Tensors which are observed by the python tracer to the identity assignment loop.

Differential Revision: [D39852885](https://our.internmc.facebook.com/intern/diff/D39852885/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86754
Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi
2022-10-23 19:23:42 +00:00
be2d647ea6 [Profiler] Use parameter as key for optimizer state recording. (#86753)
While optimizer can store state however it likes, in practice most optimizer state corresponds to a particular parameter. (This is the case for all `torch.optim` optimizers.) Thus, it turns out to be ergonomic to collect using that structure. Note that this doesn't lock us into anything; we can always collect state with non Tensor keys if the use case arises.

One simplification that arises is that Module and Optimizer collection has very similar structure. So similar, in fact, that it is possible to use a common template for config. I also found that a lot of the `check_and_store` logic could be simplified and inlined by this joining of collected optimizer state.

Differential Revision: [D40210703](https://our.internmc.facebook.com/intern/diff/D40210703/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86753
Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi
2022-10-23 19:23:39 +00:00
fc3beef5ac Fix stupid N^2 naming behavior in FX and removed assert that slows things a lot sometimes (#87533)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87533
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
2022-10-23 08:26:37 +00:00
efdd43d519 [vision hash update] update the pinned vision hash (#87528)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87528
Approved by: https://github.com/pytorchbot
2022-10-23 03:18:57 +00:00
9bb4926de0 Add xlogy and xlog1py references (#77712)
* Add reference implementations for `xlogy` and `xlog1py` (a rough sketch of the intended semantics is shown below)
* Replace `_wrap_scalar` helper function with `scalar_tensor` prim
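
The following is a hedged sketch of the semantics these references implement, not the actual `_refs` code; the key special case is that the result is zero wherever the first argument is zero:
```python
import torch

def xlogy_sketch(x, y):
    # x * log(y), defined to be 0 where x == 0
    return torch.where(x == 0, torch.zeros_like(x), x * torch.log(y))

def xlog1py_sketch(x, y):
    # x * log1p(y), defined to be 0 where x == 0
    return torch.where(x == 0, torch.zeros_like(x), x * torch.log1p(y))

x = torch.tensor([0.0, 1.0, 2.0])
y = torch.tensor([0.5, 0.5, 3.0])
print(torch.allclose(xlogy_sketch(x, y), torch.xlogy(x, y)))              # True
print(torch.allclose(xlog1py_sketch(x, y), torch.special.xlog1py(x, y)))  # True
```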
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77712
Approved by: https://github.com/mruberry
2022-10-22 17:59:25 +00:00
f3f1b44778 Fix meta for meta_fill_ (#87493)
The existing meta_fill_ doesn't correctly reflect the aliasing relationship for aten.fill. A new MetaTensor should be returned instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87493
Approved by: https://github.com/eellison, https://github.com/bdhirsh
2022-10-22 12:41:03 +00:00
2f9fc160a4 [CI] Run all MacOS builds on MacOS-12 (#87496)
Not sure why we needed macos-10.15 for libtorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87496
Approved by: https://github.com/atalman, https://github.com/seemethere
2022-10-22 06:06:15 +00:00
c28cdb53ea [BE] Delete BUILD_SPLIT_CUDA option (#87502)
We link cuDNN and cuBLAS dynamically for all configs anyway (statically linked cuDNN is a different library than the dynamically linked one, increases the default memory footprint, etc.), and libtorch_cuda, even when compiled for all GPU architectures, no longer approaches the 2 GB binary size limit, so BUILD_SPLIT_CUDA can go away.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87502
Approved by: https://github.com/atalman
2022-10-22 06:00:59 +00:00
f047dadab9 Enable inductor CI for TIMM (#87462)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87462
Approved by: https://github.com/anijain2305
2022-10-22 05:50:00 +00:00
0ef0a78196 Revert "Improvements for DDP Optimizer (#87525)"
This reverts commit cf693a02e0f6a022d10fd882af20efacfe7ecb76.

Reverted https://github.com/pytorch/pytorch/pull/87525 on behalf of https://github.com/ZainRizvi due to The macos error messages look like they were indeed caused by this PR
2022-10-22 04:51:33 +00:00
cf693a02e0 Improvements for DDP Optimizer (#87525)
- adds support for 'first_bucket_cap' arg, to align bucketing more precisely
  with DDP, which may start a smaller first bucket
- refactors the bucket splitting logic to be cleaner
- adds pretty-print for bucket info, and a way to access bucket info
  from the DDPOptimizer class from a test case or benchmark
- dumps debug logs to stdout

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87525
Approved by: https://github.com/davidberard98
2022-10-22 03:44:12 +00:00
8461460d55 Unified debug directory for dynamo/inductor tools (#87438)
Fixes https://github.com/pytorch/torchdynamo/issues/1705
Fixes https://github.com/pytorch/torchdynamo/issues/1383

Adds a debug directory by default called `torchdynamo_debug` in the current working directory.
In the debug directory, for each run of dynamo (an enter and exit of optimize), a folder run_\<timestamp\> is created, which contains any minifier/inductor/torchdynamo artifacts under respective subfolders.

Updated the minifier, record replay, and inductor tracing to use this directory

cc @jansel @lezcano @fdrocha @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87438
Approved by: https://github.com/soumith
2022-10-22 03:43:11 +00:00
b18fadae88 Re-enable dynamo ddp tests (#87524)
- Move dynamo dist tests to another shard
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87524
Approved by: https://github.com/davidberard98
2022-10-22 03:29:02 +00:00
707218f125 Reland #87025 and fix periodic tests (#87084)
- Relands #87025
- disables failing tests related to https://github.com/pytorch/torchdynamo/issues/1697
- Reverts d01eea6027

cc @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87084
Approved by: https://github.com/malfet, https://github.com/voznesenskym
2022-10-22 03:18:17 +00:00
5c4a2e679b fix docs push (#87498)
Push docs to a temp branch first, then push to the actual branch, to satisfy the CLA check in branch protections.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87498
Approved by: https://github.com/malfet
2022-10-21 22:53:35 +00:00
838b699e10 as_strided_scatter storage offset defaults to None not 0 (#87481)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87481
Approved by: https://github.com/bdhirsh
2022-10-21 20:12:40 +00:00
c55b332517 Delete unused static runtime experiment (#87473)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87473
Approved by: https://github.com/anijain2305
2022-10-21 20:03:24 +00:00
dfc65f43f9 Delete unused ts experiment (#87472)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87472
Approved by: https://github.com/anijain2305
2022-10-21 20:03:24 +00:00
7baf4b1969 Delete unused ltc experiments (#87471)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87471
Approved by: https://github.com/anijain2305
2022-10-21 20:03:22 +00:00
62d30f5a8a Remove unused cold_start experiment (#87470)
- this `--cold_start` experiment didn't end up being used
- there is a new `--cold_start_latency` flag that is used
- this experiment was only hooked up for nvfuser anyway

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87470
Approved by: https://github.com/anijain2305
2022-10-21 20:00:05 +00:00
ee231671c0 Make torchbench setup a function (#87469)
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87469
Approved by: https://github.com/anijain2305
2022-10-21 19:58:38 +00:00
169ec120ef [Modes] refactor modes to only use a stack in cpp (#86458)
Refactors the mode code to only have the C++ mode stack and not the "C++ mode" like we originally had. This also simplifies the mode logic in a number of places
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86458
Approved by: https://github.com/zou3519
2022-10-21 19:18:23 +00:00
13cad7e120 [BE] Remove pip and conda installation in Linux build workflow (#87256)
All the dependencies should come from the Docker container already. This only updates Linux build workflow, Linux test workflow comes later in a separate PR.

The `opt-einsum` package that was installed as part of PyTorch wheel has already been installed in the Docker container [requirements-ci.txt](https://github.com/pytorch/pytorch/blob/master/.circleci/docker/requirements-ci.txt#L127)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87256
Approved by: https://github.com/malfet
2022-10-21 19:14:28 +00:00
620dbc43d8 Slowly introduce ops to be tested by test_numpy_ref on MPS backend (#87342)
Enable a test that would have caught https://github.com/pytorch/pytorch/issues/86239

Prior to the fix for that bug, this test fails with

```
_____________________________ TestCommonMPS.test_numpy_ref_mps_where_mps_float32 _____________________________
Traceback (most recent call last):
  File "/Users/alex/git/pytorch/test/test_ops.py", line 197, in test_numpy_ref_mps
    self.compare_with_reference(
  File "/Users/alex/git/pytorch/torch/testing/_internal/common_utils.py", line 2366, in compare_with_reference
    actual = torch_fn(t_inp, *t_args, **t_kwargs)
  File "/Users/alex/git/pytorch/torch/testing/_internal/opinfo/core.py", line 1068, in __call__
    return self.op(*args, **kwargs)
  File "/Users/alex/git/pytorch/torch/testing/_internal/common_methods_invocations.py", line 15167, in <lambda>
    op=lambda self, condition, other: torch.where(condition, self, other),
RuntimeError: 0'th index 3 of x tensor does not match the other tensors
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87342
Approved by: https://github.com/albanD
2022-10-21 19:03:00 +00:00
7bd04fb09f [1/N][C10D] Add a customized ScubaLogHandler implementation for internal FB use (#86699) (#87123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86699

This diff does the following:
1. **c10d_error_logger.py**: Add an API  to create a logger with a specific logging handler based on the destination.
2. The API from above would get a logging handler based on the destination provided.
-  **caffe2/torch/distributed/logging_handlers.py**: For OSS, we simply use a NullHandler() for now.
3. Add associated test files for 1 and 2.

Test Plan:
## Unit Test
```
buck test @//mode/dev-nosan //caffe2/test/distributed:test_c10d_error_logger -- --print-passing-details
```
```
File changed: fbcode//caffe2/test/distributed/test_c10d_error_logger.py
File changed: fbsource//xplat/caffe2/test/distributed/TARGETS
9 additional file changes
waiting for all tests to finish...
✓ Listing success: caffe2/test/distributed:test_c10d_error_logger (0.2s)
Found 1 tests
✓ Pass: caffe2/test/distributed:test_c10d_error_logger - test_get_or_create_logger (caffe2.test.distributed.test_c10d_error_logger.C10dErrorLoggerTest) (0.2s)

stdout:

stderr:

Buck UI:      https://www.internalfb.com/buck2/b975f6b0-77e9-4287-8722-f95b48036181
Test Session: https://www.internalfb.com/intern/testinfra/testrun/1407375150206593
RE: reSessionID-4d7ab8ca-1051-48e9-a5a8-6edbe15d1fe4  Up: 124 B  Down: 0 B
Jobs completed: 5. Time elapsed: 3.5s.
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. 0 builds failed
```

Differential Revision: D39920391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87123
Approved by: https://github.com/fduwjj, https://github.com/H-Huang
2022-10-21 18:45:38 +00:00
100beb2099 Only label checks against pull requests (#87488)
When a commit is triggered via any mechanism other than a pull request, there will not be a PR to check labels for.

The job will fail with the error:
```
2022-10-21T17:50:53.2938592Z + python3 .github/scripts/check_labels.py ''
2022-10-21T17:50:53.4758863Z usage: Check PR labels [-h] pr_num
2022-10-21T17:50:53.4759337Z Check PR labels: error: argument pr_num: invalid int value: ''
```

Instead, we should limit the workflow to only run on pull requests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87488
Approved by: https://github.com/huydhn
2022-10-21 18:15:40 +00:00
2a6079d588 fix for dynamo xml reporting (#87378)
Dynamo tests call a helper function in torch/_dynamo/test_case.py, which then calls run_tests in common_utils.py, so the test report path looked something like /opt/conda/lib/python3.10/site-packages/torch/_dynamo/test_case.

* instead of using the frame, use argv[0], which should be the invoking file
* got rid of sanitizing the functorch test name because those tests have been moved into the test folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87378
Approved by: https://github.com/huydhn
2022-10-21 18:13:56 +00:00
6e1764d806 ci: Allow nvidia-smi to continue with non-0 exit (#87464)
Allows nvidia-smi to return a non-0 exit status like status 14 since
status 14 is a warning and doesn't affect actual execution

see https://github.com/NVIDIA/gpu-operator/issues/285

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87464
Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/ZainRizvi
2022-10-21 18:07:16 +00:00
9ad1659b17 functionalization: make view_copy outputs always contiguous (#85747)
This fixes an issue with mobile: The output of view_copy ops should always be contiguous.

Later, we can consider adding optional arguments to the `view_copy()` functions to let you explicitly say what the contiguity of the output can be (e.g. channels_last)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85747
Approved by: https://github.com/ezyang
2022-10-21 17:42:02 +00:00
294bfb8e80 Create workflow to make sure PRs have valid labels (#86829)
### Context
When a dev submits a PR against the repo, we want to validate that they applied two labels to the PR, corresponding to the module they edited and the kind of change they're making.

### Change
Extended the open source workflow CI to add a validation to ensure that the PR being checked has the required labels on it.  If it doesn't, the check fails and a bot will post a message on the PR with instructions on what labels the developer needs to add (https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work).

### Impact
Every time a new version of PyTorch is released, we want to compile all the changes made to each module. However, when devs forget to tag their PR, compiling the changes to write the release notes becomes a burdensome process (only ~20% of PRs are currently labeled appropriately, which means it can take up to 40 hours to compile release notes). With this new validation, the hope is that most PRs are labeled accordingly for more timely release notes compilation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86829
Approved by: https://github.com/ZainRizvi
2022-10-21 17:39:29 +00:00
fbcd4fe2d2 Skip auto request review on forked PR (#87482)
Addresses the comment in https://github.com/pytorch/pytorch/pull/87409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87482
Approved by: https://github.com/albanD
2022-10-21 17:39:01 +00:00
5b7f027d91 Remove redundant zeroing in col2im/im2col (#87375)
All of the kernels already either start by zeroing the output, or are
careful in their implementation to write values to every output
location. So, these `zero_` calls should be redundant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87375
Approved by: https://github.com/albanD
2022-10-21 17:32:15 +00:00
4fc72b0f4e Grammatical update of the tech docs. (#87357)
Fixes #ISSUE_NUMBER
A more appropriate and correct word.
![grammatical correction](https://user-images.githubusercontent.com/25278471/196927273-7e4c0c9b-96a6-43d1-9b10-17b40665feed.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87357
Approved by: https://github.com/albanD
2022-10-21 17:30:20 +00:00
6efdcb0788 Add dynamo smoke test (#87400)
https://github.com/pytorch/torchdynamo/issues/1733

Move the old smoke test over from the old dynamo repo.

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87400
Approved by: https://github.com/msaroufim
2022-10-21 17:30:14 +00:00
db83a0578c [inductor] force 'fork' method for processes, cleanup (#87411)
To cooperate with other multithreading methods, this
forces the process pool to use 'fork' even if others have set it
differently. We require fork because otherwise the caller's script would need
an `if __name__ == "__main__":` guard, which we do not control as a library.

Furthermore this adds code to cleanup worker processes if
the parent exits abnormally (e.g. segfault). Previously we would leave
live but inactive workers around.
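
A rough sketch of the general pattern (not inductor's exact code; the worker count and task are placeholders):
```python
# Force the "fork" start method for a worker pool, so library users do not
# need an `if __name__ == "__main__":` guard the way "spawn" would require.
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

ctx = multiprocessing.get_context("fork")
with ProcessPoolExecutor(max_workers=4, mp_context=ctx) as pool:
    print(list(pool.map(abs, [-1, -2, -3])))
```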

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87411
Approved by: https://github.com/soumith, https://github.com/anijain2305
2022-10-21 17:06:56 +00:00
96691865b9 [dynamo] Unify raise_on_* config to suppress_errors and raise by default (#87440)
I noticed that a lot of bugs are being suppressed by torchdynamo's default
error suppression, and worse yet, there's no way to unsuppress them.  After
discussion with voz and soumith, we decided that we will unify error suppression
into a single option (suppress_errors) and default suppression to False.

If your model used to work and no longer works, try TORCHDYNAMO_SUPPRESS_ERRORS=1
to bring back the old suppression behavior.
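
In code, the switch described above looks like this (names taken from this PR's description):
```python
# Hedged sketch: compilation errors now propagate by default; the old
# suppression behavior can be restored per-process or via the environment.
import torch._dynamo as dynamo

dynamo.config.suppress_errors = True  # equivalent to TORCHDYNAMO_SUPPRESS_ERRORS=1
```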

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87440
Approved by: https://github.com/voznesenskym, https://github.com/albanD
2022-10-21 17:03:29 +00:00
1133682c46 [FSDP][2/N] Fix grad zero vs. None edge case (#87308)
Some original parameters corresponding to one `FlatParameter` may have `None` gradient while others do not. In that case, the `flat_param.grad` must be non-`None`. However, FSDP should take care to expose the original parameters' gradients regardless. To achieve this, we track a `_is_grad_none` mask over the parameters' gradients.
- `_is_grad_none` is initialized to `False` for all.
- `_is_grad_none[i]` is set to `True` when writing zeros in place of `None` when writing back the `i`th gradient.
- `_is_grad_none[i]` is set to `False` via `_reset_is_grad_none()`, which should be called in the post-backward. See the docstring for details.
- `_is_grad_none[i]` must be `False` in order to set `param.grad` to be a view into `flat_param.grad`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87308
Approved by: https://github.com/zhaojuanmao
2022-10-21 17:01:24 +00:00
4ee13a5925 [FSDP][1/N] Update summon_full_params(with_grads) None gradient (#87314)
This PR changes `summon_full_params(with_grads=True)`'s behavior to be such that if all ranks have `flat_param.grad = None`, then the original parameters will correctly have `orig_param.grad = None`. This is achieved with a preliminary all-reduce. Note that if a particular original parameter's gradient is `None` on all of the containing ranks, but not all ranks' `flat_param.grad = None`, then that particular gradient is still going to be set to zeros. This can be handled if desired in follow-up work.
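
A hedged usage sketch (`fsdp_model` is assumed to be an FSDP instance whose backward has already run; `with_grads` is the parameter this PR series adds):
```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# With this change, an original parameter whose flat_param.grad is None on
# all ranks is exposed with grad=None rather than a tensor of zeros.
with FSDP.summon_full_params(fsdp_model, with_grads=True):
    for name, param in fsdp_model.named_parameters():
        print(name, None if param.grad is None else tuple(param.grad.shape))
```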
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87314
Approved by: https://github.com/zhaojuanmao
2022-10-21 17:01:23 +00:00
4caddac534 [quant][api] Add assert for backend in get_default_qconfig related apis (#86259) (#87331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86259

Add assertion to make sure backend is one of "fbgemm", "x86", "qnnpack" and "onednn"
for get_default_qconfig, get_default_qat_qconfig, get_default_qconfig_mapping and get_default_qat_qconfig_mapping
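
A small sketch of the resulting behavior (hedged; the invalid backend string is made up):
```python
from torch.ao.quantization import get_default_qconfig

qconfig = get_default_qconfig("fbgemm")  # supported backend: returns a QConfig
get_default_qconfig("not_a_backend")     # now fails the new assertion
```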

Test Plan:
python test/test_quantization.py -k test_get_default_qconfig_mapping

Imported from OSS

Reviewed By: jcaip

Differential Revision: D40236474

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87331
Approved by: https://github.com/andrewor14
2022-10-21 16:57:35 +00:00
4cc5d6644f [FSDP][6/N] Remove FPW! (#87114)
This PR simply deletes `flatten_params_wrapper.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87114
Approved by: https://github.com/zhaojuanmao
2022-10-21 16:56:32 +00:00
f8dd27420b [FSDP][5/N] Update FlatParamHandle after FPW deprecation (#87113)
This PR resolves a TODO left in `FlatParamHandle` that was conditional on deprecating `FlattenParamsWrapper`. We simply pass in the process group into the `FlatParamHandle` constructor instead of later in `shard()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87113
Approved by: https://github.com/zhaojuanmao
2022-10-21 16:56:32 +00:00
214d51756a [FSDP][4/N] Rework FPW test to not use FPW (#87112)
Testing coverage is pretty much preserved except that we do not test on CPU, which is not a tangible loss for FSDP anyway.

I renamed a few tests slightly, and I moved some helpers to be immediately below the corresponding test method. This makes it a bit easier to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87112
Approved by: https://github.com/zhaojuanmao
2022-10-21 16:56:29 +00:00
277e37f945 [FSDP][3/N] Register flat_param to wrapped module (#87086)
This PR registers each `FlatParameter` to the wrapped module, eliminating `FlattenParamsWrapper` usage completely from FSDP.

Registering each `FlatParameter` to the wrapped module is preferred over registering to the `FullyShardedDataParallel` instance for both functional-like and non-recursive wrapping. It simplifies the `FlatParameter` naming to be a function of the number of `FlatParameter`s per wrapped module instead of the number of `FlatParameter`s per FSDP instance. For now, we assume 1 `FlatParameter` per wrapped module, so we can simply use a single name `FLAT_PARAM = _flat_param`.

From an implementation perspective, we raise some methods from `FlattenParamsWrapper` directly up to `FullyShardedDataParallel`. There will need to be further refactoring for functional-like and non-recursive wrapping. For example, the property `self._has_params -> bool` may need to change to a method `self._has_params(wrapped_module) -> bool`. Such changes are out of scope for this PR and will be done in follow-ups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87086
Approved by: https://github.com/zhaojuanmao
2022-10-21 16:56:26 +00:00
9f8ef8eaff [FSDP][2/N] Remove _fsdp_wrapped_module.flat_param (#86122)
This removes **direct** usages of `_fsdp_wrapped_module.flat_param` with `_handles[0].flat_param`. The preferred way to access the `flat_param` will be through the handle. We may converge to only storing `self._handles` and no longer `self.params` in the future. Right now, `self.params` is always exactly `[handle.flat_param for handle in self._handles]`.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86122
Approved by: https://github.com/zhaojuanmao
2022-10-21 16:56:24 +00:00
ce0c6e828e Reland "add an API for external backends to register custom device names (#86992)" (#87453)
Re-land of https://github.com/pytorch/pytorch/pull/86992

This reverts commit a895af92506f206889610251624590798d0deabd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87453
Approved by: https://github.com/ezyang, https://github.com/albanD
2022-10-21 16:51:36 +00:00
70c46d32e2 Fix input dimension issue in RNN, LSTM, GRU error message (#87442)
Fixes #86576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87442
Approved by: https://github.com/albanD
2022-10-21 16:28:32 +00:00
0c1dec375f Revert "Back out "Revert D40198461: [pytorch][PR] Backport currently dont work with some models if:" (#87124)"
This reverts commit a42fbfa0cb467b582799a5132561c82a3d33b1b7.

Reverted https://github.com/pytorch/pytorch/pull/87124 on behalf of https://github.com/ZainRizvi due to This is causing periodic jobs to fail
2022-10-21 16:03:00 +00:00
d73d4aa7de Audit for error prone isinstance int/float and add lint (#87345)
We recently fixed a bug on symbolic-shapes branch where
an isinstance(x, int) test failed when passed a SymIntNode.
To prevent this, I've added a lint for all the codepaths
where we may pass SymInt/SymFloat directly to reject
direct isinstance int/float tests, and instead use one of
the aliases.  The lint rule explains the options.  I then
go and fix all of them.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87345
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2022-10-21 15:55:24 +00:00
1285542f9b OpInfo: Add test that sample_inputs_func returns a generator (#84567)
This also includes a small list exception for single element lists since none of the memory usage or performance implications of lists apply there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84567
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-21 15:28:47 +00:00
aa8248cc9a Reenable isinstance with torch.distributed.ReduceOp (#87303)
tentatively marking as draft as I haven't gotten a comprehensive list of side effects...

Ref: https://stackoverflow.com/questions/40244413/python-static-class-attribute-of-the-class-itself
Rel: https://github.com/pytorch/pytorch/issues/87191

cc @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87303
Approved by: https://github.com/wanchaol
2022-10-21 15:05:36 +00:00
d37dc6f698 Make LazyGraphExecutor extensible (#87218)
Add `LazyGraphExecutor` to the backend interface so that it is extensible by a vendor backend.

I've made some preliminary methods virtual. Not sure if we want to make all methods in `LazyGraphExecutor` virtual.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87218
Approved by: https://github.com/wconstab, https://github.com/alanwaketan
2022-10-21 14:28:14 +00:00
d80a5f9a96 Fix typo under torch directory (#87274)
This PR fixes typo in .md files under torch directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87274
Approved by: https://github.com/albanD
2022-10-21 14:22:20 +00:00
ae62cf7c02 [MPS] Revamp copy_to_mps_ implementation (#86956)
A tensor's view into linear storage is represented by the following parameters: `.shape`, `.stride()` and `.storage_offset()`.

Only tensors that are representable as 1d views can be copied from host to device (and vice versa) using a single [`copy(from:sourceOffset:to:destinationOffset:size:)`](https://developer.apple.com/documentation/metal/mtlblitcommandencoder/1400767-copyfrombuffer?language=objc) call.

Modify `copy_to_mps_` function to do the following steps:
- Cast `src` tensor to dst data type if needed
- Expand `src` tensor to `dst` tensor shape
- Clone the `src` tensor if it is not stride-contiguous (i.e. cannot be represented by `src.view(src.numel())`)
- Create an empty tensor if `dst` is not stride-contiguous or if its strides differ from the (potentially cloned) `src` strides
- Do a 1d copy of `src` to the (potentially temporary) `dst`
- Finally, do re-striding/copy on MPS if needed

Add tests to cover the cases where a stride-contiguous permuted tensor is copied to MPS, where a non-stride-contiguous tensor is copied to MPS, and where a permuted CPU tensor is copied to a differently permuted MPS tensor
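
A hedged sketch of one such copy (requires a machine where the MPS backend is available):
```python
import torch

cpu = torch.arange(12.0).reshape(3, 4).t()  # permuted view, not stride-contiguous
mps = cpu.to("mps")                         # host -> device copy exercised above
assert torch.equal(mps.cpu(), cpu)
```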

Fixes https://github.com/pytorch/pytorch/issues/86954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86956
Approved by: https://github.com/kulinseth
2022-10-21 14:10:05 +00:00
435e78e523 [dynamo] [easy] RM spurious ) (#87439)
Fixes #ISSUE_NUMBER

cc @jansel @lezcano @fdrocha @mlazos @soumith @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87439
Approved by: https://github.com/msaroufim, https://github.com/soumith
2022-10-21 07:55:23 +00:00
ab901b4817 Python binding for dispatcher getAllOpNames (#87422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87422
Approved by: https://github.com/bdhirsh
2022-10-21 06:55:10 +00:00
7caeac1718 [inductor] Fix channels_last conv2d propagation when CuDNN is not found (#87266)
Fixes https://github.com/pytorch/torchdynamo/issues/1701

cc @jansel @lezcano @fdrocha @mlazos @voznesenskym @yanboliang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87266
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/voznesenskym
2022-10-21 06:36:16 +00:00
6b59d9b566 Fix registration hooks (#87369)
There is a bug in the implementation of the registration hooks introduced in https://github.com/pytorch/pytorch/pull/86148 whereby if the hook returns a tensor, then the short circuiting logic:
```
value = hook(self, name, value) or value
```
Raises an exception
```
RuntimeError: Boolean value of Tensor with more than one value is ambiguous
```
The fix changes the logic to only override the value when the hook's return value is not `None`.
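
A minimal sketch of the corrected pattern (not the actual `nn.Module` code; the helper name is made up):
```python
def apply_hooks(hooks, module, name, value):
    for hook in hooks:
        result = hook(module, name, value)
        if result is not None:  # explicit None check instead of `or value`
            value = result
    return value
```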

Fixes #85837

CC: @albanD @jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87369
Approved by: https://github.com/albanD
2022-10-21 05:12:25 +00:00
ff43288d31 [AOT][CUDAGraphs] torchdynamo -> torch._dynamo (#87243)
Fixes lingering issues from the torchdynamo -> torch._dynamo migration
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87243
Approved by: https://github.com/suo, https://github.com/voznesenskym, https://github.com/jansel
2022-10-21 03:14:28 +00:00
13ab819356 [functorch] fix AOTAutograd tutorial (#87415)
It was raising asserts previously
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87415
Approved by: https://github.com/Chillee
2022-10-21 01:53:24 +00:00
b1cf377cce Enable inductor CI for huggingface (#86792)
Summary: Unit tests will be enabled after they are fixed in trunk. TorchBench and TIMM need
more setup and are coming later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86792
Approved by: https://github.com/jansel, https://github.com/huydhn
2022-10-21 01:38:46 +00:00
9ba632253a [Inductor] Convert 0d CPU tensor to scalar during triton codegen (#87329)
This is a follow-up to address [this](https://github.com/pytorch/torchdynamo/pull/1284#pullrequestreview-1130319129). We revised the implementation to use the codegen approach to handle 0d CPU tensors, which no longer supports CUDA graphs.

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87329
Approved by: https://github.com/ngimel
2022-10-21 01:24:00 +00:00
961ebca225 Add weights_only option to torch.load (#86812)
This addresses the security issue in Python's default `unpickler` that allows arbitrary code execution while unpickling.
Restricts the classes allowed to be unpickled to `None`, `int`, `bool`, `str`, `float`, `list`, `tuple`, `dict`/`OrderedDict`, as well as `torch.Size`, `torch.nn.Parameter`, and `torch.Tensor`/`torch.Storage` variants.

`weights_only` defaults to `False`, but a global override to safe-only loading is available via the `TORCH_FORCE_WEIGHTS_ONLY_LOAD` environment variable.
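
Usage sketch (the checkpoint path is a placeholder):
```python
import torch

torch.save({"weight": torch.randn(2, 2)}, "ckpt.pt")
state = torch.load("ckpt.pt", weights_only=True)  # restricted unpickler; default stays False
```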

To some extent, addresses https://github.com/pytorch/pytorch/issues/52596
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86812
Approved by: https://github.com/ezyang
2022-10-21 01:09:50 +00:00
e3d73bbb07 Remove jansel/voz from dynamo CODEOWNERS (#87430)
Now that CC bot is working on PRs this is no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87430
Approved by: https://github.com/voznesenskym
2022-10-21 00:59:31 +00:00
bd1e95ce30 Improve the performance of validate_non_overlapping_shards_metadata (#85639)
`validate_non_overlapping_shards_metadata()` uses a quadratic algorithm to verify that shards do not overlap. However, in some cases (only one dimension is sharded), an O(n log n) algorithm can easily be implemented. This PR changes the implementation of `validate_non_overlapping_shards_metadata()` accordingly.
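
A hedged sketch of the single-sharded-dimension idea (not the actual implementation): sort the shards by offset along the sharded dimension, after which a shard can only overlap its immediate predecessor.
```python
def has_overlap_1d(shards):
    """shards: list of (offset, size) along the single sharded dimension."""
    shards = sorted(shards)  # O(n log n)
    for (prev_offset, prev_size), (offset, _) in zip(shards, shards[1:]):
        if offset < prev_offset + prev_size:
            return True
    return False

print(has_overlap_1d([(0, 4), (4, 4), (8, 4)]))  # False
print(has_overlap_1d([(0, 4), (3, 4)]))          # True
```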

Differential Revision: [D39681725](https://our.internmc.facebook.com/intern/diff/D39681725/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85639
Approved by: https://github.com/wanchaol
2022-10-20 23:51:48 +00:00
a42fbfa0cb Back out "Revert D40198461: [pytorch][PR] Backport currently dont work with some models if:" (#87124)
Summary:
reland after fixing windows build failure for OVR.

Notable change:
```
#if defined(FBCODE_CAFFE2) or defined(FB_XPLAT_BUILD)
```
changed to
```
#if defined(FBCODE_CAFFE2) || defined(FB_XPLAT_BUILD)
```
Apparently `-DFB_XPLAT_BUILD` wasn't getting picked up on Windows when using `or` to connect the conditions

Original commit changeset: 7a31fc4b455f

Original Phabricator Diff: D40198461

Test Plan: waitforsandcastle

Reviewed By: davidberard98, cccclai

Differential Revision: D40290932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87124
Approved by: https://github.com/gmagogsfm
2022-10-20 23:02:10 +00:00
f38a88c4dd Revert "[dynamo] use optimizers correctly in benchmarking (#87311)"
This reverts commit 703c19008df4700b6a522b0ae5c4b6d5ffc0906f.

Reverted https://github.com/pytorch/pytorch/pull/87311 on behalf of https://github.com/anijain2305 due to Bin (desertfire) is trying to get torchbench models in CI, and this PR prevents that. I will bring this back after models are in CI.
2022-10-20 22:01:51 +00:00
a91abedf0d [Inductor] TorchInductor tracing fx_graph.py should import overrides (#87271)
Running the generated script would fail if it contains ops like ```philox_rand_like```.

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87271
Approved by: https://github.com/jansel
2022-10-20 21:59:12 +00:00
1801b57cf6 set ci in mps (#87325)
dunno if installing xml runner like this is a good idea
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87325
Approved by: https://github.com/huydhn, https://github.com/malfet
2022-10-20 21:50:20 +00:00
f7da9db9c1 Unify decomp registries into global_decomposition_table (#86857)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86857
Approved by: https://github.com/ezyang
2022-10-20 21:29:05 +00:00
7e83f65ad5 Add General Project Policies (#87385)
Add General Project Policies to the Governance page

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87385
Approved by: https://github.com/orionr
2022-10-20 21:02:09 +00:00
17202b3637 [maskedtensor] fix docs formatting (#87387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87387
Approved by: https://github.com/cpuhrsch
2022-10-20 20:48:25 +00:00
bc8cf33244 add deprecation warning to nn stateless functional_call (#87367)
Same as the release version but just for master

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87367
Approved by: https://github.com/albanD, https://github.com/atalman
2022-10-20 20:16:49 +00:00
9b88dcf248 [ci] handle libomp upgrade on github (#87382)
like #86979, idk if this is a good idea but it seems to fix the problem
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87382
Approved by: https://github.com/seemethere
2022-10-20 19:40:59 +00:00
0826863962 [functorch][docs] Downgrade the warning about forward-mode AD coverage (#87383)
Previously we claimed that "forward-mode AD coverage is not that good".
We've since improved it so I clarified the statement in our docs and
downgraded the warning to a note.

Test Plan:
- view docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87383
Approved by: https://github.com/samdow
2022-10-20 18:51:13 +00:00
2fd008ed43 [dynamo] Add support for invoking nn sequential (#87156)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87156
Approved by: https://github.com/jansel
2022-10-20 18:14:40 +00:00
68e946b0c3 Fixed tune_layout to not do anything for non-2d convolutions (#87328)
cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87328
Approved by: https://github.com/ngimel
2022-10-20 18:02:51 +00:00
b805e1abef [functorch] Fix torch.cat batching rule (#86932)
The bug was discovered in https://github.com/pytorch/pytorch/pull/86842.

torch.cat has an edge case where it ignores all tensors of shape [0]. So
if any of the BatchedTensors have logical shape [0] but physical shape
[B, 0], then we coerce them to shape [0] by slicing them.

Why don't we just ignore those Tensors? We need to propagate
requires_grad-ness somehow (e.g. if the BatchedTensor wraps a Tensor of
shape [B, 0] that requires grad, then the output must require grad).
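
A hedged sketch of the edge case, written against torch.vmap (functorch.vmap at the time of this PR):
```python
import torch

B = 3
empty = torch.randn(B, 0, requires_grad=True)  # logical shape [0] per sample
data = torch.randn(B, 5)

# Each per-sample call is cat([<shape [0]>, <shape [5]>]); the [B, 0] input is
# sliced down to shape [0] so that requires_grad still propagates to the output.
out = torch.vmap(lambda e, d: torch.cat([e, d]))(empty, data)
print(out.shape)  # torch.Size([3, 5])
```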

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86932
Approved by: https://github.com/Chillee
2022-10-20 18:01:31 +00:00
c16b7b41f7 [Profiler][Trivial] Small style and safety fixes (#86752)
I noticed a couple abbreviations in the new optimizer capture code that are worth expanding. I also made the RawTensorMetadata a bit safer.

Differential Revision: [D40210702](https://our.internmc.facebook.com/intern/diff/D40210702/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86752
Approved by: https://github.com/slgong-fb, https://github.com/aaronenyeshi
2022-10-20 17:34:16 +00:00
1e4a274248 [dynamo] avoid popen.communicate() (#87335)
It seems like when popen.communicate() is used, it waits for all the
descendants of popen to close stdin/stderr. However, if we have
worker processes running in the child, and the child segfaults,
those processes will stay alive until someone waitpid's the child.
Since those grandchildren have open handles to the stdin/stderr pipe,
communicate() never returns.

This change just writes the output to temp files and directly calls
wait() on the child, which returns as soon as it dies.
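
A rough sketch of the pattern (not the exact compile-worker code; the child command is a placeholder):
```python
import subprocess
import sys
import tempfile

with tempfile.TemporaryFile() as out, tempfile.TemporaryFile() as err:
    proc = subprocess.Popen(
        [sys.executable, "-c", "print('hello from child')"],
        stdout=out,
        stderr=err,
    )
    returncode = proc.wait()  # returns as soon as the child itself exits
    out.seek(0)
    print(returncode, out.read().decode())
```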

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87335
Approved by: https://github.com/anijain2305, https://github.com/voznesenskym
2022-10-20 17:28:27 +00:00
75a5a46aa0 Retry sccache downloads (#87306)
This is meant to mitigate network flakiness like the one seen on [this build](https://github.com/pytorch/pytorch/actions/runs/3283124693/jobs/5407443872) which results in s3 refusing a connection and sccache failing to download

Adding the retry at the workflow level instead of the curl level since, per the job log, it doesn't seem like the curl command was retried at all. It's possible that the specific error code returned during "Connection refused" isn't one of the ones that gets retried, or that the retries don't show on the console and a longer delay between retries was needed.

Using the job level retry with a generous retry delay solves for both possibilities.

Sample error log:
```
Run sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
  sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
  sudo chmod +x /usr/local/bin/sccache
  echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}"
  echo "SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${GITHUB_ENV}"
  shell: /bin/bash -e {0}
  env:
    AWS_ACCESS_KEY_ID: ***
    AWS_SECRET_ACCESS_KEY: ***
    BUILD_ENVIRONMENT: macos-12-py3-x86-64
    DEVELOPER_DIR: /Applications/Xcode_13.3.1.app/Contents/Developer
    CONDA_ENV: /Users/runner/work/_temp/conda_environment_3283124693
    CONDA_RUN: conda run -p /Users/runner/work/_temp/conda_environment_3283124693 --no-capture-output
    CONDA_BUILD: conda run -p /Users/runner/work/_temp/conda_environment_3283124693 conda-build
    CONDA_INSTALL: conda install -p /Users/runner/work/_temp/conda_environment_3283124693
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (7) Failed to connect to s3.amazonaws.com port 443 after 86 ms: Connection refused
Error: Process completed with exit code 7.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87306
Approved by: https://github.com/seemethere
2022-10-20 17:16:45 +00:00
4b757f4633 Assert if padding mask type is unexpected (#86353) (#87106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86353

Fix the issue described in
https://github.com/pytorch/pytorch/issues/86120

Test Plan: buck test mode/opt caffe2/test:test_transformers -- test_train_with_long_type_pad

Differential Revision: D40129968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87106
Approved by: https://github.com/malfet
2022-10-20 16:01:54 +00:00
38543d8da0 [torch] Add fmsub to vectorization primitives (#86568)
Summary: Add fmsub, which is similar to fmadd

Test Plan: CI

Differential Revision: D40215267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86568
Approved by: https://github.com/ajtulloch, https://github.com/malfet
2022-10-20 15:10:44 +00:00
a895af9250 Revert "add an API for external backends to register custom device names (#86992)"
This reverts commit fb6826bfd82660aa905459f894c81d97d143dd2c.

Reverted https://github.com/pytorch/pytorch/pull/86992 on behalf of https://github.com/jeanschmidt due to breaking internal builds - D40534212 - arstudio-windows-tests-landcastle-0
2022-10-20 14:51:08 +00:00
9199f9188c Add inplace function testing to test_proxy_tensor (#87324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87324
Approved by: https://github.com/ezyang
2022-10-20 14:20:19 +00:00
254b681dc6 Convert torch.Size() argument to sym size in test_proxy_tensor (#87304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87304
Approved by: https://github.com/ezyang
2022-10-20 14:20:19 +00:00
9bd6ea5d76 Add meta inplace testing (#87291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87291
Approved by: https://github.com/ezyang
2022-10-20 14:20:16 +00:00
2e08ac8696 Add randint OpInfo (#87231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87231
Approved by: https://github.com/ezyang
2022-10-20 14:20:12 +00:00
8b704eddcd Update the pinned triton hash (#87300)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87300
Approved by: https://github.com/jansel
2022-10-20 14:15:48 +00:00
c4cf701889 Revert "[complex] conv_transpose2d (#81805)"
This reverts commit 528dd05108cdac6726748c34e385b5c3136256df.

Reverted https://github.com/pytorch/pytorch/pull/81805 on behalf of https://github.com/jeanschmidt due to Breaking internal builds - D40534110 - android-java-tests-0
2022-10-20 13:44:15 +00:00
05ad7bd743 Revert "Advance nightly docker to 11.6 (#86941)"
This reverts commit c5de535bc0b785abbacfebddf660af4cd3b2a6a1.

Reverted https://github.com/pytorch/pytorch/pull/86941 on behalf of https://github.com/atalman due to Workflow is passing but installs CUDA 11.3 PyTorch rather than 11.6
2022-10-20 13:17:11 +00:00
1b8af28fe8 [primTorch] Add refs for softmax, softmin, log_softmax (#84956)
cc @ezyang @mruberry @ngimel @Lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84956
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-20 12:29:04 +00:00
703c19008d [dynamo] use optimizers correctly in benchmarking (#87311)
We were not setting optimizers correctly

* This hid the issue that we see here - https://github.com/pytorch/torchdynamo/issues/1687
* This has also revealed that we are activating profilers for every dynamo optimized model call. This could affect speedup

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87311
Approved by: https://github.com/mlazos, https://github.com/yanboliang
2022-10-20 05:46:25 +00:00
8349bf1cd1 Added special printing to FloorDiv so it's printed out with // instead of as a name (#87263)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87263
Approved by: https://github.com/ezyang
2022-10-20 05:06:22 +00:00
b90db4a78f [DataPipe] Fix type checking to accept both Iter and Map DataPipe (#87285)
Fixes https://github.com/pytorch/data/issues/841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87285
Approved by: https://github.com/NivekT
2022-10-20 05:05:56 +00:00
d94e33f041 Add support for .to() for NestedTensor backends (#87146)
Summary: This commit adds support for moving NestedTensors from CPU to GPU and back. The implementation requires implementing empty_like(), which is based on PR #83140.

Test Plan: Added a new unit test based on the unit test for the main .to() implementation. All unit tests must pass, as well as every sandcastle job.

Differential Revision: D40437585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87146
Approved by: https://github.com/drisspg
2022-10-20 03:46:50 +00:00
472bdb3aa8 [vision hash update] update the pinned vision hash (#87339)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87339
Approved by: https://github.com/pytorchbot
2022-10-20 03:45:18 +00:00
c18eead2df Update saved variable hooks to no longer trigger on wrapped numbers (#87316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87316
Approved by: https://github.com/ezyang, https://github.com/albanD
2022-10-20 03:01:11 +00:00
0cae309069 [Quant] Add get_symmetric_qnnpack_qconfig_mapping (#87002)
Summary: Today, in order to get XNNPACK quantized ops to work,
the user must write some code that refers to private data
structures (`_FIXED_QPARAMS_OP_TO_OBSERVER`) to create a
QConfigMapping that is compatible with the symmetric constraints
in the QNNPACK BackendConfig. This is because
`get_default_qconfig("qnnpack")` produces a QConfig that does
not satisfy these constraints, and the default QConfigMapping
for QNNPACK uses this Qconfig.

Instead, we simply put this code into a helper function to make
it easier for the user to run XNNPACK quantized ops. In the
future, once there is feature parity between the set of ops
supported by QNNPACK and XNNPACK, we should revisit whether
to simply change `get_default_qconfig("qnnpack")` to return
an XNNPACK-compatible QConfig.
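
For illustration, a minimal usage sketch of the helper described above. The import path and the prepare/convert flow are assumptions based on this description (the PR may also require passing the QNNPACK BackendConfig explicitly), so treat this as a sketch rather than the verified API:
```python
import torch
from torch.ao.quantization import quantize_fx
# Assumed import path for the new helper; check torch.ao.quantization for the
# canonical location in your build.
from torch.ao.quantization.qconfig_mapping import get_symmetric_qnnpack_qconfig_mapping

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 4),)

# QConfigMapping compatible with the symmetric constraints in the QNNPACK BackendConfig.
qconfig_mapping = get_symmetric_qnnpack_qconfig_mapping()
prepared = quantize_fx.prepare_fx(model, qconfig_mapping, example_inputs)
prepared(*example_inputs)                      # calibrate on representative data
quantized = quantize_fx.convert_fx(prepared)
```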

Test Plan:

python test/test_quantization.py
TestQuantizeFx.test_symmetric_qnnpack_qconfig_mapping

Reviewers: jerryzh168, vkuzo

Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87002
Approved by: https://github.com/vkuzo
2022-10-20 02:33:15 +00:00
e6bc8f415b [BE] Move conda cmake installation to Docker (#87309)
This is part of the effort to consolidate pip and conda installation in the CI to improve our CI reliability. This moves conda cmake installation to Docker in those use cases that require it:

* Ubuntu bionic and focal

On the other hand:
* XLA doesn't seem to need conda cmake anymore (builds and tests successfully)
* CentOS is not used anywhere in the CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87309
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2022-10-20 02:13:11 +00:00
0d2c2110f1 [allocator] Introduce the abstract class CUDACachingAllocator (#87251)
This replaces the manual function pointers, making it easier to write
new drop-in allocators.

Note that most allocation goes through the Allocator interface, which
CUDAAllocator inherits from, and this arrangement avoids adding an
additional layer of dispatch along this pathway compared to what existed before.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87251
Approved by: https://github.com/wconstab
2022-10-20 01:17:00 +00:00
888e15408e Fix wrong lintrunner version (#87295)
The syntax is invalid for pip.  I missed this a while back:

```
Run pip install -r .github/requirements-gha-cache.txt
ERROR: Invalid requirement: 'lintrunner=0.9.2' (from line 11 of .github/requirements-gha-cache.txt)
Hint: = is not a valid operator. Did you mean == ?
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87295
Approved by: https://github.com/ZainRizvi
2022-10-20 01:04:42 +00:00
bd757b364c Ensure that symbolic variables incorporate fresh constraints before they're used (#87254)
cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87254
Approved by: https://github.com/jansel
2022-10-20 00:37:40 +00:00
bcde75427e run torch::deploy test using pip install (#86507)
This PR runs the unit tests for [multipy](https://github.com/pytorch/multipy) in pytorch core such that we are able to make sure changes in core do not break multipy as adding `_prims` did.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86507
Approved by: https://github.com/anirbanr-fb-r2p, https://github.com/d4l3k
2022-10-20 00:15:45 +00:00
07bd053a7e [rpc] Wrap exception creation with try/catch (#87224)
Sometimes we cannot recreate the exception from only a string (for example, if it is a custom exception type). The ideal situation would be to carry over all details on how to recreate the remote end's exception and throw that on the client, but for now we raise a RuntimeError with the original error message when we cannot reconstruct it.
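
As a rough illustration of the fallback described above (a hand-written sketch, not the actual torch.distributed.rpc internals):
```python
# Try to rebuild the remote exception type from its message; if the type cannot
# be constructed from a single string (e.g. a custom __init__ signature), fall
# back to a RuntimeError that preserves the original message.
def reconstruct_remote_exception(exc_type, msg):
    try:
        return exc_type(msg)
    except Exception:
        return RuntimeError(msg)
```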

Created from CodeHub with https://fburl.com/edit-in-codehub

Differential Revision: [D40353274](https://our.internmc.facebook.com/intern/diff/D40353274/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87224
Approved by: https://github.com/fduwjj
2022-10-20 00:02:24 +00:00
c97ffcff46 [discussion] fix for aot autograd outputs that dont require grad (#86838)
Fixes https://github.com/pytorch/functorch/issues/1052

I got here after some discussion with Alban. Today, if you aot_function() trace a program where some of its inputs have `requires_grad=True`, but some outputs are expected to have `requires_grad=False`, we will incorrectly set all outputs to have `requires_grad=True`.

A simple solution is to use autograd.Function's API for marking outputs as non-differentiable, based on what we witnessed when we traced the forward.
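
For reference, a hand-written sketch of the mechanism being referred to (`ctx.mark_non_differentiable`); this is not the aot_autograd code itself:
```python
import torch

class SplitOutputs(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        out_grad = x * 2      # participates in autograd
        out_nograd = x * 3    # should be treated as not requiring grad
        ctx.mark_non_differentiable(out_nograd)
        return out_grad, out_nograd

    @staticmethod
    def backward(ctx, grad_out, _grad_nograd):
        # Only the differentiable output contributes to the input gradient.
        return grad_out * 2

x = torch.randn(3, requires_grad=True)
a, b = SplitOutputs.apply(x)
print(a.requires_grad, b.requires_grad)  # True False
```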

This will make the `autograd.Function` that we return **wrong** if you created it using inputs that required grad and tried to re-use it with inputs that have a different `requires_grad` field. But as long as we're hiding behind dynamo, which should guard on requires_grad, we'll re-run `aot_function()` and get out a new compiled function that does the right thing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86838
Approved by: https://github.com/ezyang
2022-10-19 23:41:54 +00:00
c9b618447d Fix line numbers bug (#87247)
Fixes https://github.com/pytorch/torchdynamo/issues/1462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87247
Approved by: https://github.com/anijain2305, https://github.com/jansel
2022-10-19 22:44:01 +00:00
c8889f4e10 cuda._is_in_bad_fork->_C._cuda_isInBadFork (#87317)
The former is always available, while the latter is only available if PyTorch is compiled with CUDA. And if it is, then
```
$ python -c "import torch;print(torch._C._cuda_isInBadFork == torch.cuda._is_in_bad_fork)"
True
```

Fixes https://github.com/pytorch/torchdynamo/issues/1709 ( at least the symptom)

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87317
Approved by: https://github.com/voznesenskym, https://github.com/albanD, https://github.com/soumith, https://github.com/jansel
2022-10-19 22:15:28 +00:00
56b150ac63 [Dynamo] Support optimizing over any Tensor with requires_grad = True (#87141)
Fixes https://github.com/pytorch/torchdynamo/issues/1604

Re-submit for https://github.com/pytorch/torchdynamo/pull/1646
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87141
Approved by: https://github.com/jansel
2022-10-19 22:13:07 +00:00
12b2f70a89 Symintify pad ops (#87046)
Following comments below, we need to add support for `std::negate`/`std::min`/`std::max`/`operator-` for SymInt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87046
Approved by: https://github.com/ezyang
2022-10-19 21:43:08 +00:00
c5de535bc0 Advance nightly docker to 11.6 (#86941)
Fixes following:
https://github.com/pytorch/pytorch/actions/runs/3242695506/jobs/5316334351
crash in Docker builds introduced by: #82682

The PR seems to introduce some changes not compatible with CUDA 11.3, which is used by our Docker builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86941
Approved by: https://github.com/malfet
2022-10-19 21:26:55 +00:00
6eeeb88172 OpInfo: Sample input cleanup (4/n) (#86324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86324
Approved by: https://github.com/mruberry
2022-10-19 21:25:45 +00:00
c141f28b64 Fix compilation warning and spurious print (#87297)
Fixes a compilation warning, makes this warning an error, and removes a spurious print.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87297
Approved by: https://github.com/malfet
2022-10-19 20:56:37 +00:00
4a533f1215 Tweak several test serialization to store models state_dict (#87143)
Namely, change:
- `test_meta_serialization`
- `test_serialization_2gb_file`
- `test_pathlike_serialization`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87143
Approved by: https://github.com/ezyang
2022-10-19 20:51:32 +00:00
cf2be34ff5 [maskedtensor] add docs (#84887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84887
Approved by: https://github.com/cpuhrsch
2022-10-19 20:44:34 +00:00
cd21613526 Revert "[primTorch] Add refs for softmax, softmin, log_softmax (#84956)"
This reverts commit c09ca93e4733fdf0183433114dda2fc30a846700.

Reverted https://github.com/pytorch/pytorch/pull/84956 on behalf of https://github.com/ZainRizvi due to This is causing the MPS test test_output_match_log_softmax_with_dtype_cpu_float32 (__main__.TestConsistencyCPU) to fail
2022-10-19 20:36:55 +00:00
c08c799750 [FSDP] Add set_state_dict_type API to setup state_dict_type without using context manager (#86243)
FSDP.state_dict_type is a context manager. However, users may want to decide what state_dict type is going to be used during initialization. `set_state_dict_type` allows users to do so.

Differential Revision: [D40083670](https://our.internmc.facebook.com/intern/diff/D40083670/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86243
Approved by: https://github.com/rohan-varma
2022-10-19 19:46:18 +00:00
f3cc588d09 Revert "Dynamo FX graph stack traceback fix (#87136)"
This reverts commit 89e6078bc3d83b61e03511304ec42743b84df42e.

Reverted https://github.com/pytorch/pytorch/pull/87136 on behalf of https://github.com/clee2000 due to causing a lot of tests to fail on master even though pr is green
2022-10-19 18:57:24 +00:00
c09ca93e47 [primTorch] Add refs for softmax, softmin, log_softmax (#84956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84956
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-19 18:45:40 +00:00
00c91f4446 [allocator] disable tests that don't work for cudaMallocAsyncAllocator (#87250)
Two tests were failing locally for me and don't appear to be run in our CI.
Disabling them so we can otherwise refactor the allocators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87250
Approved by: https://github.com/wconstab
2022-10-19 18:29:35 +00:00
15ca68526c [functorch] Get rid of defunct functorch/setup.py (#87235)
We initially left it there for BC concerns.
- It has been more than a month since then,
- I have migrated folks who used the previous install command (pip
install ...pytorch.git@subdir=functorch) off of it

so it's time to get rid of it

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87235
Approved by: https://github.com/Chillee
2022-10-19 18:01:55 +00:00
ac80da2293 [functorch] add test for torch.manual_seed inside grad transform (#87233)
I can see this behavior regressing really easily, so adding a test for
it.

Test Plan:
- run test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87233
Approved by: https://github.com/Chillee
2022-10-19 18:01:55 +00:00
f56ce8dbad [allocator] Move getFreeMutex (#87237)
It isn't used by the allocators at all, and this change makes that clearer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87237
Approved by: https://github.com/wconstab
2022-10-19 18:00:40 +00:00
89e6078bc3 Dynamo FX graph stack traceback fix (#87136)
Migration from https://github.com/pytorch/torchdynamo/pull/1655.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87136
Approved by: https://github.com/voznesenskym
2022-10-19 17:15:43 +00:00
40d0fa5314 Reenable aot tests on windows for cuda 11.7 and up (#87193)
Reenable aot tests on windows for cuda 11.7 and up

Issue: https://github.com/pytorch/pytorch/issues/69460 seems to be mitigated in CUDA 11.7 hence re-enable this test

cc @peterjc123 @mszhanyi @skyline75489 @nbcsm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87193
Approved by: https://github.com/malfet
2022-10-19 17:09:37 +00:00
86a581928a Pin ios conda dependencies (#87229)
I also pin blas to 1.0 instead of the newer 2.116 available elsewhere (https://anaconda.org/conda-forge/blas)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87229
Approved by: https://github.com/izaitsevfb, https://github.com/ZainRizvi, https://github.com/malfet
2022-10-19 17:01:11 +00:00
a79e034d89 [MPS] Do not dispatch empty job in bitwise_not (#87286)
Follows the pattern from https://github.com/pytorch/pytorch/pull/85285 and returns before dispatching an empty Metal kernel for the bitwise not operation.

Fixes a crash when invoked with an empty MPS tensor on an AMD GPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87286
Approved by: https://github.com/kulinseth
2022-10-19 17:00:10 +00:00
6775c3e19d fix 0d cpu tensor handling when it's the first arg (#87273)
Fixes https://github.com/pytorch/torchdynamo/issues/1681
When at least one of the pointwise args is on cuda, set the device to cuda. We assume that cases of true device mismatch have already been weeded out during tracing, and what we have is 0d cpu tensor + cuda tensor interop.
Also fix a 0d tensor test that previously wasn't compiling with dynamo.

cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87273
Approved by: https://github.com/soumith, https://github.com/voznesenskym
2022-10-19 16:55:27 +00:00
fb6826bfd8 add an API for external backends to register custom device names (#86992)
This API adds some improvements to external backends who are building C++ backends out of tree using the `PrivateUse1` dispatch key.

The docs and linked examples go over the API in more detail, but you should be able to use it like:
```
# This should probably be in the __init__.py file of a external backend's python package
> torch.register_privateuse1_backend("foo")`
# And it will allow the user to do this:
> a = torch.ones(2, device="foo")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86992
Approved by: https://github.com/albanD
2022-10-19 16:44:17 +00:00
cc64863d71 Clean Inductor complication cache during dynamo dashboard run (#87246)
Implement improvement from https://github.com/pytorch/torchdynamo/issues/1644.

Tested by running `python benchmarks/dynamo/runner.py --print_run_commands --training` and inspecting the generated `run.sh` file for the `--cold_start_latency` flag, e.g.
```
python benchmarks/dynamo/torchbench.py --performance --float32 -dcuda --output=benchmark_logs/inductor_torchbench_float32_training_cuda_performance.csv --training --inductor   --no-skip --dashboard -x fambench_xlmr -x detectron2_fasterrcnn_r_50_c4 -x detectron2_fasterrcnn_r_50_dc5 -x detectron2_maskrcnn_r_101_fpn -x detectron2_maskrcnn_r_50_fpn -x detectron2_fasterrcnn_r_50_fpn -x detectron2_maskrcnn -x detectron2_fasterrcnn_r_101_dc5 -x opacus_cifar10 -x detectron2_maskrcnn_r_101_c4 -x pyhpc_turbulent_kinetic_energy -x maml -x detectron2_fasterrcnn_r_101_fpn -x pyhpc_equation_of_state -x detectron2_fasterrcnn_r_101_c4 -x pyhpc_isoneutral_mixing --cold_start_latency
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87246
Approved by: https://github.com/anijain2305, https://github.com/jansel
2022-10-19 16:39:12 +00:00
b3071e2eb6 functionalization: skip meta reference compute for aot autograd (#87108)
The context is that historically, XLA/LTC tensors haven't had accurate stride information, and functionalization would run "reference" meta kernels for view ops on the side to properly compute strides.

This is more complicated in symint tracing world - we have a `FunctionalTensorWrapper()` that wraps the underlying tensor and has its own set of sizes/strides metadata, but we never create proxy objects for the sizes/strides of the wrapper.

In symint tracing world with aot autograd, we're guaranteed that our underlying strides are accurate anyway, since aot autograd uses fake tensors to perform tracing. We encountered a few bugs with symints from the `FunctionalTensorWrapper` making their way into `__torch_dispatch__`. To side-step that area of bugs completely (and marginally improve perf), this PR disables the meta tensor tracing for non-XLA/LTC use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87108
Approved by: https://github.com/ezyang, https://github.com/wconstab
2022-10-19 15:59:28 +00:00
4801397b6e ban .sizes() and .strides() calls in derivatives.yaml (#86611)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86611
Approved by: https://github.com/wconstab, https://github.com/albanD
2022-10-19 15:59:28 +00:00
182ee87996 symintify nll loss fns (#86915) (#87095)
This reverts commit bbd7b38d5580c44ffb4404d431e07bc2316e59d5.

Reland https://github.com/pytorch/pytorch/pull/86915 with a fix for Python arg parser handling of SymInt and SymIntList.
This was uncovered because we are calling directly into python bindings code through test_autocast.py (`torch._C._nn.nll_loss`)  without providing a value for the optional symint arg (`ignore_index`). The arg parser constructs the  SymInt and SymIntList using the recorded "default_int" or "default_int_list" (schema string parsing) in case a value is not received for an optional argument. Since we weren't handling the symint case properly, the default_int just had a garbage value which was later being used to construct SymInt.

Follow up issue for other unhandled parameter types: https://github.com/pytorch/pytorch/issues/87283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87095
Approved by: https://github.com/ezyang, https://github.com/albanD
2022-10-19 14:50:51 +00:00
c6187ea326 add support for pin memory on xpu device (#86545)
add support for pin memory on xpu device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86545
Approved by: https://github.com/ezyang
2022-10-19 13:24:48 +00:00
528dd05108 [complex] conv_transpose2d (#81805)
Reference: https://github.com/pytorch/pytorch/issues/71108

Fixes : #86414
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81805
Approved by: https://github.com/anjali411
2022-10-19 09:12:27 +00:00
232fbd90ff [TorchDynamo]: fused bias for cpu convolution path (#87050)
For the aten.convolution CPU path, the bias can always be fused, so this PR adds a device check: if the inputs' device is CPU, we fuse it for better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87050
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-10-19 07:13:38 +00:00
5e23074f0d Fixed FakeTensor not calling CompositeImplicitAutograd decomps sometimes (#87252)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87252
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2022-10-19 07:13:30 +00:00
b5bdc34541 [inductor] Sympy compability fix (#87249)
Test Plan: github tests

Reviewed By: yf225, voznesenskym

Differential Revision: D40495411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87249
Approved by: https://github.com/ngimel, https://github.com/voznesenskym
2022-10-19 06:32:42 +00:00
6faa6c68e8 fsdp lazy_init typo (#87184)
Minor typo, changed with -> without
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87184
Approved by: https://github.com/awgu
2022-10-19 05:11:31 +00:00
2418ddb1ec Unified symbolic shape variables between Inductor and AOTDispatcher (#87161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87161
Approved by: https://github.com/jansel
2022-10-19 04:50:34 +00:00
48df4b7a1d [vision hash update] update the pinned vision hash (#87100)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87100
Approved by: https://github.com/pytorchbot
2022-10-19 04:12:55 +00:00
dfe3fc028c [CI] Add triton wheels build workflow (#87234)
Also, add `torchtriton` and `jinja2` as extra `dynamo` dependencies to PyTorch wheels.

Version packages as the first 10 characters of the pinned repo hash, and make the `torch[dynamo]` wheel depend on the exact version it was built against.

TODO: Automate uploading to nightly wheels storage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87234
Approved by: https://github.com/msaroufim
2022-10-19 03:35:16 +00:00
c413a32135 Release note script: match topics with spaces or underscores (#87011)
e.g. match "new features" in the category as "new_features"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87011
Approved by: https://github.com/albanD, https://github.com/soulitzer
2022-10-19 02:28:45 +00:00
c471c29fdc Update sdp guards for performance (#87241)
# Summary

Makes the contiguous check for the nt input more strict/correct as well as makes some performance improvements to the checks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87241
Approved by: https://github.com/cpuhrsch
2022-10-19 02:16:31 +00:00
6d0d7afe8d [GHA][BE] Delete unused macros from common.yml.j2 (#87253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87253
Approved by: https://github.com/huydhn
2022-10-19 02:11:54 +00:00
31e731e5ae [dynamo] fix logging (#87239)
Currently, setting `torch._dynamo.config.log_level` doesn't do anything,
as the module name has changed during the move.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87239
Approved by: https://github.com/jansel, https://github.com/soumith, https://github.com/mlazos
2022-10-19 01:43:11 +00:00
7ff1ca4e33 Add type annotation to get_worker_info (#87017)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87017
Approved by: https://github.com/ejguan, https://github.com/NivekT
2022-10-19 00:25:04 +00:00
4dc579838b Allow fx.Graph.owning_module to be used as attribute. (#86822)
Summary:
The current behavior of owning_module setter is difficult to understand: it changes the owning_module to None if owners is not 0 but increments the owners count. If the owning_module is None, the owners count should be 0 as none of them is accessible. On the other hand, if the owners count increases, the owning_module should be a collection (e.g. a list).

This diff changes owning_module to be a normal attribute. The semantics are that a graph can have **at most one** owning module and can be assigned to a new module.

The alternative is to use a list to represent the owning_modules of a graph but it breaks backward compatibility and the exact use cases of having multiple owning_modules are not clear.

Test Plan: Test with CI.

Differential Revision: D40200624

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86822
Approved by: https://github.com/tugsbayasgalan
2022-10-19 00:12:59 +00:00
3eb7429385 [Profiler][trivial] Add profiler options to trace metadata (#87102)
Summary: Add profiler options (`profile_memory`, `record_shapes`, `with_stack`, `with_modules`, and `with_flops`) to trace metadata

Test Plan: CI tests

Differential Revision: D40373514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87102
Approved by: https://github.com/aaronenyeshi
2022-10-19 00:00:10 +00:00
f6c6048b10 Use CUTLASS GEMM for NT bmm (#85894)
Copy of https://github.com/pytorch/pytorch/pull/85710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85894
Approved by: https://github.com/drisspg
2022-10-18 23:11:47 +00:00
80790ecee4 [einsum] Call view instead of sum to remediate MPS regression (#87135)
Fixes #87010.

It turns out that squeeze is much faster than sum, and view is faster than squeeze, so we should default to that whenever possible.

Benchmarking results show that, on MPS, we would be going from the following code taking **29.89ms instead of the current 1466ms, almost a 50x speedup**.
```
q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
torch.einsum('b i d, b j d -> b i j', q, k).max().item()
```
And a regular einsum will now take **.506ms instead of 2.76ms.**
```
q = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
k = torch.rand(16, 4096, 40, device='mps', dtype=torch.float)
torch.einsum('b i d, b j d -> b i j', q, k)
```

Special thanks to @soulitzer for helping me experiment + figure out how to squash the remaining 5x regression due to squeeze being slower than view!!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87135
Approved by: https://github.com/soulitzer, https://github.com/malfet, https://github.com/albanD
2022-10-18 23:01:28 +00:00
c4a03e4da1 [einsum] keep the promise that we contract left to right (#87199)
We promise that if path is not defined, we go left to right. The previous code did not keep that promise, as we pushed combined ops to the back of the list. For most use cases this is fine (einsum with 3 or fewer inputs), but we should do what we say.

Test plan:
Added a print statement to print the sizes of ops we're contracting to see if the order is fixed. Code run:
```
import torch
a = torch.rand(1)
b = torch.rand(2)
c = torch.rand(3)
d = torch.rand(4)
torch.einsum('a,b,c,d->abcd', a,b,c,d)
```

BEFORE--it does a+b, then c+d, then a+b+c+d, which...is right, but it's not the order specified by the user.
```
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 1, 1]and b: [1, 2, 1, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 3, 1]and b: [1, 1, 1, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 1, 1]and b: [1, 1, 3, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
```

WITH THIS CHANGE--it actually goes left to right: a+b, a+b+c, a+b+c+d
```
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 1, 1, 1]and b: [1, 2, 1, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 1, 1]and b: [1, 1, 3, 1] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
/Users/janeyx/pytorch/torch/functional.py:378: UserWarning: Contracting a: [1, 2, 3, 1]and b: [1, 1, 1, 4] (Triggered internally at /Users/janeyx/pytorch/aten/src/ATen/native/Linear.cpp:507.)
  return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87199
Approved by: https://github.com/soulitzer
2022-10-18 22:58:44 +00:00
d06d569e90 Update the sdp benchmark to work with nested tensors (#87215)
# Summary
Update the sdp benchmark to work with nested tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87215
Approved by: https://github.com/cpuhrsch
2022-10-18 21:38:45 +00:00
e8c4adf3c3 Add torch.sparse overview section (#85265)
The goal of this section is to provide a general overview of how PyTorch handles sparsity for readers who are already familiar with sparse matrices and their operators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85265
Approved by: https://github.com/jisaacso
2022-10-18 21:07:57 +00:00
31edccf6c7 Revert "Temporarily disable ios jobs (#87186)"
This reverts commit d29dc2b72a6cb5fb24ff3eacd816e08bd16298dc.

Reverted https://github.com/pytorch/pytorch/pull/87186 on behalf of https://github.com/huydhn due to Official conda channel is back and conda-forge has been reverted
2022-10-18 21:03:23 +00:00
223ad9bc9e [ci] remove circleci mac jobs (#87225)
Mac jobs are run on every PR after approval, so these are redundant.
iOS jobs can stay until the end of the year because they are on periodic and not run on every PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87225
Approved by: https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/janeyx99
2022-10-18 20:57:57 +00:00
9a786202b7 [ci] fix log printing (#87223)
idk how i missed this

example https://github.com/pytorch/pytorch/actions/runs/3275717751/jobs/5391093040
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87223
Approved by: https://github.com/malfet, https://github.com/kit1980, https://github.com/janeyx99
2022-10-18 20:57:27 +00:00
afa5086078 Revert "Install blas from conda-forge (#87150)"
This reverts commit f02f0e3ad1565e3da1e78efaa994e80c7577fd0c.

Reverted https://github.com/pytorch/pytorch/pull/87150 on behalf of https://github.com/huydhn due to Conda issue has been resolved upstream https://github.com/pytorch/pytorch/issues/87148
2022-10-18 20:54:06 +00:00
e7cefff058 [Kineto][Profiler] Guard event metadata python thread via verbose flag (#87096)
Summary: For trace files with Python tracing enabled, the field "python thread": 0 is repeated for every python_function event. This bloats the trace JSON size for a large number of events or deep call stacks. Instead, guard this metadata behind the verbose flag.

Test Plan: CI

Reviewed By: robieta, slgong-fb

Differential Revision: D40325815

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87096
Approved by: https://github.com/slgong-fb, https://github.com/robieta
2022-10-18 20:47:09 +00:00
c54bcea793 Improve complex_memory_overlap check for Inductor CUDA graph (#87177)
Point fix for https://github.com/pytorch/torchdynamo/issues/1620 to unblock internal models. Supersedes https://github.com/pytorch/pytorch/pull/87058.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87177
Approved by: https://github.com/ezyang
2022-10-18 20:26:33 +00:00
ef1844a151 [CI] Move sm86 tests from periodic to trunk (#87228)
This adds Ampere GPU testing to trunk CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87228
Approved by: https://github.com/jansel, https://github.com/huydhn
2022-10-18 20:05:45 +00:00
1dbc8ad3b7 Add Warning class and refactor C++ warnings to use it (#84101)
Also adds `TORCH_WARN_WITH` and `TORCH_WARN_DEPRECATION` macros

Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84101
Approved by: https://github.com/albanD
2022-10-18 20:02:42 +00:00
db65909255 [Docs] Update mm family ops and F.linear to note limited sparse support. (#86220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86220
Approved by: https://github.com/cpuhrsch
2022-10-18 19:55:18 +00:00
a73ca6f58c Revert "Improve readability of the extra message errors in assertEqual (#87202)"
This reverts commit 56c28ee32a78eb6f32a533d8fd64278cb9063016.

Reverted https://github.com/pytorch/pytorch/pull/87202 on behalf of https://github.com/malfet due to broke test_testing, see 56c28ee32a
2022-10-18 19:34:02 +00:00
e4285f09b9 [inductor] new way to compile f64 libdevice calls (#87189)
Porting over [torchdynamo/#1633](https://github.com/pytorch/torchdynamo/pull/1633)

`torch/_inductor/codegen/triton.py` now defines `libdevice_<function>` variants
of some functions. You can request dispatch to those for
float64 dtypes when using `register_pointwise` by setting
`use_libdevice_for_f64=True`.

Other minor changes:
    - In triton, sigmoid now codegens tl.sigmoid
    - silu now comes from decomp, not lowering
    - Some test skips no longer necessary, removed or made xfails

Switching to `tl.sigmoid` has exactly same performance.
Moving `silu` to decomp does not change anything, same triton code is generated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87189
Approved by: https://github.com/ngimel
2022-10-18 19:13:11 +00:00
c56be31d2e Upgrade oneDNN to v2.7 (#87061)
This PR is to upgrade oneDNN to v2.7.

### oneDNN v2.7 changes:

**Performance Optimizations**
- Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids).
- Introduced performance optimizations for [bf16 floating point math mode](http://oneapi-src.github.io/oneDNN/group_dnnl_api_mathmode.html) on Intel Xeon Scalable processors (code name Sapphire Rapids). The bf16 math mode allows oneDNN to use bf16 arithmetic and Intel AMX instructions in computations on fp32 data.

Please go to https://github.com/oneapi-src/oneDNN/releases/tag/v2.7 for more detailed changes.

### oneDNN v2.6.1 & 2.6.2 changes:

**Functionality**

- Updated ITT API to 3.22.5
- Fixed correctness issue in fp32 convolution implementation for cases with large spatial size (https://github.com/pytorch/pytorch/issues/84488)

### Performance Benchmark
Use TorchBench test in ICX with 40 cores
Intel OpenMP & tcmalloc were preloaded
![image](https://user-images.githubusercontent.com/61222868/196121957-656faebc-9f4a-49f0-9ef0-0784416c3a47.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87061
Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper, https://github.com/weiwangmeta
2022-10-18 19:07:58 +00:00
2485498294 [FSDP] Use all_gather_into_tensor() (#87077)
Let us silence some warnings 👍🏼
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87077
Approved by: https://github.com/rohan-varma
2022-10-18 18:54:23 +00:00
56c28ee32a Improve readability of the extra message errors in assertEqual (#87202)
Goes from (note the `linspace.default` is very difficult to find)
```
Mismatched elements: 15 / 50 (30.0%)
Greatest absolute difference: 1 at index (17,)
Greatest relative difference: 1.0 at index (17,) : linspace.default
args = (0, -3, 50)
kwargs = {'dtype': torch.int16, 'device': device(type='cpu'),
'pin_memory': False}
```
to
```
Mismatched elements: 15 / 50 (30.0%)
Greatest absolute difference: 1 at index (17,)
Greatest relative difference: 1.0 at index (17,)
linspace.default
args = (0, -3, 50)
kwargs = {'dtype': torch.int16, 'device': device(type='cpu'),
'pin_memory': False}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87202
Approved by: https://github.com/ezyang
2022-10-18 18:53:03 +00:00
48f0231223 Fix Scalar(bool) handling in toIValue (#87179)
At the moment, they were cast to `int64`, which breaks quite a few
casting rules, for example in `ops.aten`.

Quite a vintage bug, circa 2020.

With this fix, the following code prints `torch.bool`, rather than `torch.int64`.
```python
import torch
msk = torch.tensor([False])
b = torch.tensor([False])
print(torch.ops.aten.where.ScalarSelf(msk, True, b).dtype)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87179
Approved by: https://github.com/albanD
2022-10-18 18:53:03 +00:00
4540330f97 Revert "Use conda-forge in mac mps test (#87155)"
This reverts commit 74138a8daa93ec4cb08e4dd31c2773ec0c751d94.

Reverted https://github.com/pytorch/pytorch/pull/87155 on behalf of https://github.com/huydhn due to Conda issue has been resolved upstream https://github.com/pytorch/pytorch/issues/87148
2022-10-18 18:29:17 +00:00
adc7ee09dc Added upsample_nearest3d/1d lowering to inductor (#87158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87158
Approved by: https://github.com/ngimel
2022-10-18 18:27:56 +00:00
d7801a6042 Add voznesenskym to CODEOWNERS (#87227)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87227
Approved by: https://github.com/jansel
2022-10-18 18:24:13 +00:00
88b76ae9ea Store type(module) in the module stack (#87149)
- As requested by the quantization team, type(module) is now stored in the module stack.
- Consequently, as the module stack gets verbose, we skip printing it in gm.print_readable().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87149
Approved by: https://github.com/jerryzh168, https://github.com/jansel
2022-10-18 18:12:37 +00:00
d01eea6027 Do not run triton tests on sm86 (#87198)
As it's broken right now and nobody cares to fix it; see this test run for example: d36c284d14

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87198
Approved by: https://github.com/soumith, https://github.com/albanD
2022-10-18 17:19:52 +00:00
2b03a941f7 [dynamo] graph capture for calls to arbitrary self. methods on nn module (#87040)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87040
Approved by: https://github.com/jansel
2022-10-18 16:54:40 +00:00
09a967d6c9 Make nested TreeSpec printing nicer (#46538) (#86546)
1. Made TreeSpec into a dataclass.
2. In `__repr__`, recursively transformed TreeSpec into dictionaries and then pretty-printed it.
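
For context, the TreeSpec is the structure object returned by the pytree helpers; a minimal sketch of how one is obtained and printed, using the private `torch.utils._pytree` module:
```python
import torch.utils._pytree as pytree

values, spec = pytree.tree_flatten({"a": [1, 2], "b": (3,)})
print(values)  # [1, 2, 3]
print(spec)    # the TreeSpec whose __repr__ this PR makes easier to read
```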

Fixes #46538. Hi, @ezyang. this PR is for the TreeSpec `__repr__` refactor we discussed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86546
Approved by: https://github.com/ezyang
2022-10-18 16:50:39 +00:00
440f734169 [inductor] Minifier fixes (#87062)
Fixes https://github.com/pytorch/torchdynamo/issues/1690

This fixes the error seen in the minifiers. But does not repro the original issue that prompted the above issue.

Fx minifiers work at the level of Fx-graphs, and the original issue lies outside of the Fx graph and is only visible on the second iteration. Therefore, the original issue escapes the abstraction of our existing Fx-based minifiers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87062
Approved by: https://github.com/eellison
2022-10-18 15:53:55 +00:00
c30cfb07ab [dynamo][dashboard] Run 2 iterations for the correctness runs (#87104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87104
Approved by: https://github.com/soumith
2022-10-18 15:53:40 +00:00
d29dc2b72a Temporarily disable ios jobs (#87186)
While investigating segfault issue:

* https://app.circleci.com/pipelines/github/pytorch/pytorch/584349/workflows/6c68b0ce-023e-4f62-83bf-e77962daf8ad/jobs/17180595
* https://github.com/pytorch/pytorch/actions/runs/3269860268/jobs/5377851127

This might be related to the use of conda-forge in https://github.com/pytorch/pytorch/issues/87148, i.e. conda-forge pulls in different version of some dependencies and breaks thing.  If that's the case, we could not revert conda-forge change yet because the checksum issue hasn't been fixed upstream yet (Test PR https://github.com/pytorch/pytorch/pull/87185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87186
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2022-10-18 15:27:27 +00:00
ecd25df313 Add prototype warning to MaskedTensor constructor (#87107)
When a user constructs a MaskedTensor, we should signal its development status to set expectations.
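
A small illustration of the behavior being added (the import path is an assumption and may differ between releases):
```python
import torch
# Assumed import path for the prototype feature.
from torch.masked import masked_tensor

data = torch.tensor([1.0, 2.0, 3.0])
mask = torch.tensor([True, False, True])
mt = masked_tensor(data, mask)  # now emits a UserWarning flagging prototype status
```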
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87107
Approved by: https://github.com/bhosmer
2022-10-18 15:24:18 +00:00
240bba7ac8 add sym_int (#86916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86916
Approved by: https://github.com/ezyang
2022-10-18 14:54:34 +00:00
157310c85d [inductor][triton] if device is a torch.device, then make cuda_properties index it correctly (#87174)
Without this, I was running into obvious `KeyError`s that were assuming that the device was an integer when running `examples/imagenet`.

```python
(pytorch) soumith@bluebox:~/code/examples/imagenet$ python main.py --gpu 0 /home/soumith/dataset/imagenet
/home/soumith/code/vision/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
  warn(f"Failed to load image Python extension: {e}")
/home/soumith/code/examples/imagenet/main.py:100: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
  warnings.warn('You have chosen a specific GPU. This will completely '
Use GPU: 0 for training
=> creating model 'resnet18'
make_fallback(aten.unfold): a decomposition exists, we should switch to it
make_fallback(aten.unfold_backward): a decomposition exists, we should switch to it
Traceback (most recent call last):
  File "/home/soumith/code/pytorch/torch/_inductor/graph.py", line 254, in call_function
    return lowerings[target](*args, **kwargs)
  File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 202, in wrapped
    return decomp_fn(*args, **kwargs)
  File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 2994, in var_
    diffs = square(sub(x, mean(x, axis, keepdim=True)))
  File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 202, in wrapped
    return decomp_fn(*args, **kwargs)
  File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 2983, in mean
    sum_result = sum_(x, axis, keepdim)
  File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 202, in wrapped
    return decomp_fn(*args, **kwargs)
  File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 3211, in sum_
    return fn(x, axis, keepdims, dtype=dtype)
  File "/home/soumith/code/pytorch/torch/_inductor/lowering.py", line 2953, in inner
    result = Reduction.create(
  File "/home/soumith/code/pytorch/torch/_inductor/ir.py", line 714, in create
    hint, split = cls.num_splits(
  File "/home/soumith/code/pytorch/torch/_inductor/ir.py", line 454, in num_splits
    num_sm = get_device_properties(device).multi_processor_count
  File "/home/soumith/code/pytorch/torch/_inductor/cuda_properties.py", line 43, in get_device_properties
    return _properties()[_device(device)]
KeyError: device(type='cuda', index=0)
```
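
A minimal sketch of the kind of normalization the fix needs (illustrative only, with hypothetical names; not the actual patch):
```python
import torch

# Cache keys should be plain integer device indices, so a torch.device("cuda", 0)
# (or None) has to be reduced to an int before the lookup.
def to_device_index(device):
    if isinstance(device, torch.device):
        if device.index is not None:
            return device.index
        return torch.cuda.current_device()
    if device is None:
        return torch.cuda.current_device()
    return int(device)
```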
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87174
Approved by: https://github.com/yf225
2022-10-18 14:08:01 +00:00
dbccccb7a2 [BE] Get rid of deprecation warnings in workflows (take 3) (#87152)
- Per [deprecation announcement](https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/) replace `echo "::set-output name="` with echo to `${GITHUB_OUTPUT}` as shown in following [example](https://docs.github.com/en/actions/using-jobs/defining-outputs-for-jobs#example-defining-outputs-for-a-job)
- Update `actions/setup-python` from `v2` to `v4` to get rid of deprecated node version warning
- Update `actions/checkout-python` from `v2` to `v3` (and `silent-checkout` branch as well)
- Update `retry` action to 3e91a01664
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87152
Approved by: https://github.com/kit1980, https://github.com/izaitsevfb
2022-10-18 13:53:30 +00:00
9ac2a06acf istft: require complex input (#86628)
Real dtype input to `torch.istft` has been deprecated since PyTorch
1.8, so it is more than passed its due date to be removed.

BC-breaking message:

`torch.istft` no longer supports input in the form of real tensors
with shape `(..., 2)` to mimic complex tensors. Instead, convert
inputs to a complex tensor first before calling `torch.istft`.
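
A hedged migration example (shapes and parameters are illustrative):
```python
import torch

# Pack the old (..., 2) real/imag layout into a complex tensor before calling istft.
n_fft = 400
spec_real = torch.randn(n_fft // 2 + 1, 12, 2)         # old-style real-valued input
spec = torch.view_as_complex(spec_real.contiguous())   # required complex form
signal = torch.istft(spec, n_fft=n_fft, window=torch.hann_window(n_fft))
```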
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86628
Approved by: https://github.com/mruberry
2022-10-18 12:03:55 +00:00
b886cd15f5 [primTorch] Add a ref for NumPy-style T (#86850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86850
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-18 10:19:47 +00:00
f2ec9fbd03 torch.ormqr: backward support (#86800)
Seems good to have, especially when neither `a` nor `tau` requires grads and/or they are pretty small in number.
Fixes https://github.com/pytorch/pytorch/issues/86267
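
A small, hedged example of the newly supported gradient flow, where only `other` requires grad:
```python
import torch

A = torch.randn(5, 3, dtype=torch.float64)
other = torch.randn(5, 2, dtype=torch.float64, requires_grad=True)
a, tau = torch.geqrf(A)            # Householder representation of Q
out = torch.ormqr(a, tau, other)   # computes Q @ other
out.sum().backward()               # gradient w.r.t. `other` is now supported
print(other.grad.shape)            # torch.Size([5, 2])
```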

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86800
Approved by: https://github.com/lezcano
2022-10-18 09:07:35 +00:00
841995d53b [primTorch] Add refs for data conversion ops (#86561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86561
Approved by: https://github.com/lezcano, https://github.com/mruberry, https://github.com/zou3519
2022-10-18 08:38:51 +00:00
731b4bf0f1 Revert "Check all CUDA API calls in aten/src/ATen/test for errors (#74919) (#83556)"
This reverts commit a7ed398cf6bca767d93c6d81f3ecf4198e1b52e0.

Reverted https://github.com/pytorch/pytorch/pull/83556 on behalf of https://github.com/huydhn due to Sorry for revert your PR, but I think it breaks cuda tests a7ed398cf6.  This should not have been force merged
2022-10-18 08:14:15 +00:00
8b0cc9c752 [inductor] Fix copysign issue in old msvc build (#87117)
Should fix https://github.com/pytorch/pytorch/pull/87028#issuecomment-1281066036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87117
Approved by: https://github.com/DanilBaibak
2022-10-18 06:06:31 +00:00
11915b3196 Revert "[BE] Get rid of deprecation warnings in workflows (#87152)"
This reverts commit 9da032ecee8b0c7a5ce822bb4425af9208dc2fa1.

Reverted https://github.com/pytorch/pytorch/pull/87152 on behalf of https://github.com/malfet due to Regresses is_pr_labelled workflow again
2022-10-18 05:32:46 +00:00
d36c284d14 [triton] allow cuda properties to be queried from workers (#87101)
Fixes https://github.com/pytorch/pytorch/pull/87048 by saving the needed properties before fork.

Actually attempting to get CUDA to load in the workers is probably not desired: cuda initialization takes O(seconds). Having multiple processes using the same device will slow things down.

This just moves the needed properties from the main trainer process to the workers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87101
Approved by: https://github.com/soumith
2022-10-18 04:48:29 +00:00
9da032ecee [BE] Get rid of deprecation warnings in workflows (#87152)
- Per [deprecation announcement](https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/) replace `echo "::set-output name="` with echo to `${GITHUB_OUTPUT}` as shown in following [example](https://docs.github.com/en/actions/using-jobs/defining-outputs-for-jobs#example-defining-outputs-for-a-job)
- Update `actions/setup-python` from `v2` to `v4` to get rid of deprecated node version warning
- Update `actions/checkout-python` from `v2` to `v3` (and `silent-checkout` branch as well)
- Update `retry` action to 3e91a01664
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87152
Approved by: https://github.com/kit1980, https://github.com/izaitsevfb
2022-10-18 04:34:58 +00:00
66658e1da7 Revert "[BE] Get rid of deprecation warnings in workflows (#87152)"
This reverts commit acaf484f0a38f6a7becf342bb3492e1de09f64e1.

Reverted https://github.com/pytorch/pytorch/pull/87152 on behalf of https://github.com/malfet due to Regresses is_pr_labelled workflow
2022-10-18 04:14:01 +00:00
8ca7820e45 [Inductor] Lift the maximum depth of the Python interpreter stack to adapt large/deep models (#87130)
Partly fixes https://github.com/pytorch/torchdynamo/issues/1693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87130
Approved by: https://github.com/jansel
2022-10-18 03:46:01 +00:00
acaf484f0a [BE] Get rid of deprecation warnings in workflows (#87152)
- Per [deprecation announcement](https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/) replace `echo "::set-output name="` with echo to `${GITHUB_OUTPUT}` as shown in following [example](https://docs.github.com/en/actions/using-jobs/defining-outputs-for-jobs#example-defining-outputs-for-a-job)
- Update `actions/setup-python` from `v2` to `v4` to get rid of deprecated node version warning
- Update `actions/checkout-python` from `v2` to `v3` (and `silent-checkout` branch as well)
- Update `retry` action to 3e91a01664
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87152
Approved by: https://github.com/kit1980, https://github.com/izaitsevfb
2022-10-18 03:38:24 +00:00
5fb687182d Enable sdp_forward for NestedTensors (#86720)
# Summary
This PR implements a sdp_forward for NestedTensors. This impl will call into flash and mem_efficient_attention when possible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86720
Approved by: https://github.com/cpuhrsch
2022-10-18 02:00:04 +00:00
74138a8daa Use conda-forge in mac mps test (#87155)
https://github.com/pytorch/pytorch/pull/87150 works; most of the jobs are OK now.  However, I missed one last piece in the MPS test workflow https://github.com/pytorch/pytorch/actions/runs/3269594289/jobs/5377469209.  So this fixes the missing piece to use conda-forge.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87155
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi
2022-10-18 01:14:07 +00:00
9d1a8edc0e [vulkan] Use 2D texture types for convolution weights and biases (#86972)
Differential Revision: [D40385500](https://our.internmc.facebook.com/intern/diff/D40385500/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86972
Approved by: https://github.com/salilsdesai, https://github.com/kirklandsign
2022-10-18 00:55:19 +00:00
5b588036aa [vulkan] Enable 2D texture types (#86971)
Adds the ability to use 2D GPU textures to represent tensors. The `StorageType` enum can be used to represent other representation modes in the future, such as buffer representations, etc.

Differential Revision: [D40363112](https://our.internmc.facebook.com/intern/diff/D40363112/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86971
Approved by: https://github.com/kirklandsign
2022-10-18 00:52:00 +00:00
a7ed398cf6 Check all CUDA API calls in aten/src/ATen/test for errors (#74919) (#83556)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74919

Test Plan: Sandcastle

Differential Revision: D35194596

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83556
Approved by: https://github.com/malfet
2022-10-18 00:35:44 +00:00
f02f0e3ad1 Install blas from conda-forge (#87150)
Mitigate https://github.com/pytorch/pytorch/issues/87148

### Testing

On AWS (m1, linux)

* Run `conda install blas:openblas`, it should failed with `ChecksumMismatchError`:

```
ChecksumMismatchError: Conda detected a mismatch between the expected content and downloaded content
for url 'https://repo.anaconda.com/pkgs/main/linux-64/blas-1.0-openblas.conda'.
  download saved to: /tmp/debug/pkgs/blas-1.0-openblas.conda
  expected sha256: c85b5d0a336b5be0f415c71fd7fe2eca59e09f42221bfa684aafef5510ba5487
  actual sha256: 5dc5483db0d9785b19e021cee418a8ee03e0ff0e5ebd0b75af4927746604e187
```

* Run ` conda install -c conda-forge blas:openblas` works

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87150
Approved by: https://github.com/kit1980
2022-10-18 00:11:37 +00:00
9db7270ee7 Small update to Module note (#87142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87142
Approved by: https://github.com/cpuhrsch
2022-10-17 22:56:49 +00:00
fb614b1871 Enable UBSAN mode for test_jit (#85735)
# Summary

Run `test_jit` executable with UBSAN flag in order to catch errors that might cause internal breakage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85735
Approved by: https://github.com/dagitses
2022-10-17 22:15:50 +00:00
18cc00d399 [ci] put more logs in a folded group (#86138)
Fixes a request to not print the entire log file, but only the last couple of lines, since they are probably the most relevant.

all but last 300 lines of failing tests get put into a folded group
example https://github.com/pytorch/pytorch/actions/runs/3177200444/jobs/5177703202
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86138
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/lezcano
2022-10-17 22:10:23 +00:00
e3b84f6c9d remove dynamo hash updates (#87092)
Remove the workflow for updating the dynamo hash, as dynamo got moved into this repo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87092
Approved by: https://github.com/huydhn
2022-10-17 22:09:56 +00:00
4fd98dfe69 Don't only apply DDP optimizer on forward frames (#87097)
Previously a check would only apply DDP optimizer on frames named "forward".

But on hf_T5_large, a graph break causes some frames like:

```
<graph break in _shift_right>
<graph break in forward>
```

So instead, apply DDP optimizer on all frames.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87097
Approved by: https://github.com/wconstab
2022-10-17 21:55:14 +00:00
09d720919e Add venv to gitignore (#86702)
`venv` is the common directory for creating virtual environments. Adding it to gitignore to support development that does not use anaconda to manage envs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86702
Approved by: https://github.com/kit1980
2022-10-17 21:50:03 +00:00
0cb273b5d9 [DataPipe] Fixing interface generation in setup.py (#87081)
Based on the artifact generated on this [page](https://hud.pytorch.org/pr/87081), I downloaded [[s3] linux-focal-py3.7-clang7-asan/artifacts.zip](https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/3266430083/linux-focal-py3.7-clang7-asan/artifacts.zip) (1.14 GB) and unpacked it. `torch.utils.data.datapipes.datapipe.pyi` does exist. I believe this means the file should be part of the distribution.

I also did `wheel unpack ***.whl` to confirm the existence of the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87081
Approved by: https://github.com/ejguan
2022-10-17 21:45:33 +00:00
f5ee2d8840 [ci] fix bot comment (#87127)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87127
Approved by: https://github.com/clee2000
2022-10-17 21:27:21 +00:00
f552eee427 [Docs] Remove outdated comment for sparse all-reduce (#87018)
https://github.com/pytorch/pytorch/pull/23917 switched to using allgatherv instead of allgather for gloo sparse all-reduce. This PR removes a comment saying to use allgatherv if available since that has already been done.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87018
Approved by: https://github.com/H-Huang
2022-10-17 21:17:07 +00:00
d023e83933 handle libomp update on circleci (#86979)
libomp got an update and is now keg-only

reverts https://github.com/pytorch/pytorch/pull/86940
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86979
Approved by: https://github.com/huydhn, https://github.com/malfet
2022-10-17 21:03:42 +00:00
5acf6e0e80 Use 12xlarge for nightly cpp doc generation job (#86859)
The job has started to run out of memory a lot recently https://hud.pytorch.org/failure/Process%20completed%20with%20exit%20code%20137.  Probably more and more docs are being added, so this ups the runner for the cpp doc nightly from 4xlarge to the next tier of 12xlarge. This also chooses the smaller 2xlarge runner for python and functorch docs (maybe linux.large is good enough for them?)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86859
Approved by: https://github.com/malfet
2022-10-17 20:57:55 +00:00
4814270708 [dynamo] Introduce get_real_value API to TensorVariable (#87091)
Right now, example_value is doing two jobs:
- We use it to propagate metadata (e.g. return type, shapes, etc.)
  throughout the graph
- We use it to satisfy queries for the actual value (e.g. torch.cond,
  `assume_constant_result`)

This is further complicated by the fact that we have two modes, one where `example_value` is a fake tensor, and one where it is a real tensor (this is the `fake_tensor_propagation` config flag).

This leads to scenarios where we don't support every combination of job + mode, e.g. if `fake_tensor_propagation=False`, `assume_constant_result` is broken.

This is made worse by the fact that "fake tensor mode" is the default and is required if you want dynamic shapes to work.

So, this PR introduces a `get_real_value` API that just runs the graph up to `node` in order to get a concrete value. This API is orthogonal to `example_value`, so it doesn't care about `fake_tensor_propagation`.

When `fake_tensor_propagation=True`: `example_value` is a fake tensor, and you must use the `get_real_value` API to get a concrete value. This will be the only configuration in the future.

When `fake_tensor_propagation=False`: `example_value` and `get_real_value` will produce the same value. This is redundant, but we will be removing this config soon.

To support this, I introduce a cache for computed real values, to memoize the work involved if we're asking for real values a lot.

I attached this state to `OutputGraph` because it seems to be what
historically managed `example_value` lifetimes, but idk.
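
As a rough illustration of the memoized "run the graph up to `node`" idea described above, here is a hypothetical sketch with made-up names; it is not the code in this PR:

```python
import torch.fx as fx

def get_real_value(node, gm, example_inputs, cache):
    # Hypothetical sketch: obtain a concrete value for `node` by running a
    # copy of the graph truncated at `node`, memoizing results in `cache`.
    if node in cache:
        return cache[node]
    new_graph = fx.Graph()
    env = {}
    for n in gm.graph.nodes:
        if n.op == "output":
            break
        env[n] = new_graph.node_copy(n, lambda x: env[x])
        if n is node:
            break
    new_graph.output(env[node])
    # Assumes placeholders precede `node`, as in standard FX graphs.
    value = fx.GraphModule(gm, new_graph)(*example_inputs)
    cache[node] = value
    return value
```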

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87091
Approved by: https://github.com/wconstab
2022-10-17 20:14:43 +00:00
e85dbcc9b0 [docs] Fix ScalarTensor __repr__ in Extending PyTorch example (#86330)
This PR fixes the __repr__ of the `ScalarTensor` class in the Extending PyTorch example to correspond with the class name instead of `DiagonalTensor`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86330
Approved by: https://github.com/bdhirsh
2022-10-17 20:01:10 +00:00
b8007742c2 [Dynamo] More robust pyop support, module properties as args (#87020)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87020
Approved by: https://github.com/jansel
2022-10-17 19:55:39 +00:00
1167949b2d [ONNX] Ignore print(Tensor) during tracing (#86223)
Fixes #73619
Fixes https://github.com/microsoft/onnxruntime/issues/11812

This PR adds new symbolics: `aten::_conj`, `aten::conj_physical`, `aten::resolve_conj`, and `aten::resolve_neg`
While the last two are always no-ops by definition (they do not change nodes), the first two raise an exception, as they are not supported by ONNX yet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86223
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2022-10-17 19:45:33 +00:00
31931515bc Workarounds for cudnn_batch_norm with TorchRefsNvfuserCapabilityMode (#86796)
This PR adds workarounds to support AOT Autograd's graphs containing `aten.cudnn_batch_norm` and `aten.cudnn_batch_norm_backward` with `TorchRefsNvfuserCapabilityMode`.

The problem with the decomposition of `aten.cudnn_batch_norm` is that it uses a `new_empty` call that is not supported by nvFuser and we are conservative with lowering functions to nvprims by default.

The problem with the decomposition of `aten.cudnn_batch_norm_backward` is described here https://github.com/pytorch/pytorch/pull/86115#issue-1394883782, but changing the decomposition directly in that PR makes many tests fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86796
Approved by: https://github.com/mruberry
2022-10-17 18:46:28 +00:00
33343def0b add XLA backend into tensor type strings (#86881)
add XLA backend into tensor type strings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86881
Approved by: https://github.com/bdhirsh
2022-10-17 18:27:49 +00:00
317eeb81c3 Revert "OpInfo: Sample input cleanup (4/n) (#86324)"
This reverts commit 2a6d37d23d163a35c0b62c4319a6c2f049a27833.

Reverted https://github.com/pytorch/pytorch/pull/86324 on behalf of https://github.com/peterbell10 due to Caused tolerance issues in periodic test
2022-10-17 18:26:59 +00:00
8f85831fdf Give more clear error message when gscope is non-empty (#87005)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87005
Approved by: https://github.com/alanwaketan, https://github.com/Krovatkin
2022-10-17 18:17:01 +00:00
c01c7a5e2c [DataPipe] Fix missing functional name for FileLister (#86497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86497
Approved by: https://github.com/ejguan
2022-10-17 18:13:37 +00:00
c27a5171b8 Update action lint with missing new runners from scale-config (#87009)
Using a runner label like `linux.12xlarge` results in a linter failure from actionlint, e.g. https://github.com/pytorch/pytorch/actions/runs/3253740221/jobs/5341281952

```
Error (ACTIONLINT) [runner-label]
    label "linux.12xlarge" is unknown. available labels are "windows-
    latest", "windows-2022", "windows-2019", "windows-2016", "ubuntu-
    latest", "ubuntu-22.04", "ubuntu-20.04", "ubuntu-[18](https://github.com/pytorch/pytorch/actions/runs/3253740221/jobs/5341281952#step:7:19).04", "macos-latest",
    "macos-12", "macos-12.0", "macos-11", "macos-11.0", "macos-10.15",
    "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows",
    "linux.[20](https://github.com/pytorch/pytorch/actions/runs/3253740221/jobs/5341281952#step:7:21)_04.4x", "linux.20_04.16x", "linux.large", "linux.2xlarge",
    "linux.4xlarge", "linux.4xlarge.nvidia.gpu", "linux.8xlarge.nvidia.gpu",
    "linux.16xlarge.nvidia.gpu", "windows.4xlarge",
    "windows.8xlarge.nvidia.gpu", "bm-runner", "linux.rocm.gpu", "macos-m1-
    12", "macos-12-xl", "macos-12", "macos12.3-m1". if it is a custom label
    for self-hosted runner, set list of labels in actionlint.yaml config file

         47  |            # an OOM issue when running the job, so this upgrades the runner from 4xlarge
         48  |            # to the next available tier of 12xlarge. So much memory just to generate cpp
         49  |            # doc
    >>>  50  |            runner: linux.12xlarge
         51  |            # Nightly cpp docs take about 150m to finish, and the number is stable
         52  |            timeout-minutes: 180
         53  |          - docs_type: python
```

`linux.12xlarge` is a valid runner label from https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml. This also adds `linux.24xlarge` and `linux.g5.4xlarge.nvidia.gpu`, which were missing as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87009
Approved by: https://github.com/ZainRizvi
2022-10-17 17:39:19 +00:00
1704256b10 Enables where to have cpu scalar args (#87022)
This is for decompositions only; no attempt is made to have good performance for this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87022
Approved by: https://github.com/ezyang, https://github.com/eellison, https://github.com/mruberry
2022-10-17 17:08:47 +00:00
f3969bd8b5 [functorch] Fix cross to match unbatched behavior (#86926)
Fixes #83936 #83907

In #83936, I noticed that after I wrote cross, it's silently incorrect because I misunderstood what the fix to linalg was going to be. This fixes functorch to not be silently incorrect with `linalg.cross`. Since it's a silent correctness issue that I missed, I'm hoping to cherry pick it too
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86926
Approved by: https://github.com/zou3519
2022-10-17 16:56:21 +00:00
e271e823c7 Avoid calling logging.basicConfig (#86959)
Fixes https://github.com/pytorch/pytorch/issues/85952
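
For context, a minimal sketch of the library-friendly pattern this kind of fix moves toward (a generic illustration, not the diff itself): library code should use a module-level logger instead of configuring the root logger.

```python
import logging

# Calling logging.basicConfig() inside a library configures the root logger
# and can clobber the application's own logging setup. Prefer a module logger:
logger = logging.getLogger(__name__)

def do_work():
    logger.debug("goes through whatever handlers the application configured")
```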

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86959
Approved by: https://github.com/xwang233, https://github.com/davidberard98
2022-10-17 16:45:21 +00:00
6351220573 Add meta support for _adaptive_avg_pool2d_backward (#86359) (#87074)
This reverts commit 3edf79dc03193c98b665d62231fe69a10dfab1fa.

Reland of https://github.com/pytorch/pytorch/pull/86359
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87074
Approved by: https://github.com/ezyang
2022-10-17 16:15:04 +00:00
66715767ff Revert "[Dynamo] More robust pyop support, module properties as args (#87020)"
This reverts commit 3c320a5613c26aa3568c330ae1c34a03dadf2b5c.

Reverted https://github.com/pytorch/pytorch/pull/87020 on behalf of https://github.com/ZainRizvi due to This appears to have caused two periodic tests to fail
2022-10-17 16:02:49 +00:00
8617f5f481 fix cudagraphify for inplace parameter change (#87060)
Fixes https://github.com/pytorch/torchdynamo/issues/1687
cc @albanD, @chillee, I don't know what I'm doing.
According to previous discussions, calling `detach()` on inputs can cause bugs if inputs are later inplace-resized (cc @ezyang) https://github.com/pytorch/pytorch/pull/85301/files#diff-8678402e01603e588fcf175a61de9ed578d885b1cc082e028021856190223fb7L433, but should we weed out these patterns before they are sent to cudagraphify?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87060
Approved by: https://github.com/jansel, https://github.com/albanD
2022-10-17 15:59:05 +00:00
2c6167c4bb Revert "[inductor] Use decomps for unfold (#87025)"
This reverts commit 5099883f059a9b15592b8ba3b7bf83145163b966.

Reverted https://github.com/pytorch/pytorch/pull/87025 on behalf of https://github.com/ZainRizvi due to Breaks periodic tests
2022-10-17 15:44:15 +00:00
2b558138cf [inductor] Set correct strides in fallback example run (#87049)
Fixes #ISSUE_NUMBER

Helps in resolving many issues seen in https://github.com/pytorch/torchdynamo/issues/1675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87049
Approved by: https://github.com/jansel
2022-10-17 15:43:53 +00:00
4e5357faf5 ATen/native (2/6): Use per-operator headers (#75572)
Differential Revision: [D40126702](https://our.internmc.facebook.com/intern/diff/D40126702)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75572
Approved by: https://github.com/DanilBaibak, https://github.com/malfet
2022-10-17 15:27:02 +00:00
b40f4434ac conv backward impl (#87047)
~~Waiting for test run to see if this backward is actually exercised.
If not, I will add test before merging.~~
Test updated. Ready to go now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87047
Approved by: https://github.com/ezyang
2022-10-17 13:14:12 +00:00
1463013c85 autograd clone_obey_contract() symint support (#87044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87044
Approved by: https://github.com/ezyang
2022-10-17 13:14:12 +00:00
86c2e44cb6 meta funcs for avg_pool2d and avg_pool2d_backward (#87043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87043
Approved by: https://github.com/ezyang
2022-10-17 13:14:10 +00:00
c21dcffc00 Very limited pow support (#87042)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87042
Approved by: https://github.com/ezyang
2022-10-17 13:14:07 +00:00
37e9e89afb [xla hash update] update the pinned xla hash (#87067)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87067
Approved by: https://github.com/pytorchbot
2022-10-17 10:55:45 +00:00
91b3cd0b5a [primTorch] Add a ref for narrow_copy (#86748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86748
Approved by: https://github.com/mruberry
2022-10-17 10:16:05 +00:00
847ded6db3 [primTorch] Implement NLL loss reference (#81128)
Add Reference:
- nll_loss

Depends on:
- expand https://github.com/pytorch/pytorch/pull/79820
- advance indexing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81128
Approved by: https://github.com/mruberry
2022-10-17 06:20:31 +00:00
78e2289005 [TorchInductor] enable inplace buffers by default (#87037)
This PR enables the inplace_buffers configuration by default after fixing issue: https://github.com/pytorch/torchdynamo/issues/1670. UT is added to cover the fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87037
Approved by: https://github.com/jansel
2022-10-17 06:05:30 +00:00
1b43883fd6 Make AdamW, NAdam & RAdam differentiable (#86183)
Blocked by #86096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86183
Approved by: https://github.com/albanD
2022-10-17 04:32:08 +00:00
364a9973ca [vision hash update] update the pinned vision hash (#87021)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87021
Approved by: https://github.com/pytorchbot
2022-10-17 03:17:03 +00:00
3a4c0900c7 Reland 3 of Merge more symbolic meta kernels and symint changes from branch (#86795)
Take 3
Contains:
- symintification of split*
- floor support on SymFloat
- pad_backward, gather, scatter meta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86795
Approved by: https://github.com/z-a-f
2022-10-17 02:09:40 +00:00
0379af681b [inductor] Disable parallel compile (#87048)
https://github.com/pytorch/pytorch/pull/87032 seems to have an issue that breaks our benchmark script; it might have to do with the benchmark script also using subprocess.

Before this PR:
```
$ ./benchmarks/dynamo/torchbench.py --performance --inductor --raise --training --float16
...
Traceback (most recent call last):
  File "/home/jansel/conda/envs/pytorch/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 239, in _worker_compile
    kernel = TritonCodeCache.load(source_code)
  File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 234, in load
    mod = PyCodeCache.load(source_code)
  File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 212, in load
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_jansel/ij/cij7smji4sw2a56i4yz45bjkrosd2sb2raqnxzsxxpg4kwzuo2ta.py", line 5, in <module>
    from torch._inductor.triton_ops.autotune import reduction
  File "/home/jansel/pytorch/torch/_inductor/triton_ops/__init__.py", line 3, in <module>
    if has_triton():
  File "/home/jansel/pytorch/torch/_inductor/utils.py", line 38, in has_triton
    return triton is not None and torch.cuda.get_device_capability() >= (7, 0)
  File "/home/jansel/pytorch/torch/cuda/__init__.py", line 368, in get_device_capability
    prop = get_device_properties(device)
  File "/home/jansel/pytorch/torch/cuda/__init__.py", line 382, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/home/jansel/pytorch/torch/cuda/__init__.py", line 228, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```

cc @zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87048
Approved by: https://github.com/soumith
2022-10-17 01:02:43 +00:00
3007efda08 stft: Require return_complex to be passed explicitly for real input (#86724)
This behavior has been deprecated since PyTorch 1.8 but this step of
the deprecation cycle was put on hold in #50102 waiting for JIT
upgraders functionality which doesn't seem to have panned out. I'd say
there has been more than enough of a deprecation period, so we should
just continue.

BC-breaking message:

`torch.stft` takes an optional `return_complex` parameter that
indicates whether the output should be a floating point tensor or a
complex tensor. `return_complex` previously defaulted to `False` for
real input tensors. This PR removes the default and makes
`return_complex` a required argument for real inputs. However, complex
inputs will continue to default to `return_complex=True`.
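
A minimal sketch of the new calling convention for real inputs (the values here are arbitrary):

```python
import torch

signal = torch.randn(4096)  # real-valued input

# return_complex must now be passed explicitly for real inputs;
# previously it silently defaulted to False with a deprecation warning.
spec = torch.stft(signal, n_fft=512, window=torch.hann_window(512), return_complex=True)
print(spec.dtype)  # torch.complex64
```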
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86724
Approved by: https://github.com/mruberry, https://github.com/albanD
2022-10-16 22:26:35 +00:00
2b7236a0e1 [torchdynamo] Use ProcessPoolExecutor for triton compiles (#87032)
This patch significantly improves the parallel compilation performance for compiling triton kernels by using ProcessPoolExecutor to create a persistent pool of compilation workers.

Previously os.fork overhead and GIL contention limited the achieved parallelism. This patch replaces the worker threads with a pool of processes to do the raw compilation, and does serial work on the main thread for everything else. This other work couldn't be parallelized anyway since it is mostly in python.

In cold start situations, the time to get the worker threads started can be a significant portion of the time. This patch starts the workers earlier so they are ready to perform compilation (see code comments) when dynamo gets to that point.

Just tested this on one example benchmark (tf_efficientnet_b0), but the results are significant, almost eliminating the difference between a warm and cold compilation.

```
39.613s - warm
41.290s - cold, this patch

2m53.197s - cold, single threaded
1m7.092s - cold, old setup n = 8 (its best config)
```
 (cold compilation is done after running `rm -rf /tmp/torchinductor_$USER`).
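
The general pattern is sketched below with a stand-in compile function (not the inductor code itself): start the `ProcessPoolExecutor` as early as possible so worker startup overlaps with the serial Python work on the main thread.

```python
import concurrent.futures

def compile_kernel(source_code: str) -> str:
    # Stand-in for the raw, picklable compilation work done in a worker process.
    return f"compiled {len(source_code)} bytes"

def main():
    # Create the pool early; by the time jobs are submitted the workers are warm.
    with concurrent.futures.ProcessPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(compile_kernel, f"kernel source {i}") for i in range(32)]
        # Serial post-processing (mostly Python-level work) stays on the main thread.
        return [f.result() for f in futures]

if __name__ == "__main__":
    print(main())
```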
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87032
Approved by: https://github.com/soumith, https://github.com/jansel
2022-10-16 21:58:26 +00:00
945d333ae4 Migrate dynamo CI test shards to torch._dynamo (#87039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87039
Approved by: https://github.com/voznesenskym
2022-10-16 21:35:57 +00:00
30f6f6903c [inductor] Move size asserts to C++, fix bug (#87028)
Inductor internally models any `size=1` dimension as having `stride=0` to simplify indexing formulas (sympy will remove these terms from the expression).

This caused a bug in our generated stride assert in detectron2_maskrcnn_r_50_fpn, where we asserted the wrong stride of a size==1 dimension.

This fixes that bug, and moves size/stride assert logic to C++ which should be a small perf gain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87028
Approved by: https://github.com/anijain2305
2022-10-16 20:17:22 +00:00
d45e99acf5 [dynamo] Put printing graph breaks behind a config option (#87026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87026
Approved by: https://github.com/soumith, https://github.com/voznesenskym
2022-10-16 19:53:42 +00:00
2a6d37d23d OpInfo: Sample input cleanup (4/n) (#86324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86324
Approved by: https://github.com/mruberry
2022-10-16 19:12:44 +00:00
5099883f05 [inductor] Use decomps for unfold (#87025)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87025
Approved by: https://github.com/soumith
2022-10-16 17:10:33 +00:00
8a8cd092c8 Add labeler with dynamo/inductor paths to start (#87024)
The other missing ingredient is getting CC bot to work on labels on PRs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87024
Approved by: https://github.com/soumith, https://github.com/jansel
2022-10-16 06:13:18 +00:00
a0c2a7f2ed Add triton to CI (#86988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86988
Approved by: https://github.com/malfet, https://github.com/voznesenskym, https://github.com/soumith
2022-10-16 03:35:36 +00:00
3c320a5613 [Dynamo] More robust pyop support, module properties as args (#87020)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87020
Approved by: https://github.com/jansel
2022-10-16 02:15:10 +00:00
5d6e831563 OpInfo: Sample input cleanup (3/n) (#86380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86380
Approved by: https://github.com/mruberry
2022-10-15 22:14:09 +00:00
054a2fd6c2 Sync changes from pytorch/torchdynamo (#87013)
This updates to:
6380959be2

Generated with:
https://github.com/pytorch/torchdynamo/blob/main/copy_to_core.sh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87013
Approved by: https://github.com/voznesenskym
2022-10-15 21:00:57 +00:00
2c1bc216b8 Fixed partitioner issue with getitem and made metadata a storage more consistent (#87012)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87012
Approved by: https://github.com/ngimel
2022-10-15 17:58:55 +00:00
91c7015426 [einsum] Fix opt_einsum defaults to be more reasonable (#86985)
Fixes the confusing situation mentioned here https://github.com/pytorch/pytorch/issues/85224#issuecomment-1278628262 by

- setting better OG defaults
- changing warnings to errors now that we have better defaults

Test plan:
- Ran einsum tests locally + CI
- Uninstalled opt-einsum and ran through setting
     - `enabled` to False (doesn't throw error)
     - `strategy` to anything that's not None (errors)
     - `strategy` to None (noops)
- Installed opt-einsum and ran through setting
     - `enabled` to False (doesn't throw error)
     - `enabled` to True (doesn't throw error, no ops + defaults to 'auto')
     - `strategy` to random string (errors)
     - `strategy` to None (noops, still is 'auto')
     - `strategy` to 'greedy' (is set to 'greedy')
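
The settings exercised above can be reproduced roughly like this (a sketch assuming the `torch.backends.opt_einsum` interface; whether setting `strategy` errors depends on whether the opt-einsum package is installed, as described in the test plan):

```python
import torch

# Query whether the opt-einsum package is available.
print(torch.backends.opt_einsum.is_available())

# Disabling falls back to the default left-to-right contraction order.
torch.backends.opt_einsum.enabled = False
torch.backends.opt_einsum.enabled = True

if torch.backends.opt_einsum.is_available():
    torch.backends.opt_einsum.strategy = "greedy"  # "auto" is the default; bad strings raise

a, b = torch.randn(8, 16), torch.randn(16, 4)
print(torch.einsum("ij,jk->ik", a, b).shape)  # torch.Size([8, 4])
```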
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86985
Approved by: https://github.com/soulitzer
2022-10-15 06:23:50 +00:00
7980ed95bd Support unpacking python dictionary in torch.jit.trace() (#81623)
# Support unpacking python dictionary in **torch.jit.trace()**

## Problem statement & Motivation
### Problem 1(usability):
Say, if you have a model and its forward method defined as follows:
**`def forward(self, key1=value1, key2=value2, key3=value3)`**
And you have a dataset and each data point in the dataset is a python dict as follows:
**`data = {key1:value1, key3:value3, key2:value2}`**

The problem is that if you want to trace the model using the dict data from the given dataset, you need to unpack the dictionary, reorder its values manually, and make up a tuple such as **`data_tuple = (value1, value2, value3)`** to use as the **`example_inputs`** parameter of **`torch.jit.trace()`**. This marshalling process is not user friendly.

### Problem 2 (feasibility):
Say, if you have a model and its forward method defined as follows:
**`def forward(self, key1=None, key2=None, key3=None)`** -> The default value is **None**
And you have a dataset and each data point in the dataset is a python dict as follows:
**`data = {key1:value1, key3:value3}`** -> Only **part of** the values required by forward are given; the rest use the default value.

The problem is that if you want to trace the model using the dict data from the given dataset, it's not feasible at all, because you can pass neither a tuple like **`T1 = (value1, value3)`** nor **`T2 = (value1, None, value3)`**. T1 would mismatch value3 with key2, and T2 includes the **None** type, which is blocked by the tracer's type checking. (Of course you can pass **`T3 = (value1,)`** to make the trace function finish without an exception, but the traced model you get is probably not what you expect, since a different input may result in a different traced result.)

These problems come from the HuggingFace's PT model, especially in text-classification tasks with datasets such as [MRPC,](https://paperswithcode.com/dataset/mrpc)  [MNLI](https://paperswithcode.com/dataset/multinli) etc.

## Solution
To address these two issues, we propose to support a new type, namely a python dict, as the example_inputs parameter for torch.jit.trace(). We can use the runtime type information of the example_inputs object to determine whether we fall back to the original tuple path or go into the new dictionary path. Both problem 1 and problem 2 can be solved by utilizing the **`**`** operator.

## Limitation & Mitigation

1. If we use a dict as example_inputs to trace the model, then we have to pass a dictionary to the traced model too. (Because we may change the order of the input parameters' debug names in the TorchScript IR, we can't assume the traced model's input parameter order is the same as in the original model.) We need to highlight this in the documentation to mitigate this problem.

    For example:
```
# fetch a data from dataloader, and the data is a dictionary
# and the example_inputs_dict is like: {key1:value1, key3:value3, key2:value2}
# the forward() is like: def forward(self, key1=value1, key2=value2, key3=value3)
example_inputs_dict = next(iter(dataloader))
jit_model = model.eval()
# use the dictionary to trace the model
jit_model = torch.jit.trace(jit_model, example_inputs_dict, strict=False)  # Now the IR will be graph(%self : __torch__.module.___torch_mangle_n.Mymodule, %key1 : type1, %key3 : type3, %key2 : type2)
jit_model = torch.jit.freeze(jit_model)

# It's OK to use dict as the parameter for traced model
jit_model(**example_inputs_dict)

example_inputs_tuple = (value1, value3, value2)
# It's wrong to rely on the original args order.
jit_model(*example_inputs_tuple)

```
## Note
1. This PR will make some UTs introduced in [39601](https://github.com/pytorch/pytorch/pull/39601) fail, which I think should be classified as unpacking a tuple containing a single dictionary element in our solution.
2. I think there is some ambiguity: currently we only specify passing a tuple or a single Tensor as the example_inputs parameter in **torch.jit.trace()**'s documentation, but it seems we can still pass a dictionary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81623
Approved by: https://github.com/davidberard98
2022-10-15 05:33:09 +00:00
bdefa260b2 [RFC] Separate CPU offload activation to its own wrapper (#85459)
Passing `offload_to_cpu=True` to checkpoint_wrapper is a bit confusing, because this causes the activation checkpoint args to be ignored and we do CPU offloading. This isn't ideal from an API design perspective, so this proposes making `offload_wrapper` its own concept.

Now, offload to CPU + checkpoint can be composed together, such as

```
# apply AC to transformer layers
apply_ac_wrapper(model, checkpoint_wrapper, check_fn=lambda mod: isinstance(mod, TransformerLayer))
# offload the rest of activations to CPU
model = offload_wrapper(model)
```

Will polish / add tests if this proposal sounds good.

Differential Revision: [D39719854](https://our.internmc.facebook.com/intern/diff/D39719854/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85459
Approved by: https://github.com/awgu
2022-10-15 05:19:28 +00:00
100113b877 [quant][docs] Formatting fixes for fx graph mode quantization README (#86914)
Summary:
att

Test Plan:
No code changes involved

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86914
Approved by: https://github.com/vkuzo
2022-10-15 03:45:58 +00:00
f6f1aefb8f [vision hash update] update the pinned vision hash (#86758)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86758
Approved by: https://github.com/pytorchbot
2022-10-15 03:25:05 +00:00
46aaae98c5 torchdynamo: add linear pointwise(binary) fusion kernel (#86583)
Support binary fusion of Linear with:

- add
- sub
- mul
- div

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86583
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-10-15 01:57:42 +00:00
5210fab64d torchdynamo: add convolution pointwise(binary) fusion kernel (#86582)
Support binary fusion of Convolution with:

- add
- sub
- mul
- div
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86582
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-10-15 01:55:08 +00:00
9a7a49b254 torchdynamo: add convolution pointwise(unary) fusion kernel (#86581)
Support unary fusion of Convolution with:

- relu
- sigmoid
- tanh
- hardswish
- leaky_relu
- hardtanh
- gelu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86581
Approved by: https://github.com/jgong5, https://github.com/jansel
2022-10-15 01:51:01 +00:00
d5a7e6db38 ATen/native (1/6): Use per-operator headers (#75571)
Differential Revision: [D40126698](https://our.internmc.facebook.com/intern/diff/D40126698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75571
Approved by: https://github.com/malfet
2022-10-15 01:43:26 +00:00
4584d06e76 [data] add autocompletion to datapipes (#86960)
In REPLs (e.g. jupyter notebook) autocomplete now works:

<img width="750" alt="image" src="https://user-images.githubusercontent.com/53842584/195776448-f33180da-d1cd-4e47-b9a0-4fd9eb2f78b7.png">

even with custom data pipes:

<img width="804" alt="image" src="https://user-images.githubusercontent.com/53842584/195776957-5c51895e-f469-4b13-81ba-c9b507022555.png">

Unfortunately I wasn't able to figure out how to get autocomplete to work for non-REPLs (e.g. VSCode) - may need to generate fake pyi stubs, which 1) won't work for custom datapipes and 2) is a larger project to tackle :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86960
Approved by: https://github.com/NivekT
2022-10-15 00:25:26 +00:00
3924aa75b1 [BE] Extend linter to detect DOS newlines (#86973)
Fix DOS newlines in `onednn/decompose_silu.[cpp|h]` introduced by https://github.com/pytorch/pytorch/pull/85591 as well as one in `.github/PULL_REQUEST_TEMPLATE.md`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86973
Approved by: https://github.com/huydhn, https://github.com/izaitsevfb
2022-10-15 00:20:42 +00:00
b8aa1767cd [quant][be] Remove unused helper functions in convert.py (#86913)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86913
Approved by: https://github.com/vkuzo
2022-10-15 00:08:36 +00:00
761ca20dd8 [quant][be] Rename qconfig_map to node_name_to_qconfig (#86861)
Summary:
att, with the introduction of QConfigMapping, this name is now very confusing, so renamed
it to something clearer

Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86861
Approved by: https://github.com/vkuzo
2022-10-15 00:08:36 +00:00
8f71e8de7e Sync changes from pytorch/torchdynamo, enable tests (#86950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86950
Approved by: https://github.com/Chillee
2022-10-14 23:08:58 +00:00
78ef40973c Set -Werror=braced-scalar-init (#86911)
- `vector<T>({0})` would give you the vector(size, ...) ctor and produce an empty vector of T, along with the scalar-init warning
- `vector<T>({T(0)})` would give you the vector of a single T(0) as you might have intended, and bypasses the warning/error
- the warning can easily be missed but can have serious consequences, so make it an error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86911
Approved by: https://github.com/albanD
2022-10-14 22:34:36 +00:00
155b885806 [xnnpack][lite-int] preprocess (#86980)
Split up original preprocess diff:

This diff introduces the skeleton structure of the delegate APIs, starting with the method compile spec error handling. For now it just outputs an empty tensor object upon execute, but it proves that the delegate APIs are working and that a new XNNPACK delegate backend has been added.

Differential Revision: [D38562918](https://our.internmc.facebook.com/intern/diff/D38562918/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D38562918/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86980
Approved by: https://github.com/salilsdesai, https://github.com/cccclai
2022-10-14 22:07:12 +00:00
7c73b45621 [onnx] Add support for autograd function inlining in ONNX_ATEN_FALLBACK mode (#85736)
Solution to #85027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85736
Approved by: https://github.com/BowenBao
2022-10-14 21:58:01 +00:00
d29c8c0ffa enable optim tests on dynamo to test flaky bot (#86976)
will link the issue that disabled them if this gets approved
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86976
Approved by: https://github.com/albanD
2022-10-14 21:44:13 +00:00
1a7409c771 [CoreML][ios_crash] Use special throw macro when encountering CoreML API errors (#86938)
Error messages from TORCH_CHECK are stripped during production builds via -DSTRIP_ERROR_MESSAGES. This diff introduces a new macro COREML_CHECK which will always preserve the error message. This macro is used when encountering errors produced by CoreML API calls so that we have enough context to debug.

Differential Revision: [D40351013](https://our.internmc.facebook.com/intern/diff/D40351013/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86938
Approved by: https://github.com/salilsdesai
2022-10-14 21:06:25 +00:00
34c86adec4 symintify all of derivatives.yaml (#86610)
Big-bang PR to symintify **all** .sizes() calls in derivatives.yaml, which will be needed for symbolic tracing.

* with the exception of `split()`, which is tougher to land because it requires internal changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86610
Approved by: https://github.com/albanD
2022-10-14 20:15:48 +00:00
d7bbb61f6b min/max support for SymInt/Floats, finish as_strided/scatter/squeeze() backward symint support (#86609)
This PR shouldn't matter too much, but I figured I'd land it instead of deleting. `PySymInt.min/max` are technically broken today, and this fixes them - but it doesn't matter (yet) because nobody is calling `min()` / `max()` on symints from python (they all happen using `std::min/max` in C++, which desugar to lt / gt calls).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86609
Approved by: https://github.com/albanD
2022-10-14 20:15:48 +00:00
1bb609ad47 Added new test test_compare_cpu that checks if cpu and gpu results are consistent (#85011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85011
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-14 20:15:16 +00:00
e027740e77 Chore: Add 'mps' to the docs of tensor_attributes (#86585)
Since PyTorch supports 'mps' (Apple metal) devices it should be reflected in the documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86585
Approved by: https://github.com/albanD
2022-10-14 19:59:33 +00:00
fc3afc8407 Remove empty_like+fill from AOT Autograd graphs for nvFuser (#86908)
AOT Autograd records C++ code `1 - tensor` as a sequence of empty_like, fill, and sub (see https://github.com/pytorch/pytorch/issues/86612).

Both empty_like and fill are not supported yet. This PR is a workaround for enabling fusions of `silu_backward`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86908
Approved by: https://github.com/ngimel
2022-10-14 19:49:39 +00:00
56a744bf47 [ONNX] Reland: Update training state logic to support ScriptedModule (#86745)
In https://github.com/pytorch/pytorch/issues/86325, it was reported that ScriptedModule does not have a training attribute and will fail export because we don't expect it as input.

Also

- Parameterized the test_util_funs test

Thanks @borisfom for the suggestion!

Fixes #86325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86745
Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao
2022-10-14 19:44:47 +00:00
527ebedbff Sparse support for ReLU (#86749)
ReLU support for all sparse layouts, including backward.

Fixes #85208
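
A small illustration of the newly supported call with a COO sparse tensor (forward only; the change also covers backward and the other sparse layouts):

```python
import torch

indices = torch.tensor([[0, 1, 1], [2, 0, 2]])
values = torch.tensor([-1.0, 2.0, -3.0])
sp = torch.sparse_coo_tensor(indices, values, (2, 3))

out = torch.relu(sp)      # sparse in, sparse out; negative values are clamped to 0
print(out.to_dense())
```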
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86749
Approved by: https://github.com/cpuhrsch, https://github.com/nikitaved
2022-10-14 19:16:26 +00:00
ef045695e0 Fix decomp for huber_loss_backward (#86955)
Fixes https://github.com/pytorch/pytorch/issues/86846

aten.huber_loss_backward calls aten.huber_loss_backward.out in its CompositeExplicitAutograd kernel.
The decomp was mistakenly registered for both aten.huber_loss_backward.default and aten.huber_loss_backward.out.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86955
Approved by: https://github.com/Chillee
2022-10-14 18:53:02 +00:00
7da018b2f8 [functorch] fix fbcode tests (#86936)
Differential Revision: [D40358418](https://our.internmc.facebook.com/intern/diff/D40358418)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86936
Approved by: https://github.com/samdow
2022-10-14 18:42:38 +00:00
f17b3e9b7a Vectorize tensor lerp kernel (#84845)
Fixes #86964

In a simple timeit benchmark I see 1.7x speedup for complex64, from 6.7 us to
3.9 us; and a 3.2x speedup for float32, from 6.2 us to 1.9 us.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84845
Approved by: https://github.com/lezcano, https://github.com/malfet
2022-10-14 18:29:02 +00:00
13cff2ee8e [MPS] Copy from CPU always add storageOffset (#86958)
Because why wouldn't it?
Fixes https://github.com/pytorch/pytorch/issues/86052

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86958
Approved by: https://github.com/kulinseth
2022-10-14 17:35:18 +00:00
1ece1ab6c2 [ci] print rerun stacktraces for pytest (#86831)
example: https://github.com/pytorch/pytorch/actions/runs/3238428826/jobs/5306808276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86831
Approved by: https://github.com/huydhn
2022-10-14 17:31:31 +00:00
d393a463ff Fix functorch test selection logic (#86944)
I realize that `run_test.py` doesn't take into account functorch test selection logic at the moment; for example, `python test/run_test.py --functorch -i functorch/test_ops --verbose` still runs all functorch tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86944
Approved by: https://github.com/clee2000, https://github.com/malfet
2022-10-14 17:26:52 +00:00
bbd7b38d55 Revert "symintify nll loss fns (#86915)"
This reverts commit 0ece7c86d829e2515e8b7d5df13cf0279b70c0e9.

Reverted https://github.com/pytorch/pytorch/pull/86915 on behalf of https://github.com/anjali411 due to test_autocast_nn_fp32 fails
2022-10-14 17:22:55 +00:00
0ece7c86d8 symintify nll loss fns (#86915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86915
Approved by: https://github.com/albanD
2022-10-14 17:06:56 +00:00
a86278b08c [FSDP] Consolidate FSDP state_dict offload_to_cpu settings (#86211)
Consolidate FSDP state_dict offload_to_cpu settings. All state_dict_types now
have offload_to_cpu options.

Differential Revision: [D40065969](https://our.internmc.facebook.com/intern/diff/D40065969/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86211
Approved by: https://github.com/rohan-varma
2022-10-14 16:23:28 +00:00
c9a8d309bd add super setup to test to enable disabling in test_dims.py (#86953)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86953
Approved by: https://github.com/huydhn
2022-10-14 16:04:04 +00:00
8eb579e362 Revert "[Profiler] Move legacy profiler out of torch/csrc/autograd (#85512)"
This reverts commit 157a3d2a7cd25779258f3e3dcef14633f1930103.

Reverted https://github.com/pytorch/pytorch/pull/85512 on behalf of https://github.com/DanilBaibak due to Due to files were deleted, the internal build failed. Please re-submit via codev.
2022-10-14 14:56:59 +00:00
4460e40db4 [primTorch] Add a ref for addcmul (#86731)
Based on:
https://github.com/pytorch/pytorch/pull/79827
https://github.com/pytorch/pytorch/pull/72949
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86731
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-14 14:26:23 +00:00
746500d58d Revert "[cuDNN] Enable cuDNN Frontend v8 API by Default (#84948)"
This reverts commit 427e0a6b4ebc691f1fa98662d04d5c431a75107f.

Reverted https://github.com/pytorch/pytorch/pull/84948 on behalf of https://github.com/malfet due to Broke SM86 sanity
2022-10-14 14:25:51 +00:00
2cfc4cb367 Add optional recomputable_ops argument for the min cut partitioner (#86686)
`min_cut_rematerialization_partition` has a default set of hard-coded operations that are allowed to be recomputed in the backward pass.
This PR adds customization ability to this function allowing users to control the behavior by passing `recomputable_ops` instead of relying on the default setting.
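
A rough usage sketch, assuming the partitioner is driven through the `functorch.compile` entry points; the no-op compiler and the op set passed here are purely illustrative:

```python
from functools import partial

import torch
from functorch.compile import aot_function, min_cut_rematerialization_partition

def fn(x):
    return torch.sin(x).cos().sum()

def nop_compiler(gm, example_inputs):
    # A no-op "compiler" that just runs the captured FX graph.
    return gm

# Only allow sin to be recomputed in the backward pass (illustrative op set).
partition_fn = partial(
    min_cut_rematerialization_partition,
    recomputable_ops={torch.ops.aten.sin},
)

compiled = aot_function(fn, fw_compiler=nop_compiler, partition_fn=partition_fn)
compiled(torch.randn(8, requires_grad=True)).backward()
```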
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86686
Approved by: https://github.com/Chillee
2022-10-14 12:15:30 +00:00
fd80684784 Add nvFuser support for torch.Tensor.view (#84634)
This is an alternative to https://github.com/pytorch/pytorch/pull/83739. While PrimTorch has `view` as a reference, we would like to use nvFuser's implementation for `view` for now. Later we might transition to PrimTorch's `torch._refs.view`.

See `test_nvprims_view` for examples of things that are now sent to nvFuser. Note that nvFuser's `view` is a copy-like operation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84634
Approved by: https://github.com/kevinstephano, https://github.com/mruberry
2022-10-14 12:08:02 +00:00
b48deedb77 Set up new module torch.signal.windows (#85599)
Resolves #85366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85599
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-14 11:33:32 +00:00
056cfb0464 Revert "[ONNX] Update training state logic to support ScriptedModule (#86745)"
This reverts commit 960b98128e475b15b66119f325232039799852cd.

Reverted https://github.com/pytorch/pytorch/pull/86745 on behalf of https://github.com/janeyx99 due to  960b98128e broke onnx tests on trunk
2022-10-14 05:40:20 +00:00
157a3d2a7c [Profiler] Move legacy profiler out of torch/csrc/autograd (#85512)
The legacy profiler is an eyesore in the autograd folder. At this point the implementation is almost completely decoupled from the rest of the profiler, and it is in maintenance mode pending deprecation.

As a result, I'm moving it to `torch/csrc/profiler/standalone`. Unfortunately, BC requires that the symbols remain in `torch::autograd::profiler`, so I've put some basic forwarding logic in `torch/csrc/autograd/profiler.h`.

One strange bit is that `profiler_legacy.h` forward declares `torch::autograd::Node`, but doesn't seem to do anything with it. I think we can delete it, but I want to test to make sure.

(Note: this should not land until https://github.com/pytorch/torchrec/pull/595 is landed.)

Differential Revision: [D39108648](https://our.internmc.facebook.com/intern/diff/D39108648/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85512
Approved by: https://github.com/aaronenyeshi
2022-10-14 05:38:48 +00:00
35fb007749 [Profiler][Minor] Separate standalone profilers from the main PyTorch profiler. (#85511)
There are a number of instrumentation utils which have been added to the profiler toolkit. They are generally small and self contained, often wrapping vendor APIs. (NVTX, ITT)

They don't really interact with the much more expansive machinery of the PyTorch profiler beyond registration / unregistration, minor util sharing, and reusing the profiler base class. Just as in the case of stubs, it makes sense to group them in a dedicated subfolder.

Differential Revision: [D39108649](https://our.internmc.facebook.com/intern/diff/D39108649/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39108649/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85511
Approved by: https://github.com/albanD
2022-10-14 05:38:48 +00:00
b8f14b7877 [Profiler][Minor] Group and consolidate stub APIs (#85510)
There is a concept in profiler of a stub that wraps a profiling API. It was introduced for CUDA profiling before Kineto, and ITT has adopted it to call into VTune APIs. However for the most part we don't really interact with them when developing the PyTorch profiler.

Thus it makes sense to unify the fallback registration mechanism and create a subfolder to free up real estate in the top level `torch/csrc/profiler` directory.

Differential Revision: [D39108647](https://our.internmc.facebook.com/intern/diff/D39108647/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85510
Approved by: https://github.com/aaronenyeshi
2022-10-14 05:38:46 +00:00
bc4ca4c2c4 [FSDP] Fix load_sharded_state_dict FQN mismatches for shared parameters (#86524)
`_sharded_pre_load_state_dict_hook()` should call `_param_fqns()` to ensure shared parameter names are also included.

Differential Revision: [D40201304](https://our.internmc.facebook.com/intern/diff/D40201304/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86524
Approved by: https://github.com/rohan-varma
2022-10-14 05:19:16 +00:00
960b98128e [ONNX] Update training state logic to support ScriptedModule (#86745)
In https://github.com/pytorch/pytorch/issues/86325, it was reported that ScriptedModule does not have a training attribute and will fail export because we don't expect it as input.

Also

- Parameterized the test_util_funs test

Thanks @borisfom for the suggestion!

Fixes #86325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86745
Approved by: https://github.com/AllenTiTaiWang, https://github.com/BowenBao
2022-10-14 01:31:40 +00:00
f451e824f3 Revert " C10D extension to enable per-thread PG (#86348)"
This reverts commit 97abc21f2bda38e73de2a86da7f43c8126930681.

Reverted https://github.com/pytorch/pytorch/pull/86348 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks macos tests 97abc21f2b
2022-10-14 01:26:46 +00:00
c16c4a37ab Remove functorch copy of conftest.py (#86927)
Now that its tests have been moved to the PyTorch test suite, this file is no longer needed. It was a leftover from https://github.com/pytorch/pytorch/pull/86623

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86927
Approved by: https://github.com/clee2000
2022-10-14 00:47:16 +00:00
b3b9786fdd Unified symbolic shape variables between AOTAutograd and Inductor (#86659)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86659
Approved by: https://github.com/wconstab
2022-10-14 00:24:43 +00:00
c7c09722ad Move TorchDynamo into PyTorch core (#86461)
Context:
https://github.com/pytorch/torchdynamo/issues/1588

This PR moves [TorchDynamo](https://github.com/pytorch/torchdynamo) and TorchInductor into PyTorch core.
- `torchdynamo` becomes `torch._dynamo`
- `torchinductor` becomes `torch._inductor`

This PR was generated by running `copy_to_core.sh` in https://github.com/pytorch/torchdynamo/pull/1538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86461
Approved by: https://github.com/voznesenskym
2022-10-13 23:18:06 +00:00
97abc21f2b C10D extension to enable per-thread PG (#86348)
Move a bunch of globals to instance methods and replace all uses of them.

We move all PG related globals under World and use a singleton instance under _world.

This creates an undocumented extension point to inject full control of how c10d
state behaves.

One simple hack is to change _world to an implementation that uses a thread-local
and enables per-thread PGs.

It almost gets DDP working; the PG is still missing an implementation of all_reduce.

This enables notebook usage of PTD, which is a big deal for learning it:
https://gist.github.com/kumpera/32cb051fa26b8cad8bdf671f968dcd68

This change ensures BC by keeping the global variables around and having the default _World wrap them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86348
Approved by: https://github.com/rohan-varma
2022-10-13 22:23:28 +00:00
66979fbfaa Improve complex lerp performance (#84844)
The complex lerp kernel uses `std::abs(z) < 0.5`, which involves computing a sqrt.
Comparing the squared magnitude against 0.25 instead has much lower latency and so
performs much better overall.

In a simple timeit benchmark I see more than 10x speedup on CPU for a 4096
element complex lerp, from 84 us to 6.7 us.
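
The trick is the standard one of comparing squared magnitudes instead of magnitudes; a small Python illustration of the equivalence (the actual change lives in the C++ kernel):

```python
def below_half(w: complex) -> bool:
    # Equivalent to abs(w) < 0.5 but avoids the sqrt:
    # |w| < 0.5  <=>  |w|^2 < 0.25, and |w|^2 is just re^2 + im^2.
    return w.real * w.real + w.imag * w.imag < 0.25

w = 0.3 + 0.1j
assert below_half(w) == (abs(w) < 0.5)
```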
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84844
Approved by: https://github.com/ngimel
2022-10-13 21:56:37 +00:00
ae45dab57e disable failing circleci test jobs (#86940)
should revert later when fixed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86940
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2022-10-13 21:27:52 +00:00
974ad8fa6c Add BFloat16 dtype support for oneDNN Graph JIT fuser (#85591)
## BFloat16 dtype support for faster inference with TorchScript using oneDNN Graph

Intel Xeon Cooper Lake platform & beyond support the `AVX512_BF16` ISA, which is essentially native BFloat16 support.
oneDNN Graph delivers high inference performance with BFloat16 on such machines.

While oneDNN Graph can still be used with BFloat16 on older machines that lack `avx512_bf16` ISA but support `avx512bw`, `avx512vl` & `avx512dq` ISAs, the BF16 performance on these older machines will be significantly poorer (probably even poorer than Float32), as they lack native BF16 support.

Currently, [AMP support for eager mode & JIT mode is divergent in PyTorch](https://github.com/pytorch/pytorch/issues/75956).
So, for using oneDNN Graph with BFloat16, eager-mode AMP should be leveraged by turning off AMP for JIT mode, using `torch._C._jit_set_autocast_mode(False)` in python code, so as to avoid conflicts.

Please use the following environment variable to view JIT logs -
`PYTORCH_JIT_LOG_LEVEL=">>graph_helper:>>graph_fuser:>>kernel:>>interface"`

## Changes being made in this PR
1. This PR does NOT change the `oneDNN` commit or the `ideep` files. While the `ideep` commit is being updated, only files pertaining to oneDNN Graph are being updated. oneDNN Graph is being upgraded to version 0.5.2 (alpha patch release 2).
To put things into perspective, `ideep` is a git submodule of PyTorch. `oneDNN Graph` is a git submodule of `ideep` (`ideep/mkl-dnn`), and oneDNN is a git submodule of oneDNN Graph (`ideep/mkl-dnn/third_party/oneDNN`).
2. Unit-tests are being updated. We now use the [existing dtypes decorator](https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_device_type.py#L123-L131).
3. Suggestions made by @eellison in the [FP32 PR](https://github.com/pytorch/pytorch/pull/68111#pullrequestreview-896719477) are being incorporated/addressed -

| Action-item | Status |
| :---                                             |          ---: |
|checkInputCompatibility follow up | Fixed |
|the mayConvertScalarInputToTensor logic we can consider | Added type promotion code |
|fix up fixConvOptionalBias| The current approach seems correct |
|Use opinfo tests| using dtypes decorator. Will use `OpInfo` in a subsequent PR, if that'd be possible. Should we create a list of ops from opDB that are supported by oneDNN Graph, and add it to `common_methods_invocations.py`? |
|inferDevice torch_check call | not necessary now, perhaps, as only CPU is supported, for now? We'd add it by the beta release of oneDNN Graph, though, so that by then, users might be able to use other fusers with oneDNN Graph (NNC/TensorExpr are already compatible with the oneDNN Graph fuser). We can still add it, if you'd insist. |
|not checking shapes of input mkldnn tensor to llga guard | Those checks should not be present because oneDNN Graph may use blocked or channels-last layout, so those strides would be different. They're only skipped if an LLGA subgraph's output is input to another LLGA subgraph, which enables LLGA to choose an optimal layout between them. |
|fix test failures with respect to unsupported inputs | We'll address them with the upcoming release of oneDNN Graph beta version|

4. More PyTorch ops are being been mapped to oneDNN Graph

## Example of using oneDNN Graph with BFloat16

```python
# Assuming we have a model of the name 'model'

example_input = torch.rand(1, 3, 224, 224)

# enable oneDNN Graph
torch.jit.enable_onednn_fusion(True)
# Disable AMP for JIT
torch._C._jit_set_autocast_mode(False)
with torch.no_grad(), torch.cpu.amp.autocast():
    model = torch.jit.trace(model, (example_input))
    model = torch.jit.freeze(model)
     # 2 warm-ups (2 for tracing/scripting with an example, 3 without an example)
    model(example_input)
    model(example_input)

    # speedup would be observed in subsequent runs.
    model(example_input)
```

## TorchBench based Benchmarks
**URL:** https://github.com/sanchitintel/benchmark/tree/onednn_graph_benchmark (instructions present at URL).
**Batch-size(s):** TorchBench-default for each model
**Baseline :** PyTorch JIT OFI FP32
**Machine:** Intel(R) Xeon(R) Platinum 8371HC (Cooper Lake)
**Sockets used**: 1
**Number of cores on one socket**: 26
Intel OpenMP & tcmalloc were preloaded

#### Benchmark results with single thread
| name                                             | latency of PyTorch JIT OFI FP32 (s) |   Latency of oneDNN Graph BF16 (s) |   % change |
| :---                                             |          ---: |            ---: |       ---: |
| test_eval[alexnet-cpu-jit]                       |      1.063851 |        0.509820 |     -52.1% |
| test_eval[mnasnet1_0-cpu-jit]                    |      0.218435 |        0.107100 |     -51.0% |
| test_eval[mobilenet_v2-cpu-jit]                  |      0.114467 |        0.058359 |     -49.0% |
| test_eval[mobilenet_v3_large-cpu-jit]            |      0.233873 |        0.117614 |     -49.7% |
| test_eval[resnet18-cpu-jit]                      |      0.160584 |        0.075854 |     -52.8% |
| test_eval[resnet50-cpu-jit]                      |      1.652846 |        0.713373 |     -56.8% |
| test_eval[resnext50_32x4d-cpu-jit]               |      0.471174 |        0.209431 |     -55.6% |
|test_eval[shufflenet_v2_x1_0-cpu-jit] | 0.310306 | 0.167090 | -46.2% |
| test_eval[squeezenet1_1-cpu-jit]                 |      0.161247 |        0.045684 |     -71.7% |
| test_eval[timm_efficientnet-cpu-jit]             |      1.643772 |        0.800099 |     -51.3% |
| test_eval[timm_regnet-cpu-jit]                   |      5.732272 |        2.333417 |     -59.3% |
| test_eval[timm_resnest-cpu-jit]                  |      1.366464 |        0.715252 |     -47.7% |
| test_eval[timm_vision_transformer-cpu-jit]       |      0.508521 |        0.271598 |     -46.6% |
| test_eval[timm_vovnet-cpu-jit]                   |      2.756692 |        1.125033 |     -59.2% |
| test_eval[vgg16-cpu-jit]                         |      0.711533 |        0.312344 |     -56.1% |

#### Benchmark results with 26 threads:
| name                                             | Latency of PyTorch JIT OFI FP32 (s) |   Latency of oneDNN Graph BF16 (s) |   % change |
| :---                                             |          ---: |            ---: |       ---: |
| test_eval[alexnet-cpu-jit]                       |      0.062871 |        0.034198 |     -45.6% |
| test_eval[mnasnet1_0-cpu-jit]                    |      0.022490 |        0.008172 |     -63.7% |
| test_eval[mobilenet_v2-cpu-jit]                  |      0.012730 |        0.005866 |     -53.9% |
| test_eval[mobilenet_v3_large-cpu-jit]            |      0.025948 |        0.010346 |     -60.1% |
| test_eval[resnet18-cpu-jit]                      |      0.011194 |        0.005726 |     -48.9% |
| test_eval[resnet50-cpu-jit]                      |      0.124662 |        0.045599 |     -63.4% |
| test_eval[resnext50_32x4d-cpu-jit]               |      0.034737 |        0.015214 |     -56.2% |
|test_eval[shufflenet_v2_x1_0-cpu-jit] | 0.028820 | 0.012517 | -56.6% |
| test_eval[squeezenet1_1-cpu-jit]                 |      0.012557 |        0.003876 |     -69.1% |
| test_eval[timm_efficientnet-cpu-jit]             |      0.203177 |        0.051879 |     -74.5% |
| test_eval[timm_regnet-cpu-jit]                   |      0.452050 |        0.151113 |     -66.6% |
| test_eval[timm_resnest-cpu-jit]                  |      0.117072 |        0.052848 |     -54.9% |
| test_eval[timm_vision_transformer-cpu-jit]       |      0.046048 |        0.023275 |     -49.5% |
| test_eval[timm_vovnet-cpu-jit]                   |      0.213187 |        0.077482 |     -63.7% |
| test_eval[vgg16-cpu-jit]                         |      0.044726 |        0.021998 |     -50.8% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85591
Approved by: https://github.com/jgong5, https://github.com/frank-wei, https://github.com/chunyuan-w
2022-10-13 20:36:59 +00:00
14dd5db2f5 [fsdp] Fix test for 2d parallel integration to trigger the load hooks. (#86272)
nit: replaced empty array bool test with explicit test for its length.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86272
Approved by: https://github.com/awgu
2022-10-13 20:28:44 +00:00
18f58e2df1 [quant][be] Rename node_name_to_target_dtype to node_name_to_target_dtype_info (#86860)
Summary:
att, renaming to improve readability

Test Plan:
no functionality changes

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86860
Approved by: https://github.com/jcaip
2022-10-13 20:24:05 +00:00
158a071034 add _freeze for embedding op (#86769)
Fixes #86663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86769
Approved by: https://github.com/albanD
2022-10-13 20:12:52 +00:00
e737f2d81c set the correct size of aten tensor in presence of mkldnn padding (#86767)
This fixes https://github.com/pytorch/pytorch/issues/86556
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86767
Approved by: https://github.com/eellison
2022-10-13 19:35:31 +00:00
860ad04990 [ONNX] Fix FindCommonAncestor in function_extraction (#86650)
One-line fix to take the absolute value of `diff` before looping over it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86650
Approved by: https://github.com/AllenTiTaiWang, https://github.com/abock
2022-10-13 18:33:32 +00:00
af1dcef79c [ONNX] Fix triu/tril export with diagonal input (#86843)
Investigation with @thiagocrepaldi uncovered this bug in triu/tril export when
`diagonal` is passed in as an input. Previously, the assumption was made that `diagonal`
is always provided as a constant value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86843
Approved by: https://github.com/thiagocrepaldi, https://github.com/abock
2022-10-13 18:09:37 +00:00
dbdfb8dd8b Skip test_nvfuser_extremal_values for native_batch_norm (#86897)
New tests were introduced with 68a6113248.
This PR explicitly skips the problematic tests.
Fixes https://github.com/pytorch/pytorch/issues/86176
Fixes https://github.com/pytorch/pytorch/issues/86177
Fixes https://github.com/pytorch/pytorch/issues/86178
Fixes https://github.com/pytorch/pytorch/issues/86179
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86897
Approved by: https://github.com/soulitzer
2022-10-13 18:09:00 +00:00
2ce6150d23 [ONNX] Fix scalar_type_analysis metadata for copied constant (#86716)
Fix the source of metadata for the copied constant. Since the constant is being implicitly cast,
it makes more sense to take the code location etc. from the user node.
This issue was discovered in https://github.com/pytorch/pytorch/issues/86627. This PR also adds unit test coverage for scope
information of nodes when they are altered by CSE and related passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86716
Approved by: https://github.com/thiagocrepaldi, https://github.com/malfet
2022-10-13 18:01:44 +00:00
4839f73f32 Fix incorrect tensor storage check (#86845)
Fix incorrect tensor storage check

The change in https://github.com/pytorch/pytorch/pull/86557 contained an incorrect check for storage:
**self.storage is not None**
should have been:
**not torch._C._has_storage(self)**

This fix was run through the DirectML test suite, which confirms the check now works correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86845
Approved by: https://github.com/martinb35, https://github.com/bdhirsh
2022-10-13 17:54:28 +00:00
afc9963865 Fix path to nested_tensor in example (#86891)
This appears to be a typo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86891
Approved by: https://github.com/H-Huang
2022-10-13 17:42:32 +00:00
54ee95c8ec [nn] module: full_backward_pre_hook (#86700)
Fixes https://github.com/pytorch/pytorch/issues/42824

* [x] Test
* [x] Doc
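A usage sketch, assuming the hook is exposed as `register_full_backward_pre_hook` (per the PR title); the hook runs before the module's gradients are computed and receives the incoming `grad_output`:

```python
import torch

def hook(module, grad_output):
    # Inspect (or replace, by returning new tensors) the incoming gradients.
    print("grad_output shapes:", [g.shape for g in grad_output])

lin = torch.nn.Linear(4, 2)
handle = lin.register_full_backward_pre_hook(hook)
lin(torch.randn(3, 4)).sum().backward()
handle.remove()
```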
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86700
Approved by: https://github.com/soulitzer
2022-10-13 17:36:39 +00:00
7dcfbedce0 Fix LinearLR scheduler start_factor (#86695)
Fixes #86454

The `start_factor` must lie in (0, 1] instead of [0, 1] to avoid division by zero. This PR changes the lower-limit check for the parameter; a sketch of the tightened bound is shown below.
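A minimal sketch of the check described above; the helper name is illustrative, and the real check lives in `LinearLR.__init__`:

```python
def check_start_factor(start_factor: float) -> None:
    # (0, 1]: a start_factor of 0 would later cause a division by zero in the LR ramp.
    if start_factor <= 0 or start_factor > 1:
        raise ValueError(f"start_factor expected to be in (0, 1], but got {start_factor}")

check_start_factor(0.5)    # ok
# check_start_factor(0.0)  # would raise ValueError
```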

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86695
Approved by: https://github.com/albanD
2022-10-13 17:31:36 +00:00
6ee94b572a [functorch] Add shard to run functorch tests with asan (#82164)
This adds ASAN testing for functorch. It was running very long (>4 hrs) with the op tests, so we decided those tests are probably redundant and skipped them. This brings this shard's time down to ~30 min.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82164
Approved by: https://github.com/zou3519, https://github.com/malfet, https://github.com/huydhn
2022-10-13 17:26:56 +00:00
427e0a6b4e [cuDNN] Enable cuDNN Frontend v8 API by Default (#84948)
#58414

Opening this PR for testing for now to check CI status. 🤞

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84948
Approved by: https://github.com/ngimel
2022-10-13 17:26:36 +00:00
b0d80f4355 [ONNX] Clarify phrasing of skipScriptTest/skipTraceTest decorators (#86216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86216
Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/abock
2022-10-13 17:20:35 +00:00
0ee0999608 [ONNX] Renable assert diagnostic test (#85999)
Fix to properly clear 'background_context' of export diagnostic 'engine' in `clear`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85999
Approved by: https://github.com/abock
2022-10-13 17:19:36 +00:00
cff333bdb5 Enable max.unary_out (#86855)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86855
Approved by: https://github.com/jerryzh168, https://github.com/bdhirsh
2022-10-13 17:14:53 +00:00
25811663af [FSDP] restricts meta model check to non ignored modules in FSDP (#86766)
Summary: as title

Test Plan: see test plan D40287799

Differential Revision: D40287890

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86766
Approved by: https://github.com/awgu
2022-10-13 16:48:24 +00:00
ab69550678 Add nested squeeze.dim and unsqueeze (#86813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86813
Approved by: https://github.com/drisspg
2022-10-13 16:05:36 +00:00
e531cf7b2e [ao] fixing public v private for fx.backend_config_utils.py (#86037)
Summary: just added a missing function to __all__

Test Plan: python test/test_public_bindings.py

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86037
Approved by: https://github.com/jerryzh168
2022-10-13 16:04:42 +00:00
d169f950da Revert "Use CUTLASS GEMM for NT bmm [OSS-only] (#85894)"
This reverts commit ef58a132f223d5abf2bd3f8bee380aca6c29d17f.

Reverted https://github.com/pytorch/pytorch/pull/85894 on behalf of https://github.com/DanilBaibak due to Break internal build
2022-10-13 15:28:09 +00:00
b97ae59e29 Change legacy wrap_dim to work with symint == (#86842)
- Previously, `sizes == vector<T>({0})` failed to hit `SymInt::operator==`, causing the loop to bail out too early and make an invalid call to the downstream `maybe_wrap_dim` helper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86842
Approved by: https://github.com/Chillee, https://github.com/malfet, https://github.com/albanD
2022-10-13 15:10:46 +00:00
3d9fd060f4 [functorch] Add more details to the functorch install page (#86823)
Added some details about:
- `pip uninstall functorch` being helpful if there are problems
- `pip install functorch` still working for BC reasons.

Test Plan:
- wait for docs preview
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86823
Approved by: https://github.com/samdow
2022-10-13 14:53:04 +00:00
cbc01c4344 OpInfo: Sample input cleanup (2/n) (#86379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86379
Approved by: https://github.com/mruberry
2022-10-13 14:50:03 +00:00
2efc56d9d7 OpInfo: Sample input cleanup (1/n) (#86231)
This rewrites various sample and error input functions to:
- use the convention of `make_arg = functools.partial(make_tensor, ...)`
- use the new natural syntax for `SampleInput` construction
- yield instead of returning lists, to reduce memory consumption (see the sketch below)
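A sketch of these conventions; `sample_inputs_foo` and the shapes are illustrative, not functions actually touched by this PR:

```python
import functools
import torch
from torch.testing import make_tensor
from torch.testing._internal.common_methods_invocations import SampleInput

def sample_inputs_foo(op_info, device, dtype, requires_grad, **kwargs):
    make_arg = functools.partial(
        make_tensor, device=device, dtype=dtype, requires_grad=requires_grad
    )
    # Yield one SampleInput at a time instead of building and returning a list.
    yield SampleInput(make_arg(3, 3))
    yield SampleInput(make_arg(3, 3), args=(1,))
```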
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86231
Approved by: https://github.com/mruberry
2022-10-13 14:50:03 +00:00
45274c56a4 [ONNX] Partially re-enable RoiAlign and RoiPool unit tests (#86169)
This PR depends on https://github.com/pytorch/vision/pull/6685

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86169
Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/abock
2022-10-13 14:39:44 +00:00
e17732b234 [test] add cross-ref tests for python meta kernels (#86228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86228
Approved by: https://github.com/albanD
2022-10-13 14:14:26 +00:00
0feccda7d7 fix aliasing bug in pixel shuffle/unshuffle (#86608)
Fixes https://github.com/pytorch/pytorch/issues/82235

cc @albanD - `at::pixel_shuffle` and `at::pixel_unshuffle` advertise as being non-aliasing, but they have a C++ decomposition that internally uses reshape(), which means that it might return an alias.

I happened to notice this because a bunch of tests in `test/test_ops.py` failed when I ran locally with a `DEBUG=1` build.

(P.S.: when are we finally gonna get a debug build test in CI? 😃)

I fixed by adding an extra clone, which... is going to be an unnecessary perf hit in the case where the `reshape()` already properly cloned the input. My hope is that this is fine, because this only impacts the composite kernel- we already have a "fast" CPU kernel that does the right thing. Is `pixel_shuffle/unshuffle` commonly used with cuda? Maybe we should just add a fast cuda kernel for it if that's the case.

Alternatively, it seems like it would be nice if `reshape()` accepted an optional argument to unconditionally return a copy. That seems like a rabbit hole that isn't worth going down for now though - I remember a discussion a while ago about making `reshape()` copy-on-write
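A small illustration (not taken from the PR) of why the extra clone matters: `reshape()` is allowed to return a view of its input.

```python
import torch

x = torch.arange(6)
y = x.reshape(2, 3)
print(y._base is x)          # True here: y aliases x
z = x.reshape(2, 3).clone()  # the extra clone guarantees a fresh, non-aliasing tensor
print(z._base is None)       # True
```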

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86608
Approved by: https://github.com/albanD
2022-10-13 14:14:26 +00:00
3376050543 fix type promotion for group_norm composite C++ kernel (#86607)
python decomp for `native_group_norm` is correct in more cases than the C++ composite. Updating the tests to fail properly in this case was more annoying than just fixing the C++ decomp, so I fixed it here.

When the input tensor had a dtype with less precision than float32, the C++ decomp would unconditionally set the mean/variance to float32, which was wrong.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86607
Approved by: https://github.com/albanD
2022-10-13 14:14:22 +00:00
6907db3f95 fix aliasing for primtorch view meta kernels (#86285)
Fixes https://github.com/pytorch/pytorch/issues/86284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86285
Approved by: https://github.com/albanD, https://github.com/mruberry
2022-10-13 14:14:20 +00:00
77e68b16cc suggest rebasing through @pytorchbot if PR is stale (#86898)
Summary:

Test Plan: Testing on GitHub with `stale_pr_days` set to zero.

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86898
Approved by: https://github.com/malfet
2022-10-13 14:04:08 +00:00
8fffb79771 Add vmap support for slogdet; fix regression from functorch 0.2.1 (#86815)
This PR adds vmap support for slogdet -- slogdet just decomposes into
linalg.slogdet.

This fixes a regression from functorch 0.2.1 (slogdet had a batching
rule then, and doesn't anymore). We didn't catch the regression because
it seems like slogdet doesn't have an OpInfo (I'm not sure if it had one
before).
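A small usage sketch of the restored behavior (shapes are illustrative; `vmap` is imported via the functorch entry point of that release):

```python
import torch
from functorch import vmap

x = torch.randn(4, 3, 3)
sign, logabsdet = vmap(torch.slogdet)(x)
print(sign.shape, logabsdet.shape)  # torch.Size([4]) torch.Size([4])
```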

Test Plan:
- new one-off test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86815
Approved by: https://github.com/samdow
2022-10-13 14:03:22 +00:00
77d94ac5ab Sets CUDA_MODULE_LOADING to LAZY when not set by the user (#85692)
This PR sets CUDA_MODULE_LOADING if it's not set by the user. By default, it sets it to "LAZY".
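In Python terms the new default is roughly equivalent to the following (a sketch; the actual change is in C++ so that libtorch users benefit as well, and an explicit user setting always wins):

```python
import os

# Only applied when the user has not set the variable themselves.
os.environ.setdefault("CUDA_MODULE_LOADING", "LAZY")

import torch  # must happen after the variable is set, before CUDA is initialized
```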

It was tested using the following commands:
```
python -c "import torch; tensor=torch.randn(20, 16, 50, 100).cuda(); free, total = torch.cuda.cudart().cudaMemGetInfo(0); print(total-free)"
```
which shows a memory usage of: 287,047,680 bytes

vs

```
CUDA_MODULE_LOADING="DEFAULT" python -c "import torch; tensor=torch.randn(20, 16, 50, 100).cuda(); free, total = torch.cuda.cudart().cudaMemGetInfo(0); print(total-free)"
```
which shows 666,632,192 bytes.

C++ implementation is needed for the libtorch users (otherwise it could have been a pure python functionality).

cc: @ptrblck @ngimel @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85692
Approved by: https://github.com/malfet
2022-10-13 14:03:01 +00:00
30a8a87c80 Fix autogen for _ctc_loss.Tensor (#86871)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86871
Approved by: https://github.com/larryliu0820
2022-10-13 07:23:13 +00:00
dc6ce1485e Use Variable Size Indices in Sparse Qlinear Code (#85247)
Final changes to enable sparse weight packing with variable size indices

pack_block_sparse.cc is deleted because all functions in it have a template added, so they are moved to pack_block_sparse.h

Differential Revision: [D39025651](https://our.internmc.facebook.com/intern/diff/D39025651/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39025651/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85247
Approved by: https://github.com/digantdesai
2022-10-13 05:50:04 +00:00
d3afd49c85 Enable 16bit and 8bit Row/Col Indices in Qnnpack Fully Connected Sparse Op (#85246)
This diff enables using the 16bit and 8bit kernels added in the previous diff.

(This change used to be in D38954842 v11 but was moved into its own diff)

Differential Revision: [D39403164](https://our.internmc.facebook.com/intern/diff/D39403164/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85246
Approved by: https://github.com/kimishpatel
2022-10-13 05:46:52 +00:00
6c6e06619f Add 16bit and 8bit row/col indices q8gemm sparse kernels (#85245)
TLDR: see D39003528 to see the actual changes in this diff more clearly, which will make reviewing easier

___

The 32bit versions were changed to be created with a macros which are also used to create 16bit and 8bit versions

This diff shows that almost all of the lines in the .s files were modified, but most changes are just adding spaces to the front and ;/ to the end so they can be contained in the macro. To generate these changes, I first wrote the macros without the spaces and ;/, and then I ran a script (see the python file in D39003528) to get the final version.

To review this diff more easily, if you want to see the code changes before I ran the script, which makes it much easier to see which lines were changed, see D39003528.

Each version of this diff is synched with the same number version of that diff (so if I change this diff I will mirror the changes to the same version on that diff)

Differential Revision: [D39003527](https://our.internmc.facebook.com/intern/diff/D39003527/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85245
Approved by: https://github.com/kimishpatel
2022-10-13 05:43:51 +00:00
6c6a32c223 Enable Running Variable Size Row/Col Indices q8gemm Sparse Kernels in QNNPACK (#85244)
For aarch32 and aarch64, the 16bit and 8bit versions of the kernels are left empty. I will be adding them in a future diff (D39003527) to avoid having this diff be too cluttered.

Differential Revision: [D38954842](https://our.internmc.facebook.com/intern/diff/D38954842/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85244
Approved by: https://github.com/kimishpatel
2022-10-13 05:40:09 +00:00
4c0e1dc980 Update Qnnpack Fully Connected Sparse Op to Store Variable Size Indices (#85243)
Only uint32_t is supported for now, but uint16_t and uint8_t support will be added in future diffs.

Differential Revision: [D38828545](https://our.internmc.facebook.com/intern/diff/D38828545/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85243
Approved by: https://github.com/kimishpatel
2022-10-13 05:03:07 +00:00
1a87c25fe1 Add functorch shard to sm86-periodic workflow (#86820)
After https://github.com/pytorch/pytorch/pull/86799 was landed there shouldn't be a need to increase tolerances

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86820
Approved by: https://github.com/zou3519
2022-10-13 04:25:41 +00:00
cb4867a71a Make ASGD & RProp differentiable (#86258)
Blocked by #86183
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86258
Approved by: https://github.com/albanD
2022-10-13 04:06:13 +00:00
5224906749 Spread distributed backends among all distributed shards (#86837)
So that they can be run in parallel without stepping on each other toe
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86837
Approved by: https://github.com/clee2000
2022-10-13 03:31:33 +00:00
48c648d75d Fix typo TORCH_ONLY_METHOD_OPERATORS -> TORCH_ASSERT_ONLY_... (#86661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86661
Approved by: https://github.com/malfet
2022-10-13 03:12:59 +00:00
67fbd940ba [ao] fixing public v private for fx.quantization_types (#86036)
Summary: this file doesn't actually exist anymore so its just a case of
removing the exception for it

Test Plan: python test/test_public_bindings.py

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86036
Approved by: https://github.com/jerryzh168
2022-10-13 01:57:16 +00:00
b00cdb5b34 [ao] fixing public v private for quantization_patterns.py (#86034)
Summary: no significant changes, just added __all__

Test Plan: python test/test_public_bindings.py

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86034
Approved by: https://github.com/jerryzh168
2022-10-13 01:57:00 +00:00
77d29bcee2 [primTorch] special: ndtr, ndtri, log_ndtr, erfcx (#86077)
- Adds prims and _refs for `erfcx` and `ndtri`.
- Adds _refs for `ndtr`, and `log_ndtr`.

cc @kshitij12345 @lezcano @mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86077
Approved by: https://github.com/mruberry
2022-10-13 01:18:30 +00:00
ea586c0579 Fix up cond a bit to make it work w/ fake tensor (#86727)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86727
Approved by: https://github.com/zou3519
2022-10-13 00:54:17 +00:00
2a75152537 [easy] Add nested tanh (#86826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86826
Approved by: https://github.com/cpuhrsch
2022-10-13 00:48:08 +00:00
b79bac0e4d Make the data types of output and input consistenst for batchnorm (#84410)
The TTS model will crash due to the following issue: when the input of BN is not contiguous and its data type differs from that of the parameters, BN raises `RuntimeError: !needs_dynamic_casting<func_t>::check(iter) INTERNAL ASSERT FAILED at "xxx/pytorch/aten/src/ATen/native/cpu/Loops.h":311, please report a bug to PyTorch`.

Make the data types of the output and input consistent for batchnorm to fix the issue; a rough repro sketch follows.
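A rough sketch of the scenario described (the shapes, dtype, and the slicing used to make the input non-contiguous are illustrative):

```python
import torch

bn = torch.nn.BatchNorm2d(3).eval()                         # float32 parameters
x = torch.randn(2, 6, 4, 4, dtype=torch.bfloat16)[:, ::2]   # non-contiguous bfloat16 input
with torch.no_grad():
    out = bn(x)
print(out.dtype)  # with the fix, the output dtype matches the input dtype
```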

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84410
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2022-10-13 00:42:46 +00:00
c2f29e75cd [flakybot] add dynamo as platform (#86701)
corresponding pr in test-infra https://github.com/pytorch/test-infra/pull/874
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86701
Approved by: https://github.com/huydhn
2022-10-13 00:42:40 +00:00
9470059766 Allow viable/strict promotion even if periodic or docker-release-builds jobs are failing (#86827)
Allow `viable/strict` promotion even if `periodic` or `docker-release-builds` jobs are failing

**Why?** Those jobs only run occasionally, and for all we know the current viable/strict commit may already include the errors that the above cron-based workflows may have later detected. Blocking the viable/strict upgrade because of these scheduled jobs doesn't really offer any value; it just leads to people getting older PRs when they try to fork off of viable/strict, without guaranteeing an improvement in test quality.

Though frankly, the current situation is worse than that.

Assume the branch history looks like A -> B

A is the current `viable/strict` commit
B is a commit that failed some `periodic` test, so `viable/strict` wasn't upgraded to B

Now lets say there's a commit C that gets merged. C neither contains a fix for the failing periodic build, nor does a scheduled periodic workflow run against C. The branch becomes A -> B -> C

In the above scenario, today we will promote `viable/strict` to C since there was no failing workflow there!!! Even though it didn't actually fix what was broken with B!

In short, avoiding the upgrade to B really doesn't make any sense today and we shouldn't do it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86827
Approved by: https://github.com/janeyx99
2022-10-13 00:38:48 +00:00
66cab5245f Reland 2 min/max support for SymInt/Floats, finish as_strided/scatter/squeeze() backward symint support (#86797)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86797
Approved by: https://github.com/bdhirsh
2022-10-13 00:31:19 +00:00
894c4218dd ci: Just use regular checkout (#86824)
checkout-pytorch seems to have issues and is purpose made for our PR
testing and appears to conflict with what we're trying to do for binary
builds.

For builds like
https://github.com/pytorch/pytorch/actions/runs/3207520052/jobs/5242479607
there is a confusion over where the reference is pulled and I believe it is
root caused by the checkout logic in checkout-pytorch.

So with that in mind I suggest we just use the upstream checkout action
for this job

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86824
Approved by: https://github.com/atalman
2022-10-13 00:24:02 +00:00
aacb9f3ac6 Make Adadelta,Adagrad & Adamax differentiable (#86096)
Continuing the differentiable optimizers support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86096
Approved by: https://github.com/janeyx99
2022-10-12 23:16:29 +00:00
e552cf1050 [DOC] Use type hints to show annotation in the docs (#79086)
Fixes #44964

Use type hints in the code to show type annotations in the parameters section of the docs.

For the parameters already documented in the docstring, but lack the type annotation, the type hints from the code are used:

| [Before](https://pytorch.org/docs/master/generated/torch.nn.AdaptiveMaxPool1d.html) | [After](https://docs-preview.pytorch.org/79086/generated/torch.nn.AdaptiveMaxPool1d.html) |
| --- | --- |
| <img width="462" alt="image" src="https://user-images.githubusercontent.com/6421097/172954756-96d2d8a6-7df9-4c0f-ad34-c12912a5a740.png"> | <img width="479" alt="image" src="https://user-images.githubusercontent.com/6421097/172954770-a6ce2425-99a6-4853-ac2c-e182c3849344.png"> |

| [Before](https://pytorch.org/docs/master/generated/torch.nn.Linear.html) | [After](https://docs-preview.pytorch.org/79086/generated/torch.nn.Linear.html) |
| --- | --- |
| <img width="482" alt="image" src="https://user-images.githubusercontent.com/6421097/172954992-10ce6b48-44a2-487e-b855-2a15a50805bb.png"> | <img width="471" alt="image" src="https://user-images.githubusercontent.com/6421097/172954839-84012ce6-bf42-432c-9226-d3e81500e72d.png"> |

Ref:
- PR https://github.com/pytorch/pytorch/pull/49294 removed type annotations from signatures in HTML docs.
- Sphinx version was bumped to 5.0.0 in PR #70309
- Duplicated (closed) issues: #78311 and #77501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79086
Approved by: https://github.com/malfet
2022-10-12 22:31:48 +00:00
a77f2a95a7 Improve NestedTensor documentation (#85186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85186
Approved by: https://github.com/cpuhrsch
2022-10-12 22:03:04 +00:00
be81f3d8d4 Revert distributed test parallelization (#86756)
Revert an old commit and resolve some conflicts

Fixes https://github.com/pytorch/pytorch/issues/86418
Fixes https://github.com/pytorch/pytorch/issues/86419
Fixes https://github.com/pytorch/pytorch/issues/86415
Fixes https://github.com/pytorch/pytorch/issues/86420
Fixes https://github.com/pytorch/pytorch/issues/86416
Fixes https://github.com/pytorch/pytorch/issues/86392
Fixes https://github.com/pytorch/pytorch/issues/86391
Fixes https://github.com/pytorch/pytorch/issues/86397
Fixes https://github.com/pytorch/pytorch/issues/86390
Fixes https://github.com/pytorch/pytorch/issues/86398
Fixes https://github.com/pytorch/pytorch/issues/86396
Fixes https://github.com/pytorch/pytorch/issues/86395
Fixes https://github.com/pytorch/pytorch/issues/86393
Fixes https://github.com/pytorch/pytorch/issues/86394
Fixes https://github.com/pytorch/pytorch/issues/86440
Fixes https://github.com/pytorch/pytorch/issues/86442
Fixes https://github.com/pytorch/pytorch/issues/86439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86756
Approved by: https://github.com/mrshenli
2022-10-12 21:17:28 +00:00
09a676f639 Add hooks for register_buffer/module/parameter (#86148)
As described in the issue, this PR adds hooks to be run when `register_parameter`, `register_buffer` and `register_module` are called.

Fixes #85837

cc @albanD @mruberry @jbschlosser @walterddr @kshitij12345 @saketh-are
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86148
Approved by: https://github.com/albanD
2022-10-12 20:57:22 +00:00
c08cbfccd9 Let retried jobs advance viable/strict (#86821)
Today, even if we retry a failed workflow and it succeeds on the retry, viable/strict doesn't advance forward.

Success on retry is proof that the error wasn't with the current commit and that we should in fact promote viable/strict. This PR points to an updated Rockset query which only looks at the success status of the most recent job in each workflow.

Here's the query edited:

Original query:
https://console.rockset.com/lambdas/details/commons.commit_jobs_batch_query/versions/15aba20837ae9d75?tab=sql

Updated query: https://console.rockset.com/lambdas/details/commons.commit_jobs_batch_query/versions/8003fdfd18b64696?tab=sql

Testing:
Tested the old and new query against commits known to have succeeded on retry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86821
Approved by: https://github.com/huydhn, https://github.com/malfet
2022-10-12 20:43:42 +00:00
3b26680222 Update _torch_docs / ldexp (#86721)
Fixes a typo on ldexp docstring.

https://pytorch.org/docs/master/generated/torch.ldexp.html?highlight=ldexp#torch.ldexp

<img width="976" alt="image" src="https://user-images.githubusercontent.com/2459423/195191117-15b4e1f3-dfd5-466c-b5aa-72851f0c2393.png">

https://livesphinx.herokuapp.com/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86721
Approved by: https://github.com/samdow
2022-10-12 20:33:14 +00:00
363b108e39 [quant][fx] Fix weight_dtype and bias_dtype backend_config checks (#86719)
Summary:
This PR adds checks for the existence of "weight_dtype" and "bias_dtype" in the node_name_to_dtype dictionary before accessing it,
the corner case is hit when we check the compatibility of qconfig and backend_config for weight and bias that appears before activation (e.g. torch.addmm)

Test Plan:
python test/test_quantization.py -k test_backend_config_check_for_weight_and_bias

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86719
Approved by: https://github.com/andrewor14
2022-10-12 20:20:02 +00:00
d6bfbdf50c [ao] fixing public v private for fx.pattern_utils.py (#86033)
Summary: added __all__. One issue with QuantizeHandler is that, since it's
defined as 'Any', it can't be set as a public module although it should
be. I've set it to private here, but when the circular dependency gets
fixed, this will probably be removed.

Test Plan: python test/test_public_bindings.py

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86033
Approved by: https://github.com/jerryzh168
2022-10-12 20:06:30 +00:00
bf0116d1f0 [ao] fixing public v private for fx.graph_module.py (#86032)
Summary: no significant changes, just added __all__

Test Plan: python test/test_public_bindings.py

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86032
Approved by: https://github.com/jerryzh168
2022-10-12 20:06:30 +00:00
25476f2e4b [ao] fixing public v private for quantization_types (#86031)
Summary: the main problem with this was that the different objects
defined simply as 'Any' should theoretically be public, but making them
public either A) results in an error about the module being 'typing'
rather than whatever module it should be, or B) requires setting the module
manually, thereby changing the module for the original 'Any' class.

note: QuantizeHandler has a similar issue where its simply defined as
'Any'

Pattern was defined in multiple places, which was causing issues, so I just moved it to a single
place, given the note at the top of quantization_types.py indicating
these definitions should be moved to utils at some point anyway.

Finally, I changed any references to these objects to point at the
correct locations. Note: I didn't see any fb-internal references to
NodePattern or QuantizerCls that would cause issues.

Test Plan: python test/test_public_bindings.py

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86031
Approved by: https://github.com/jerryzh168
2022-10-12 20:06:30 +00:00
ef58a132f2 Use CUTLASS GEMM for NT bmm [OSS-only] (#85894)
OSS-only copy of https://github.com/pytorch/pytorch/pull/85710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85894
Approved by: https://github.com/drisspg
2022-10-12 20:03:28 +00:00
73c43ce2e2 Display unexpected exceptions raised from test_dtypes (#86599)
Currently `test_dtypes` swallows all exceptions which can make debugging failures more tricky.
This changes the test to save the exceptions and print only the unexpected ones at the end e.g.
```
AssertionError: The supported dtypes for nn.functional._scaled_dot_product_attention on device type cuda are incorrect!
The following dtypes did not work in backward but are listed by the OpInfo: {torch.bfloat16}.
Unexpected failures raised the following errors:
torch.bfloat16 - CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling [...]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86599
Approved by: https://github.com/mruberry
2022-10-12 19:51:58 +00:00
6be9d9a630 Add AutocastHPU support (#84927)
New dispatch key and necessary functions are added to PyTorch. Backend implementation will be added in the external library.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84927
Approved by: https://github.com/bdhirsh
2022-10-12 19:37:16 +00:00
553eaaba7c Disable tf32 in functorch transform tests (#86799)
This PR applies a large hammer and disables TF32 in specific functorch transform tests. TF32 isn't precise enough to test correctness.

We could have applied a smaller hammer by disabling TF32 per-OpInfo, but that doesn't seem to have too much additional benefit (e.g. if a convolution batching rule is correct on fp32 then I would expect it to be correct under TF32 modulo precision issues because the actual sequence of PyTorch operators we invoke has not changed, only the backend did).
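The standard knobs for turning TF32 off look like this (a sketch; the PR wires this into the functorch test harness rather than into user code):

```python
import torch

torch.backends.cuda.matmul.allow_tf32 = False  # matmuls use full FP32
torch.backends.cudnn.allow_tf32 = False        # cuDNN convolutions use full FP32
```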

Test Plan:
- I tested this locally on a machine with A100 GPUs.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86799
Approved by: https://github.com/malfet
2022-10-12 19:27:17 +00:00
d56017a14f [primTorch] Add ref for triplet_margin_loss, improve triplet_margin_with_distance_loss (#85614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85614
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-12 18:37:58 +00:00
ce56ee11fd Extend torch.cuda.is_available() to attempt an NVML-based CUDA availability assessment when explicitly requested by the user (#85951)
Fixes #83973 (This is a substitute PR for https://github.com/pytorch/pytorch/pull/85024)

First of all, thanks for your invaluable contributions to PyTorch everyone!

Given how extensively `torch.cuda.is_available` is used in the PyTorch ecosystem, IMHO it's worthwhile to provide downstream libraries/frameworks/users the ability to alter the default behavior of `torch.cuda.is_available` in the context of their PyTorch usage.

I'm confident there are many current and future such use cases which could benefit from leveraging a weakened, NVML-based `torch.cuda.is_available` assessment at a downstream framework's explicit direction (thanks @malfet 81da50a972 !). Though one could always patch out the `torch.cuda.is_available` function with another implementation in a downstream library, I think this environmental variable based configuration option is more convenient and the cost to including the option is quite low.

As discussed in https://github.com/pytorch/pytorch/pull/85024#issuecomment-1261542045, this PR gates the new non-default NVML-based CUDA behavior behind an environment variable (PYTORCH_NVML_BASED_CUDA_CHK) that allows a user/framework to opt in to non-default, NVML-based `is_available()` assessments if desired (see the sketch below).
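A minimal opt-in sketch, assuming the variable is read when availability is queried (the variable name is taken from the PR description):

```python
import os

# Must be set before the availability check runs.
os.environ["PYTORCH_NVML_BASED_CUDA_CHK"] = "1"

import torch
print(torch.cuda.is_available())  # uses the weaker NVML-based assessment when opted in
```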

Thanks again for your work everyone!
@ngimel @malfet @awaelchli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85951
Approved by: https://github.com/ngimel
2022-10-12 18:37:50 +00:00
cd7c86eaa4 Add prims.clone (#86705)
This simple PR adds `clone` as a primitive.
Current implementation of `clone` is not supported with nvFuser executor because of `empty_like` + `copy_to`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86705
Approved by: https://github.com/mruberry
2022-10-12 18:22:00 +00:00
3356d0385f [BE] Store helper functions C++ for python API parity (#82136)
Add helper functions for `store.set()`, `store.compare_set()` to accept string arguments instead of vector<uint_8> and refactored some usages internally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82136
Approved by: https://github.com/rohan-varma
2022-10-12 17:49:38 +00:00
cc7ea93c2c [ONNX] Support device().type() string comparison with constant (#86168)
Fixes #86168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86168
Approved by: https://github.com/justinchuby, https://github.com/AllenTiTaiWang, https://github.com/abock
2022-10-12 17:25:45 +00:00
58542eb256 [ao] fixing public v private for backend_config.native.py (#86030)
Summary: no significant changes, just added some things to __all__

Test Plan: python test/test_public_bindings.py

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86030
Approved by: https://github.com/jerryzh168
2022-10-12 16:06:42 +00:00
409efebab8 Added define to fix issue with compatibility with latest Windows SDK (#85408)
Fixes #83820.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85408
Approved by: https://github.com/ezyang
2022-10-12 15:44:28 +00:00
f24d174fff Allow PrivateUse1 backends to not have Storage (#86557)
Allow PrivateUse1 backends to not have Storage

To unblock the DirectML backend, this change would be needed for 1.13 as well.

The DirectML backend creates tensors using the open registration pattern documented here: https://pytorch.org/tutorials/advanced/extend_dispatcher.html
[registration example](https://github.com/bdhirsh/pytorch_open_registration_example)

However, DirectML tensors are opaque, and do not have Storage.
The DirectML tensor impl derives from OpaqueTensorImpl, which does not have a storage. Because of this, various places in the code that expect storage to be present fail. We had made various changes in-tree to accommodate this:
a.	`def __deepcopy__(self, memo):`
[b5acba8895/torch/_tensor.py (L119)](https://github.com/pytorch/pytorch/blob/b5acba88959698d35cb548c78dd3fb151f85f28b/torch/_tensor.py#L119)
`or self.device.type in ["lazy", "xla", "mps", "ort", "meta", "hpu", 'dml']`
b.	`def _reduce_ex_internal(self, proto):`
[b5acba8895/torch/_tensor.py (L275)](https://github.com/pytorch/pytorch/blob/b5acba88959698d35cb548c78dd3fb151f85f28b/torch/_tensor.py#L275)
`if self.device.type in ["xla", "ort", "hpu", "dml"]:`
c.	`TensorIteratorBase::build` has an unsupported list for tensors without storage.
[b5acba8895/aten/src/ATen/TensorIterator.cpp (L1497)](https://github.com/pytorch/pytorch/blob/b5acba88959698d35cb548c78dd3fb151f85f28b/aten/src/ATen/TensorIterator.cpp#L1497)

With the PrivateUse1 backend, similar exemptions need to be made in order to relax the requirements on Storage so that DirectML backend tensors can work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86557
Approved by: https://github.com/bdhirsh, https://github.com/martinb35
2022-10-12 15:26:29 +00:00
61a5898675 use cff standard for citation information (#86200)
GH picks up on our `CITATION` file in the root of the repository.

![Screenshot from 2022-10-04 11-34-54](https://user-images.githubusercontent.com/6849766/193811617-b71ef606-a043-498b-bb2d-14b6c05e79e7.png)

However, [the preferred way](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files) is to use a `CITATION.cff` file instead, since GH supports the [citation file format (CFF) standard](https://github.com/citation-file-format/citation-file-format). With this PR, the prompt changes to

![Screenshot from 2022-10-04 13-48-21](https://user-images.githubusercontent.com/6849766/193812010-026bfad7-7c4e-4b59-a90a-1d3ad47303d0.png)

with the following auto-generated bibtex entry:

```bibtex
@inproceedings{Paszke_PyTorch_An_Imperative_2019,
author = {Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu and Bai, Junjie and Chintala, Soumith},
booktitle = {Advances in Neural Information Processing Systems 32},
pages = {8024--8035},
publisher = {Curran Associates, Inc.},
title = {{PyTorch: An Imperative Style, High-Performance Deep Learning Library}},
url = {http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf},
year = {2019}
}
```

Comparing with what we currently have the only significant difference is that the editors are no longer listed although the metadata is there. This is an issue with GH's automatic conversion and might be fixed in the future. Plus, the cite key was changed from `NEURIPS2019_9015` to `Paszke_PyTorch_An_Imperative_2019`, but this has no effect on the rendered result.

Do we also want to adopt the CFF standard?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86200
Approved by: https://github.com/dagitses
2022-10-12 13:03:48 +00:00
493ded249e [primTorch] decomposition for bucketize (#86366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86366
Approved by: https://github.com/mruberry
2022-10-12 12:25:42 +00:00
f903f1ab34 Patching getitem in partitioner (#86713)
1. Reject the getitem operator in the backend fusion query. getitem is merged in a special post-partition pass, so backends that take getitem shouldn't affect the logic.
2. Added a test for the failing cases.

Fixes #86698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86713
Approved by: https://github.com/SherlockNoMad
2022-10-12 07:50:46 +00:00
2344135179 [primTorch] special: entr, expit (#86592)
Add _refs for `entr` & `expit`.

cc @mruberry @kshitij12345!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86592
Approved by: https://github.com/mruberry
2022-10-12 07:00:40 +00:00
a47f93b6c9 Add type and shape annotation for gm.print_readable() (#86562)
For
```
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(a, b):
    dim0 = a.shape[0] + b.shape[0]
    dim1 = a.shape[1] + b.shape[1]
    d = a.new_empty(dim0, dim1)
    return d

fx_g = make_fx(f, tracing_mode="symbolic")(torch.randn(5, 3), torch.randn(4, 3))
fx_g.print_readable()
```

Tracing with 'real' and 'fake' mode yields
```
class f(torch.nn.Module):
    def forward(self, a_1: Tensor<f32>[5, 3], b_1: Tensor<f32>[4, 3]):

        # No stacktrace found for following nodes
        new_empty: Tensor<f32>[9, 6] = torch.ops.aten.new_empty.default(a_1, [9, 6], dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False);  a_1 = None
        return new_empty
```

Tracing with 'symbolic' mode yields
```
    def forward(self, a_1: Tensor<f32>[t0.size(0), t0.size(1)], b_1: Tensor<f32>[t1.size(0), t0.size(1)]):

        # No stacktrace found for following nodes
        sym_size: Symint(t0.size(0)) = torch.ops.aten.sym_size(a_1, 0)
        sym_size_1: Symint(t1.size(0)) = torch.ops.aten.sym_size(b_1, 0)
        add: Symint(t0.size(0) + t1.size(0)) = sym_size + sym_size_1;  sym_size = sym_size_1 = None
        sym_size_2: Symint(t0.size(1)) = torch.ops.aten.sym_size(a_1, 1)
        sym_size_3: Symint(t0.size(1)) = torch.ops.aten.sym_size(b_1, 1);  b_1 = None
        add_1: Symint(2*t0.size(1)) = sym_size_2 + sym_size_3;  sym_size_2 = sym_size_3 = None
        new_empty: Tensor<f32>[t0.size(0) + t1.size(0), 2*t0.size(1)] = torch.ops.aten.new_empty.default(a_1, [add, add_1], dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False);  a_1 = add = add_1 = None
        return new_empty
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86562
Approved by: https://github.com/Chillee
2022-10-12 05:39:54 +00:00
e0d6898cbd Revert "Backport currently dont work with some models if: (#86510)"
This reverts commit 4bfb7341819b3bfcaf65ddc136f25d23983740a7.

Reverted https://github.com/pytorch/pytorch/pull/86510 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-10-12 04:12:43 +00:00
25725fd624 (Re-open) Adds cudaMallocAsync as an alternative backend for the CUDA allocator (#82682)
Rebased version of @mcarilli 's cudaMallocAsync #65365 for continued testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82682
Approved by: https://github.com/ngimel
2022-10-12 03:44:21 +00:00
a216f4700c Add testing on A10G GPU to periodic workflow (#85524)
This enables testing on lots of modern CUDA features on sm_86 capable GPU

While migrating to that platform, discovered that `functorch` tests for `nn.functional.conv.transpose3d` produce garbage on sm_80+ as well as 2 `nvfuser` tests unexpectedly pass and one unexpectedly fails.

TODO:
 - Investigate unexpected success for `test_vmapvjp_linalg_householder_product_cuda_float32` and add `functorch` shard

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85524
Approved by: https://github.com/ngimel
2022-10-12 01:48:24 +00:00
c4f0b93f86 Disable autocast in aot autograd (#86515)
Fix for https://github.com/pytorch/torchdynamo/issues/1368

From comment:
> When we invoke a Composite Implicit autograd operator that has an autocast rule, such as Einsum,
autocast is disabled during its invocation. When we trace out the operators in an implicit op,
re-applying on autocast rules on those operators might yield divergence from what was executed at runtime.
This pass checks for divergence. If divergence is found, we will disable autocast.
We would like to avoid disabling autocast if possible because accessing TLS is slow.

Concretely, the problem found was when invoked `sum` in `einsum`:

As seen by the following divergence:
```
>>> with torch.cuda.amp.autocast(enabled=True):
...     print(torch.ops.aten.sum.dim_IntList(torch.rand([2, 2, 2], device="cuda", dtype=torch.half), [1, 2]).dtype)
...
torch.float32
>>> print(torch.ops.aten.sum.dim_IntList(torch.rand([2, 2, 2], device="cuda", dtype=torch.half), [1, 2]).dtype)
torch.float16
```

Edit: we've decided to accept the overhead of universally disabling autocast instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86515
Approved by: https://github.com/bdhirsh, https://github.com/Chillee
2022-10-12 01:43:35 +00:00
d598290baa Basic SDP benchmark harness (#86729)
Basic benchmark for reference and discussion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86729
Approved by: https://github.com/drisspg
2022-10-12 01:27:59 +00:00
4bfb734181 Backport currently dont work with some models if: (#86510)
Backport currently doesn't work with some models if:

* model is originally exported with interface call enabled (backport would disable it)
* model is flatbuffer (flatbuffer support is soft enabled via link time registry), so we manually trigger it

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86510
Approved by: https://github.com/cccclai
2022-10-12 00:39:25 +00:00
ce48df9e93 Re-enable torchdynamo unit tests (#86658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86658
Approved by: https://github.com/jansel
2022-10-12 00:37:14 +00:00
692b525b71 [MPS] Extend unary ops to int64 (#86615)
Most of them are already supported for `int64` except for:
 - rounding operations (`floor`, `ceil` and `round`), which are no-ops for integral types anyway
 - the sign operation, which can be emulated by clamping the tensor to the [-1, 1] range (see the small example below)
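A tiny illustration of the clamp-based emulation for integral tensors:

```python
import torch

x = torch.tensor([-5, -1, 0, 2, 7])
print(torch.sign(x))   # tensor([-1, -1,  0,  1,  1])
print(x.clamp(-1, 1))  # identical result for integer dtypes
```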

The new types are exercised by the MPS tests (test_mps).

Fixes https://github.com/pytorch/pytorch/issues/86319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86615
Approved by: https://github.com/DenisVieriu97, https://github.com/huydhn
2022-10-12 00:32:53 +00:00
f912b58544 Revert "Enable max.unary_out (#85926)"
This reverts commit 16a0fa1204edb118800261a26281e624988eb239.

Reverted https://github.com/pytorch/pytorch/pull/85926 on behalf of https://github.com/osalpekar due to The internal diff for this commit shows a number of pytorch quantization test failures. Here is a sample output: AssertionError: Tensor-likes are not close! Mismatched elements: 319 / 320 (99.7%). Greatest absolute difference: 0.056652069091796875 at index (0, 0, 4, 5) (up to 1e-05 allowed). Link to the diff: [D40232598](https://www.internalfb.com/diff/D40232598). Link to the Sandcastle job that is failing: https://www.internalfb.com/intern/sandcastle/job/18014399302908587/
2022-10-11 23:53:12 +00:00
2aa981ab74 Revert "Reland 2 of Merge more symbolic meta kernels and symint changes from branch (#86334) (#86488)"
This reverts commit 978b46d7c96627e3b3553ad70ad21cb161d05f90.

Reverted https://github.com/pytorch/pytorch/pull/86488 on behalf of https://github.com/osalpekar due to Broke executorch builds internally with the following message: RuntimeError: Missing out variant for functional op: aten::split.Tensor(Tensor(a -> *) self, SymInt split_size, int dim=0) -> Tensor(a)[] . Make sure you have loaded your custom_ops_generated_lib
2022-10-11 23:39:50 +00:00
9eb4f9dd17 Tweak test tolerances to be compatible with A10G (#86538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86538
Approved by: https://github.com/ngimel
2022-10-11 23:31:48 +00:00
7fa601b1a7 Skip chalf.mean in test_reductions_large_half_tensors (#86747)
As `mean_reduce` is not implemented for complex half

Fixes https://github.com/pytorch/pytorch/issues/86743 and unblock A10G testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86747
Approved by: https://github.com/ngimel
2022-10-11 23:27:30 +00:00
811b8e012b Revert "min/max support for SymInt/Floats, finish as_strided/scatter/squeeze() backward symint support (#86643)"
This reverts commit 86f914e9966e91b3d3e7c1504f5b1f00a9498d88.

Reverted https://github.com/pytorch/pytorch/pull/86643 on behalf of https://github.com/osalpekar due to Need to revert this to cleanly revert https://github.com/pytorch/pytorch/pull/86488. This should be safe to re-land later
2022-10-11 23:12:40 +00:00
f1fdb6efbd Manual changes for moving dynamo to core (#86621)
This is the subset of the changes in #86461 not auto-generated by `copy_to_core.sh`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86621
Approved by: https://github.com/albanD
2022-10-11 23:01:21 +00:00
09364f4298 Compile C10 with Wshadow (#86666)
This should prevent further regressions like https://github.com/pytorch/pytorch/pull/86646
Update `fmt` to `7.1.0` to fix variable shadowing in that library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86666
Approved by: https://github.com/seemethere
2022-10-11 22:39:58 +00:00
0337f0ad47 Add error checking to flaky test bot platform parser (#86632)
If an invalid platform is specified when disabling a test with flaky test bot, the CI crashes, skipping all tests that come after it.

This turns it into a console message instead.  Not erroring out here since it'll affect random PRs.  Actual error message should go into the bot that parses the original issue so that it can respond on that issue directly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86632
Approved by: https://github.com/huydhn
2022-10-11 21:56:01 +00:00
42bd275233 [doc] LR scheduler example fix (#86629)
Fixes issue #86208
As suggested in the issue, updated the LR scheduler example to use a regular nn.Module like the other examples on the same page.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86629
Approved by: https://github.com/soulitzer
2022-10-11 21:41:50 +00:00
32152ce328 Add original sources/references to Wishart.py in distributions (#86543)
@fritzo As discussed, added original sources/references to Wishart.py in distributions and corrected typos in the error messages.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86543
Approved by: https://github.com/fritzo
2022-10-11 21:21:53 +00:00
50af1ace5e Mark aten ops as canonical (#86215)
This is the first batch of canonical aten ops. 87 in total. More to come in the future PRs.

native_dropout
abs
add.Tensor
add.Scalar
arange.start_step
bitwise_not
bmm
cat
clamp
constant_pad_nd
convolution
convolution_backward
div.Tensor
div.Scalar
embedding_dense_backward
erf
exp
expand
fill.Scalar
grid_sampler_2d
native_group_norm
native_group_norm_backward
native_layer_norm
native_layer_norm_backward
log
_log_softmax
max.dim
amax
mean.dim
min.dim
amin
mm
mul.Tensor
mul.Scalar
native_batch_norm
permute
scalar_tensor
reciprocal
neg
repeat
relu
gelu
rsqrt
sigmoid
slice.Tensor
slice_scatter
_softmax
squeeze.dim
sum.dim_IntList
sqrt
tanh
unsqueeze
var.dim
where.self
clone
sub.Tensor
sub.Scalar
addmm
_to_copy
view
scatter_add
bitwise_and.Tensor
bitwise_or.Tensor
eq.Scalar
ge.Scalar
le.Scalar
gt.Scalar
lt.Scalar
index_select
nonzero
gather
maximum
minimum
pow.Tensor_Scalar
hardtanh
leaky_relu
_adaptive_avg_pool2d
_adaptive_avg_pool2d_backward
avg_pool2d
avg_pool2d_backward
max_pool2d_with_indices
max_pool2d_with_indices_backward
upsample_bilinear2d.vec
upsample_bilinear2d_backward.vec
upsample_nearest2d.vec
upsample_nearest2d_backward.vec
col2im

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86215
Approved by: https://github.com/suo, https://github.com/anjali411
2022-10-11 21:12:53 +00:00
8db30255c3 [ROCm] set nvfuser default to disabled, keep CI (#86369)
Bug fix. nvfuser is functional for ROCm on gfx906, but some tests are failing for other gfx targets. Disable nvfuser until all features are verified. Users may still opt-in by setting the known env var PYTORCH_JIT_ENABLE_NVFUSER=1. This PR sets this env var for the github actions workflow for ROCm since all current CI hosts are gfx906.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86369
Approved by: https://github.com/huydhn
2022-10-11 20:55:58 +00:00
5ffe24fca4 [vulkan][ez] fix always printing out a warning when retrieving the global context (#86697)
Summary: D40151818 (82ed5ca340) replaces the `TORCH_CHECK` with a `TORCH_WARN`, but since it does not check whether the context is valid, the message gets printed every time. This diff fixes that.

Test Plan:
Referring to [Pytorch Vulkan Testing Procedures](https://fb.quip.com/fZALAc9zhlcU)

On Mac:
1. `vulkan_api_test` on Mac
2. model comparison binary on Mac

On Android:
1. `vulkan_api_test` on Android
2. benchmark binary on Android

Reviewed By: salilsdesai

Differential Revision: D40266820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86697
Approved by: https://github.com/kirklandsign
2022-10-11 20:16:56 +00:00
f32aeeae00 Set interface_call to true by default (#86668)
Summary: ASR models need it

Test Plan: existing unit tests

Reviewed By: cccclai

Differential Revision: D40251788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86668
Approved by: https://github.com/cccclai
2022-10-11 20:07:58 +00:00
7f02f2ac0c [Experimentation] Add TSAN build and test (#85313)
Some parts of the PR are adapted from the previously abandoned https://github.com/pytorch/pytorch/pull/36694. This PR is the first step to set up TSAN jobs in the CI. The data race warnings from TSAN will need to be reviewed later in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85313
Approved by: https://github.com/osalpekar
2022-10-11 19:34:44 +00:00
92562046e9 Optimize __dlpack_device__ performance (#86665)
This can be critical when processing a large number of tensors.

```bash
python -m timeit --setup 'import torch; t = torch.empty(1000, device="cuda")' 't.__dlpack_device__()'
```

Based on 1.12.1:
- before: 100000 loops, best of 5: 2.32 usec per loop
- after: 500000 loops, best of 5: 844 nsec per loop

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86665
Approved by: https://github.com/SunDoge, https://github.com/soulitzer
2022-10-11 19:03:46 +00:00
c12f829cce [nn] Add remove_duplicate flag to named_buffers (#674) (#85903)
Summary:
X-link: https://github.com/pytorch/torchrec/pull/674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84984

This allows named_buffers to return the same buffer objects with different names multiple times, which is needed by internal use cases.
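A minimal sketch of the effect, assuming the `remove_duplicate` keyword added by this stack:

```python
import torch
from torch import nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        buf = torch.zeros(3)
        self.register_buffer("a", buf)
        self.register_buffer("b", buf)  # same tensor object, second name

m = M()
# Default behaviour dedupes buffers by object identity, so only one name shows up.
print([name for name, _ in m.named_buffers()])                        # ['a']
# With remove_duplicate=False the shared buffer is yielded once per name.
print([name for name, _ in m.named_buffers(remove_duplicate=False)])  # ['a', 'b']
```
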
ghstack-source-id: 168589597

Test Plan:
python test/test_nn.py -k test_buffers_and_named_buffers

Imported from OSS

Reviewed By: albanD

Differential Revision: D39493161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85903
Approved by: https://github.com/albanD
2022-10-11 18:49:09 +00:00
693250ac85 Docs: fx.Node docs incorrectly state that the self argument is included in args for module calls (#86685)
It seems like the [torch.fx.Node docs](https://pytorch.org/docs/stable/fx.html#torch.fx.Node) are incorrect regarding the inclusion of the self argument for module call nodes.
While the docs state that self (the module) is included in `args`, it is in fact not, as demonstrated by this code:
```python
import torch
from torch import fx, nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.submod = nn.Linear(10, 10)
    def forward(self, x):
        x = x.flatten()
        return self.submod(x)

graph_module = fx.symbolic_trace(Net())
print(graph_module.graph)  # doesn't show self for the submodule call
submod_node = list(graph_module.graph.nodes)[2]
print(submod_node.op)  # call_module
print(submod_node.args)  # (flatten,) => would need to have len 2 if self was included

flatten_node = list(graph_module.graph.nodes)[1]
print(flatten_node.op)  # call_method
print(flatten_node.args)  # (x,) => here self is included (and docs are correct)
```

Since [torch.fx.Interpreter also uses `args` as if self were not included](2fe5808590/torch/fx/interpreter.py (L288)), I assume the docs are incorrect.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86685
Approved by: https://github.com/soulitzer
2022-10-11 18:05:56 +00:00
160118d72a Add test case for matrix multiply-add with large inputs (#85550)
Summary:
- Added test case for addmm, baddbmm and linear with large inputs
- Testing with torch types: float32, float16, bfloat16

Test Plan:
Run unit tests with:
`buck2 run mode/opt //caffe2/test:linalg_re_cuda`

```
...
test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_10000_10000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_1000_10000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_1000_1000_1000_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_100_100_100_cpu_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_100_100_100_cpu_float16 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_2_100_100_100_cpu_float32 (test_linalg_re_cuda.TestLinalgReCudaCPU) ... skipped 'Only runs on cuda'
test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_10000_10000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_1_10000_1000_10000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_1000_1000_1000_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_100_100_100_cuda_bfloat16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_100_100_100_cuda_float16 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok
test_addmm_baddbmm_large_input_2_100_100_100_cuda_float32 (test_linalg_re_cuda.TestLinalgReCudaCUDA) ... ok

----------------------------------------------------------------------
Ran 24 tests in 63.224s

OK (skipped=12)
```

Differential Revision: D39718256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85550
Approved by: https://github.com/IvanYashchuk, https://github.com/malfet
2022-10-11 17:52:21 +00:00
212fa874ce Fix torch histogramdd docstring (#86593)
Fixed the torch.histogramdd docstring, which was missing common_args.
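The fix only touches the shared argument text in the docstring; for reference, a minimal call to the documented API:

```python
import torch

# 100 random 2-D points histogrammed on an assumed 5x5 grid.
points = torch.rand(100, 2)
hist, bin_edges = torch.histogramdd(points, bins=[5, 5])
print(hist.shape)                    # torch.Size([5, 5])
print([e.shape for e in bin_edges])  # [torch.Size([6]), torch.Size([6])]
```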

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86593
Approved by: https://github.com/soulitzer
2022-10-11 17:52:18 +00:00
f26292d91e [BE] Fix python docs typos up till torch.chunk (#86642)
I was doing the Views lab linked at https://github.com/pytorch/pytorch/wiki/Tensor-and-Operator-Basics and noticed a few typos, which led to this PR.

Test plan:
verified in preview
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86642
Approved by: https://github.com/soulitzer
2022-10-11 17:42:53 +00:00
86f914e996 min/max support for SymInt/Floats, finish as_strided/scatter/squeeze() backward symint support (#86643)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86643
Approved by: https://github.com/anjali411
2022-10-11 17:37:30 +00:00
6923dc3b59 Add module: decompositions as an owner to test_decomp.py (#86703)
so flaky tests can be attributed to @SherlockNoMad too 😛
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86703
Approved by: https://github.com/albanD
2022-10-11 17:23:36 +00:00
109f4d4453 Move functorch tests from functorch/test/* to test/functorch/* (#86623)
This is the first step described in https://github.com/pytorch/pytorch/issues/86618. test/functorch/* is the final location for these tests.

Test Plan:
- Check that the functorch shards in CI are still running tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86623
Approved by: https://github.com/huydhn
2022-10-11 17:20:45 +00:00
51ea441862 Upcast to fp32 in test_addmm_block ref_half_bfloat16 (#86682)
Fixes https://github.com/pytorch/pytorch/issues/86681
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86682
Approved by: https://github.com/nikitaved
2022-10-11 16:39:57 +00:00
3edf79dc03 Revert "Add meta support for _adaptive_avg_pool2d_backward (#86359)"
This reverts commit a56a8c0fc0251bb4cd24b366a290db2e4beea747.

Reverted https://github.com/pytorch/pytorch/pull/86359 on behalf of https://github.com/clee2000 due to causing unexpected success for functorch on master but PR is green (landrace?) https://github.com/pytorch/pytorch/actions/runs/3227306657/jobs/5282180524 a56a8c0fc0
2022-10-11 16:33:41 +00:00
97de281176 Improve interpolate() speed for channels_last CPU images and masks (#86361)
This PR improves the speed of `interpolate()`:
- on CPU
- on images and masks (`num_channels < 4`, `channels_last=True`)
- for the following modes: linear (antialias=False), nearest (int and float), and nearest-exact (int and float)
- for both upsampling and downsampling

The actual speed-up ranges from 1.1X to 110X, but this depends on various factors like the number of threads and, of course, input_size/output_size. In a typical torchvision ImageNet training job (where num_threads=1 because of DataLoader multi-processing), the following speed-ups should be expected (I ran many more benchmarks than this one; see below for more details):

```
(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=1   1.0X  1.0ms vs 1.0ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   2.1X  1.0ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=1   7X    0.8ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   14X   0.852ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.828ms vs 0.087ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   15X   0.922ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.897ms vs 0.087ms
```
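
Outside the benchmark harness, the optimized path corresponds to calls like this sketch (CPU, `channels_last`, few channels, one of the affected modes):

```python
import torch
import torch.nn.functional as F

# Sketch of an input hitting the optimized path: CPU, channels_last, 3 channels.
img = torch.rand(1, 3, 600, 400).contiguous(memory_format=torch.channels_last)
out = F.interpolate(img, size=(224, 224), mode="bilinear", align_corners=False)
print(out.shape)  # torch.Size([1, 3, 224, 224])
```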

An immediate follow-up to this PR would be to make the same changes for the 3D kernels.
Thanks a ton @fmassa for the help!

### Speedup benchmarks:

Results:

<details>

```
----------------------------------------------------------------------------------------------------
(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=1   0.9X  0.9ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=1   1.6X  0.9ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=1   1.7X  1.0ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=1   8X    0.806ms vs 0.097ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=1   15X   0.848ms vs 0.056ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=1   10X   0.828ms vs 0.084ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=1   16X   0.914ms vs 0.057ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=1   10X   0.900ms vs 0.086ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=2   1.6X  1.1ms vs 0.7ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=2   1.6X  0.6ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=2   1.7X  0.6ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=2   1.7X  0.5ms vs 0.3ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=2   9X    0.800ms vs 0.088ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=2   11X   0.459ms vs 0.043ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=2   7X    0.424ms vs 0.064ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=2   12X   0.503ms vs 0.043ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=2   8X    0.461ms vs 0.059ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=12  3X    1.1ms vs 0.3ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=12  1.6X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=12  1.5X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=12  5X    0.8ms vs 0.2ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=12  10X   0.445ms vs 0.047ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=12  7X    0.432ms vs 0.062ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=12  10X   0.478ms vs 0.046ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=12  7X    0.470ms vs 0.063ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=32  3X    1.1ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=32  1.8X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=32  1.4X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=32  11X   0.815ms vs 0.074ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=32  10X   0.443ms vs 0.045ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=32  7X    0.436ms vs 0.061ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=32  10X   0.478ms vs 0.046ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.061ms
----------------------------------------------------------------------------------------------------
(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=1   0.9X  0.9ms vs 1.1ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=1   1.5X  0.9ms vs 0.6ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=1   1.6X  1.0ms vs 0.6ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=1   8X    0.808ms vs 0.099ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=1   15X   0.848ms vs 0.058ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.820ms vs 0.087ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=1   16X   0.909ms vs 0.059ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.898ms vs 0.088ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=2   1.4X  0.9ms vs 0.7ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=2   1.5X  0.5ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=2   1.5X  0.5ms vs 0.4ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=2   1.8X  0.5ms vs 0.3ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=2   9X    0.799ms vs 0.090ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=2   10X   0.459ms vs 0.045ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=2   7X    0.427ms vs 0.059ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=2   11X   0.501ms vs 0.044ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=2   8X    0.460ms vs 0.060ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=12  2.9X  1.0ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=12  1.2X  0.2ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=12  1.1X  0.2ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=12  12X   0.809ms vs 0.068ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=12  11X   0.438ms vs 0.041ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=12  8X    0.432ms vs 0.055ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=12  12X   0.480ms vs 0.041ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=12  8X    0.464ms vs 0.056ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=32  3X    1.1ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=32  1.3X  0.3ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=32  1.4X  0.3ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=32  11X   0.813ms vs 0.075ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=32  10X   0.443ms vs 0.046ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=32  7X    0.433ms vs 0.061ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=32  10X   0.478ms vs 0.046ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.062ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=1   0.9X  4.5ms vs 5.2ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=1   1.5X  4.2ms vs 2.8ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=1   1.8X  4.1ms vs 2.3ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=1   1.6X  4.5ms vs 2.8ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=1   1.9X  4.4ms vs 2.3ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=1   9X    3.8ms vs 0.4ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=1   17X   4.0ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=1   11X   3.9ms vs 0.4ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=1   19X   4.4ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=1   12X   4.3ms vs 0.4ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=2   1.5X  4.5ms vs 3.1ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=2   1.4X  2.3ms vs 1.6ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=2   1.7X  2.1ms vs 1.2ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=2   1.6X  2.5ms vs 1.6ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=2   1.8X  2.2ms vs 1.2ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=2   15X   3.8ms vs 0.3ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=2   15X   2.2ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=2   7X    2.0ms vs 0.3ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=2   16X   2.4ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=2   8X    2.2ms vs 0.3ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=12  8X    5.2ms vs 0.7ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=12  1.3X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=12  1.7X  0.4ms vs 0.2ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=12  1.4X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=12  1.8X  0.4ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=12  36X   3.9ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=12  10X   0.526ms vs 0.051ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=12  7X    0.514ms vs 0.069ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=12  11X   0.569ms vs 0.052ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=12  8X    0.557ms vs 0.070ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=32  9X    4.5ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=32  0.5X  0.2ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=32  1.0X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=32  44X   3.864ms vs 0.087ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=32  10X   0.527ms vs 0.053ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=32  7X    0.516ms vs 0.070ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=32  10X   0.567ms vs 0.055ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=32  8X    0.558ms vs 0.072ms
----------------------------------------------------------------------------------------------------
(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=1   1.0X  1.9ms vs 1.9ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=1   2.0X  1.8ms vs 0.9ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=1   1.7X  1.8ms vs 1.0ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=1   2.1X  1.9ms vs 0.9ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=1   1.9X  1.9ms vs 1.0ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=1   9X    1.6ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=1   16X   1.7ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=1   10X   1.7ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=1   17X   1.9ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=1   11X   1.8ms vs 0.2ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=2   1.7X  1.9ms vs 1.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=2   2.0X  1.0ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=2   1.7X  0.9ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=2   2.3X  1.1ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=2   1.8X  1.0ms vs 0.5ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=2   8X    1.6ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=2   14X   0.931ms vs 0.067ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=2   7X    0.9ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=2   15X   1.016ms vs 0.069ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=2   9X    0.9ms vs 0.1ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=12  8X    1.9ms vs 0.3ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=12  1.7X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=12  1.9X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=12  20X   1.630ms vs 0.081ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=12  10X   0.457ms vs 0.044ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=12  7X    0.439ms vs 0.060ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=12  11X   0.485ms vs 0.045ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=12  8X    0.474ms vs 0.061ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=32  8X    1.9ms vs 0.3ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=32  2.0X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=32  1.4X  0.2ms vs 0.2ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=32  21X   1.628ms vs 0.078ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=32  9X    0.453ms vs 0.048ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=32  7X    0.445ms vs 0.063ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=32  11X   0.535ms vs 0.048ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=32  8X    0.502ms vs 0.063ms
----------------------------------------------------------------------------------------------------
(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=1   1.0X  13.8ms vs 14.0ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=1   1.8X  13.1ms vs 7.4ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=1   1.8X  11.1ms vs 6.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=1   1.9X  13.9ms vs 7.4ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=1   1.9X  11.8ms vs 6.1ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=1   10X   10.2ms vs 1.1ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=1   19X   10.8ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=1   11X   10.4ms vs 0.9ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=1   20X   11.6ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=1   12X   11.4ms vs 0.9ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=2   1.8X  13.7ms vs 7.7ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=2   2.6X  7.3ms vs 2.8ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=2   1.8X  5.6ms vs 3.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=2   1.9X  7.9ms vs 4.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=2   1.9X  6.0ms vs 3.1ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=2   18X   10.1ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=2   19X   5.8ms vs 0.3ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=2   10X   5.3ms vs 0.5ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=2   20X   6.3ms vs 0.3ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=2   11X   5.7ms vs 0.5ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=12  8X    13.8ms vs 1.6ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=12  2.9X  1.5ms vs 0.5ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=12  1.7X  1.0ms vs 0.5ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=12  1.5X  1.5ms vs 1.0ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=12  1.8X  1.0ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=12  80X   10.1ms vs 0.1ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=12  13X   0.928ms vs 0.072ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=12  8X    0.9ms vs 0.1ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=12  13X   1.001ms vs 0.074ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=12  9X    1.0ms vs 0.1ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=32  18X   14.0ms vs 0.8ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=32  1.9X  1.0ms vs 0.6ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=32  2.9X  0.7ms vs 0.2ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=32  1.7X  0.9ms vs 0.6ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=32  1.8X  0.4ms vs 0.2ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=32  111X  10.254ms vs 0.092ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=32  14X   0.784ms vs 0.056ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=32  7X    0.551ms vs 0.075ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=32  11X   0.607ms vs 0.057ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=32  8X    0.596ms vs 0.076ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=1   1.0X  0.084ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=1   1.0X  0.077ms vs 0.078ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=1   1.0X  0.076ms vs 0.076ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=1   1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=1   1.0X  0.081ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=1   1.0X  0.071ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=1   1.0X  0.074ms vs 0.074ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=1   1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=1   1.0X  0.080ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=1   0.9X  0.078ms vs 0.084ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=2   1.0X  0.083ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=2   1.0X  0.076ms vs 0.077ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=2   1.0X  0.075ms vs 0.074ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=2   1.0X  0.082ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=2   1.0X  0.080ms vs 0.083ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=2   1.0X  0.070ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=2   1.0X  0.073ms vs 0.075ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=2   1.0X  0.071ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=2   1.0X  0.079ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=2   1.0X  0.077ms vs 0.079ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=12  1.0X  0.083ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=12  1.0X  0.080ms vs 0.078ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=12  1.0X  0.077ms vs 0.075ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=12  1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=12  1.0X  0.083ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=12  1.0X  0.071ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=12  1.0X  0.076ms vs 0.074ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=12  1.0X  0.073ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=12  1.0X  0.080ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=12  1.0X  0.080ms vs 0.078ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=32  1.0X  0.084ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=32  1.0X  0.078ms vs 0.077ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=32  1.0X  0.076ms vs 0.076ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=32  1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=32  1.0X  0.081ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=32  1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=32  1.0X  0.074ms vs 0.075ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=32  1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=32  1.0X  0.077ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=32  1.0X  0.076ms vs 0.079ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=1   1.0X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=1   1.8X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=1   1.6X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=1   2.0X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=1   1.7X  0.3ms vs 0.2ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=1   6X    0.265ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=1   10X   0.280ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=1   7X    0.273ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=1   11X   0.303ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=1   8X    0.297ms vs 0.038ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=2   1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=2   1.8X  0.163ms vs 0.093ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=2   1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=2   1.9X  0.180ms vs 0.096ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=2   1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=2   6X    0.264ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=2   10X   0.278ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=2   7X    0.270ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=2   11X   0.298ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=2   8X    0.293ms vs 0.037ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=12  1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=12  1.7X  0.158ms vs 0.095ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=12  1.7X  0.170ms vs 0.100ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=12  6X    0.269ms vs 0.043ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=12  11X   0.291ms vs 0.027ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=12  8X    0.281ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=12  11X   0.305ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=12  8X    0.306ms vs 0.038ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=32  1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=32  1.6X  0.160ms vs 0.098ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=32  1.7X  0.171ms vs 0.099ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=32  6X    0.269ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=32  10X   0.282ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=32  7X    0.276ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=32  11X   0.305ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=32  8X    0.299ms vs 0.038ms
----------------------------------------------------------------------------------------------------
(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=1   1.0X  1.2ms vs 1.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=1   2.0X  1.2ms vs 0.6ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=1   1.7X  1.1ms vs 0.7ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=1   2.1X  1.2ms vs 0.6ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=1   1.9X  1.2ms vs 0.7ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=1   8X    1.1ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=1   15X   1.109ms vs 0.073ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=1   10X   1.1ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=1   16X   1.192ms vs 0.074ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=1   11X   1.2ms vs 0.1ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=2   1.7X  1.2ms vs 0.7ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=2   2.0X  0.6ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=2   1.7X  0.6ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=2   2.2X  0.7ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=2   1.8X  0.6ms vs 0.3ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=2   9X    1.0ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=2   11X   0.598ms vs 0.052ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=2   8X    0.556ms vs 0.072ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=2   12X   0.649ms vs 0.053ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=2   8X    0.598ms vs 0.073ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=12  5X    1.2ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=12  1.3X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=12  1.4X  0.2ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=12  9X    1.0ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=12  12X   0.572ms vs 0.048ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=12  8X    0.560ms vs 0.068ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=12  13X   0.617ms vs 0.049ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=12  9X    0.604ms vs 0.068ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=32  5X    1.2ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=32  13X   1.042ms vs 0.081ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=32  12X   0.586ms vs 0.050ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=32  8X    0.562ms vs 0.069ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=32  12X   0.621ms vs 0.051ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=32  9X    0.609ms vs 0.070ms
----------------------------------------------------------------------------------------------------
(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=1   1.0X  1.0ms vs 1.0ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   2.1X  1.0ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=1   7X    0.8ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   14X   0.852ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.828ms vs 0.087ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   15X   0.922ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.897ms vs 0.087ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=2   1.6X  0.9ms vs 0.6ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=2   1.9X  0.5ms vs 0.2ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=2   2.1X  0.5ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=2   1.8X  0.5ms vs 0.3ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=2   10X   0.808ms vs 0.084ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=2   10X   0.462ms vs 0.046ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=2   7X    0.429ms vs 0.062ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=2   12X   0.504ms vs 0.044ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=2   7X    0.461ms vs 0.063ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=12  4X    1.0ms vs 0.2ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=12  1.7X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=12  1.9X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=12  12X   0.820ms vs 0.067ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=12  11X   0.438ms vs 0.041ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=12  8X    0.431ms vs 0.056ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=12  12X   0.482ms vs 0.041ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=12  8X    0.467ms vs 0.056ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=32  4X    1.0ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=32  1.7X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=32  1.8X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=32  12X   0.824ms vs 0.070ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=32  10X   0.443ms vs 0.044ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=32  7X    0.438ms vs 0.059ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=32  11X   0.479ms vs 0.045ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.059ms
----------------------------------------------------------------------------------------------------
(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=1   1.0X  4.7ms vs 4.7ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=1   2.0X  4.4ms vs 2.2ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=1   1.8X  4.3ms vs 2.5ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=1   2.1X  4.7ms vs 2.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=1   1.9X  4.6ms vs 2.5ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=1   9X    4.0ms vs 0.4ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=1   17X   4.2ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=1   11X   4.1ms vs 0.4ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=1   19X   4.6ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=1   12X   4.5ms vs 0.4ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=2   1.7X  4.7ms vs 2.7ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=2   2.1X  2.4ms vs 1.1ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=2   1.8X  2.2ms vs 1.3ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=2   2.3X  2.6ms vs 1.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=2   1.9X  2.3ms vs 1.3ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=2   15X   4.0ms vs 0.3ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=2   16X   2.3ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=2   9X    2.1ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=2   17X   2.5ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=2   10X   2.3ms vs 0.2ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=12  10X   4.7ms vs 0.5ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=12  1.9X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=12  1.7X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=12  1.9X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=12  1.8X  0.4ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=12  41X   3.969ms vs 0.096ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=12  11X   0.545ms vs 0.051ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=12  8X    0.532ms vs 0.070ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=12  11X   0.590ms vs 0.052ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=12  8X    0.578ms vs 0.071ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=32  17X   4.7ms vs 0.3ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=32  1.8X  0.2ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=32  2.0X  0.3ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=32  1.9X  0.2ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=32  45X   4.028ms vs 0.090ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=32  10X   0.549ms vs 0.053ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=32  7X    0.536ms vs 0.072ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=32  11X   0.592ms vs 0.055ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=32  8X    0.581ms vs 0.074ms

```
</details>

Code:

<details>

I used this file, which is adapted from https://github.com/pytorch/pytorch/blob/master/benchmarks/operator_benchmark/pt/interpolate_test.py

```py
import operator_benchmark as op_bench
import torch

"""Microbenchmarks for interpolate operator."""

class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last=False, mode='linear', dtype=torch.float):

        input_image = torch.randint(0, 256, size=input_size, dtype=dtype, device='cpu',
                                    requires_grad=self.auto_set())
        if channels_last:
            if input_image.ndim == 4:
                input_image = input_image.contiguous(memory_format=torch.channels_last)
            elif input_image.ndim == 5:
                input_image = input_image.contiguous(memory_format=torch.channels_last_3d)
            else:
                raise ValueError(
                    f"Can not set channels_last to the input of {input_image.ndim} dims"
                )

        align_corners = None if "nearest" in mode else False

        if mode == "linear":
            mode = {
                3: 'linear',
                4: 'bilinear',
                5: 'trilinear',
            }[input_image.ndim]

        self.inputs = {
            "input_image": input_image,
            "output_size": output_size,
            "mode": mode,
            "align_corners": align_corners,
        }

        self.set_module_name("interpolate")

    def forward(self, input_image, output_size, mode, align_corners):
        return torch.nn.functional.interpolate(input_image, size=output_size, mode=mode,
                                               align_corners=align_corners)

def make_config():
    sizes = (
        ((224, 224), (64, 64)),
        ((224, 224), (128, 128)),
        ((600, 400), (224, 224)),
        ((320, 320), (256, 256)),
        ((800, 800), (500, 500)),
    )

    attrs = []
    for (HW1, HW2) in sizes:
        attrs.append([(1, 3, *HW1), HW2])  # 3 channels
        attrs.append([(1, 1, *HW1), HW2])  # 1 channel

        attrs.append([(1, 3, *HW2), HW1])  # 3 channels
        attrs.append([(1, 1, *HW2), HW1])  # 1 channel

    config = op_bench.config_list(
        attr_names=["input_size", "output_size"],
        attrs=attrs,
        cross_product_configs={
            'channels_last': [True],
            'mode': ["linear", "nearest", "nearest-exact"],
            'dtype': [torch.float, torch.uint8]
        },
        tags=["short"],
    )

    # Need to remove instances with both torch.uint8 and linear
    # Note: this is naaaasty
    def get_mode(l):
        for d in l:
            if "mode" in d:
                return d["mode"]
    def get_dtype(l):
        for d in l:
            if "dtype" in d:
                return d["dtype"]
    config = [l for l in config if not(get_mode(l) == "linear" and get_dtype(l) == torch.uint8)]
    return config

config = make_config()
op_bench.generate_pt_test(config, InterpolateBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```

with

```
for num_threads in 1 2 12 32; do echo "num_threads=$num_threads" && python -m pt.my_interpolate_test --iterations 1000 --omp_num_threads $num_threads ; done > $out_file
```

and this very ugly helper

```py
import re
with open("main") as f:
    main = f.readlines()

with open("new") as f:
    new = f.readlines()

out = []

for main_line, new_line in zip(main, new):
    if main_line.startswith("num_threads="):
        num_threads = int(main_line.split("=")[-1])
    if main_line.startswith("# Input"):
        deets = f"{main_line.strip()}, {num_threads=}"
    if main_line.startswith("Forward"):
        main_time = float(main_line.split()[-1])
        new_time = float(new_line.split()[-1])
        ratio = main_time / new_time
        fmt = ".1f" if ratio < 3 else ".0f"
        improv = f"{ratio:{fmt}}X"
        time_fmt = ",.3f" if new_time < 100 else ",.1f"
        deets = deets.strip().replace("# Input: ", "")
        deets = deets.replace(": ", "=")
        deets = deets.replace("input_size=", "")
        deets = deets.replace(", output_size=", " -> ")
        deets = deets.replace("dtype=torch.", "")
        deets = deets.replace("mode=", "")
        deets = deets.replace("channels_last=True, ", "")
        split = deets.split(",")
        size = ','.join(split[:-3])
        mode, dtype, threads = split[-3:]
        deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}"

        l = f"{deets}  {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms"
        out.append(l)

def key(s):
    # s = ''.join(s.split()[1:]) # remove "N.nX" part
    num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),)

    input_shape, output_shape = re.findall(r"\(.*?\)", s)
    input_shape = input_shape[1:-1]  # remove parenthesis
    input_HW = tuple(int(x) for x in input_shape.split(",")[-2:])
    input_C = (-int(input_shape.split(",")[1]),)

    output_HW = tuple(int(x) for x in output_shape[1:-1].split(","))
    is_downsample = (output_HW[0] < input_HW[0],)
    if "linear" in s:
        mode = "linear"
    elif "nearest-exact" in s:
        mode = "nearest-exact"
    else:
        assert "nearest" in s
        mode = "nearest"
    mode = (mode,)
    return is_downsample + input_HW + output_HW + num_threads + input_C + mode

for i, l in enumerate(sorted(out, key=key)):
    if i % 10 == 0 and i % 40 != 0:
        print()
    if i % 40 == 0:
        print("-" * 100)
    print(l)

```

</details>

Closes https://github.com/pytorch/pytorch/issues/83840

When this is merged we should be able to remove some hack in vision as well https://github.com/pytorch/vision/pull/6661 (CC @vfdev-5 @datumbox )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86361
Approved by: https://github.com/vfdev-5, https://github.com/datumbox, https://github.com/fmassa
2022-10-11 16:17:36 +00:00
a4ee6956ff Pin numpy version during MPS tests (#86691)
numpy-1.23.1 for some reason cannot be loaded on M1

Fixes https://github.com/pytorch/pytorch/issues/86688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86691
Approved by: https://github.com/DanilBaibak, https://github.com/atalman, https://github.com/seemethere
2022-10-11 16:11:47 +00:00
eqy
352d926482 [CUBLAS][CUDA GRAPHS] (re-re-re-re-open of #83461) Explicitly set the workspace for cuBLAS handles (#86645)
re-opening (again) in hopes of working around failed/stuck CLA check

CC @ptrblck @ngimel @huydhn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86645
Approved by: https://github.com/zdevito
2022-10-11 16:03:49 +00:00
937d677d9f Add version selector back to functorch docs (#86602)
I accidentally deleted it in
https://github.com/pytorch/pytorch/pull/85856/. This brings the version
selector back.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86602
Approved by: https://github.com/samdow
2022-10-11 14:49:42 +00:00
a56a8c0fc0 Add meta support for _adaptive_avg_pool2d_backward (#86359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86359
Approved by: https://github.com/ezyang, https://github.com/albanD
2022-10-11 13:37:25 +00:00
03d8ab4dec Skip forward AD tests for torch.native_batch_norm (#86206)
`test_forward_mode_AD` has problems with `torch.native_batch_norm` when computing Jacobian using finite-differences. Weirdly this test unexpectedly passed on periodic CI. Let's skip this test instead of xfailing.
Fixes https://github.com/pytorch/pytorch/issues/86175
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86206
Approved by: https://github.com/soulitzer
2022-10-11 13:03:20 +00:00
6ab07febce [FSDP][Easy] Rename _prefixed_param_names -> _fqns for consistency (#86653)
This renames `_prefixed_param_names` to `_fqns` to help converge on the terminology.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86653
Approved by: https://github.com/rohan-varma
2022-10-11 12:49:45 +00:00
2fe5808590 Symintify NLL loss, copy and squeeze (#86606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86606
Approved by: https://github.com/anjali411
2022-10-11 12:00:40 +00:00
be8627827e More symintification of get/set item (#86605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86605
Approved by: https://github.com/anjali411
2022-10-11 12:00:40 +00:00
f841442252 symintify autograd view chaining (#86604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86604
Approved by: https://github.com/anjali411
2022-10-11 12:00:38 +00:00
49c9b0a154 symintify einsum (#86603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86603
Approved by: https://github.com/anjali411
2022-10-11 12:00:35 +00:00
3a2cfbb813 Revert "Improve interpolate() speed for channels_last images and masks (#86361)"
This reverts commit 93b2d991581db86074dd8011fdc903bd554466b1.

Reverted https://github.com/pytorch/pytorch/pull/86361 on behalf of https://github.com/DanilBaibak due to Break the internal import process
2022-10-11 10:17:27 +00:00
17074389de index op with int32 support (#86318)
Differential Revision: D40089960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86318
Approved by: https://github.com/malfet
2022-10-11 06:12:17 +00:00
88a8a900b9 fix: half reduction with multiple sub-iterators (#85596)
Fixes #74438

TODO:
* [x] Add test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85596
Approved by: https://github.com/ngimel
2022-10-11 05:40:12 +00:00
55479fe80e Enable capturing of comm collective parameters (#98) (#85368)
Summary:
X-link: https://github.com/facebookresearch/torch_ucc/pull/98

Add tensor input, output, and other metadata for PyTorch comms.

Test Plan: P517138779

Reviewed By: Pavani-Panakanti

Differential Revision: D38357077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85368
Approved by: https://github.com/H-Huang
2022-10-11 04:38:26 +00:00
ad2b04c39c [torchdynamo hash update] update the pinned torchdynamo hash (#86651)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned torchdynamo hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86651
Approved by: https://github.com/pytorchbot
2022-10-11 03:29:01 +00:00
bd381121b9 [vision hash update] update the pinned vision hash (#86652)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86652
Approved by: https://github.com/pytorchbot
2022-10-11 03:24:32 +00:00
deb414a43f Revert "Use FindCUDAToolkit to find cuda dependencies (#82695)"
This reverts commit fb9b96593c784b86b3d913ef8799ee120c203207.

Reverted https://github.com/pytorch/pytorch/pull/82695 on behalf of https://github.com/malfet due to Break cublas packaging into wheel
2022-10-11 02:50:47 +00:00
577070ff96 update fbgemm commit ID in PyTorch (#86577)
Summary:
Update after https://github.com/pytorch/FBGEMM/pull/1388 .

Previous issue: D40216348

Test Plan: CI

Differential Revision: D40219252

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86577
Approved by: https://github.com/malfet
2022-10-11 02:15:53 +00:00
d8b971ed25 Fixes for partitioner with symbolic shapes (#86425)
- supports saving symint (and symfloat..) values between fw/bwd, using sketchy logic that probably needs to be improved but seems to work so far
- sets a correct weight=1 for sym nodes for cost purposes
- lets user functions return symints/floats (but if the same symfloat is saved for backward, that gets duplicated annoyingly)
- makes partitioning decisions based on observed trace-time sizes without guarding! (this is sketchy, but it isn't clear that it will lead to bad partitioning choices either)
- improves infra for tracking symint-family of types: is_sym_node() and _py_sym_types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86425
Approved by: https://github.com/ezyang
2022-10-11 01:42:28 +00:00
16f65f178a Nested tensor forward only chunk operations (#85645)
# Summary

Taking over this pr: https://github.com/pytorch/pytorch/pull/83736

Adding support for chunk without autograd support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85645
Approved by: https://github.com/cpuhrsch
2022-10-11 01:21:39 +00:00
4fc0d5341c [PyTorch][Fix] Improve numerical stability of HistogramObserver (#86522)
Summary:
As titled, HistogramObserver may fail in a certain scenario.
Specifically, we originally compute `hist_bin_width` as `(self.max_val - self.min_val) / (self.bins * upsample_rate)`. It's possible that the numerator is close to the smallest positive FP32 value (1.4e-45), in which case the division collapses the bin width and downstream computations overflow.

This brings in some redundant computations to avoid that scenario.
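
For reference, here is a minimal sketch with made-up values (not the observer code itself) showing how the naive bin-width computation degenerates:

```python
import torch

# Hypothetical values: max_val - min_val is near the smallest positive float32.
min_val = torch.tensor(0.0)
max_val = torch.tensor(1.4e-45)
bins, upsample_rate = 2048, 128

hist_bin_width = (max_val - min_val) / (bins * upsample_rate)
print(hist_bin_width)                          # tensor(0.) -- the width collapses
print((max_val - min_val) / hist_bin_width)    # tensor(inf) -- downstream blow-up
```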

Test Plan: https://pxl.cl/2ggD4 (04490e90ea)

Differential Revision: D40149594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86522
Approved by: https://github.com/jerryzh168
2022-10-11 01:21:16 +00:00
8a47a49d5e [quant] Move the order of x86 engine to avoid changing the default qengine (#86631)
Since the default qengine is the last element of the supported_engines list, adding the x86 qengine at the end of the list changed the default quantized engine as well. This PR is a short-term fix to revert that change. We have an issue here to track the proper fix: https://github.com/pytorch/pytorch/issues/86404

Motivation:
A Meta-internal team found that inference failed in onednn prepacking with the error "could not create a primitive descriptor for a reorder primitive." on a COPPER_LAKE machine. We are working with Intel to repro and fix the problem; in the meantime, we'll revert the default option back to fbgemm.
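
A quick way to check the resulting default on a given build (a hedged sketch, not part of this PR):

```python
import torch

# The default engine is derived from supported_engines; on a typical x86 server
# build this should continue to report fbgemm after this change.
print(torch.backends.quantized.supported_engines)
print(torch.backends.quantized.engine)
```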
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86631
Approved by: https://github.com/vkuzo
2022-10-11 00:07:41 +00:00
224ae0da10 [BE] Fix variable shadowing in CUDACachingAllocator.cpp (#86646)
Test Plan: CI

Differential Revision: D40245365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86646
Approved by: https://github.com/seemethere
2022-10-10 23:52:28 +00:00
2cb330ab15 Acyclic partition patch (#86511)
Fixes #86159 and #86108

Refactored graph partition to check for cyclic dependency on each partition merge, instead of relying on a pre-baked dependency map.

The previous implementation suffered from not updating the dependencies of existing partitions. When a fusion happens, the updated dependency map needs to be propagated to all nodes in the graph so that every node in a partition shares an identical dependency set. This is why the previous implementation failed to identify the cyclic dependency in issue #86159.

The updated implementation runs a cycle check on the partitioned graph before attempting to merge two partitions.
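
For intuition, a generic sketch of such a merge-time cycle check on a partition DAG (illustrative only, not the torch.fx code; `succ` maps a partition to the partitions that consume its outputs):

```python
from collections import defaultdict, deque

def reaches_via_intermediate(succ, src, dst):
    """True if src reaches dst through at least one other partition."""
    seen = set()
    queue = deque(n for n in succ[src] if n != dst)
    while queue:
        n = queue.popleft()
        if n == dst:
            return True
        if n in seen:
            continue
        seen.add(n)
        queue.extend(succ[n])
    return False

def merge_creates_cycle(succ, a, b):
    # Contracting a and b into one partition creates a cycle iff one of them
    # reaches the other through some third partition.
    return reaches_via_intermediate(succ, a, b) or reaches_via_intermediate(succ, b, a)

# a -> c -> b alongside a direct a -> b edge: merging a and b would be cyclic.
succ = defaultdict(set, {"a": {"b", "c"}, "c": {"b"}})
print(merge_creates_cycle(succ, "a", "b"))  # True
```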

- [x] python repro added with cyclic dependency after partition `TestFXGraphPasses.forward12`
- [x] fix dependency map with updated implementation using cyclic check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86511
Approved by: https://github.com/SherlockNoMad
2022-10-10 23:48:52 +00:00
dd6dd03ff2 Enable output allocation cache (#86100)
Cherry-picked from devel branch: https://github.com/csarofeen/pytorch/pull/2010

turns on accidentally disabled output allocation cache [#2002](https://github.com/csarofeen/pytorch/issues/2002)
Updated the safety check for the allocation cache: it iterates over all IterDomains on outputs and enables cache re-use only when no extent value is a consumer of fusion inputs (i.e. output sizes do not depend on scalar inputs).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86100
Approved by: https://github.com/csarofeen
2022-10-10 23:31:21 +00:00
82ed5ca340 [Vulkan] Don't crash immediately if Vulkan context could not be retrieved (#86485)
Test Plan: Internal AIBench test

Reviewed By: SS-JIA

Differential Revision: D40151818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86485
Approved by: https://github.com/kimishpatel
2022-10-10 22:32:44 +00:00
b409d1f65b Turn on Data Dependent Throwing (#86480)
This was already enabled in TorchDynamo, but was staged to make sure things don't break. Also makes backward single threaded for tests to fix a memory leak.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86480
Approved by: https://github.com/bdhirsh
2022-10-10 21:58:29 +00:00
ce7751188a [DDP] Add PackedSequence support when device_ids is specified (#86614)
Before this PR, if a user runs DDP with `device_ids` specified and with a `PackedSequence` input, then the execution will error with something like:
```
raise ValueError(
  ValueError: batch_sizes should always be on CPU. Instances of PackedSequence should never be created manually. They should be instantiated by
 functions like pack_sequence and pack_padded_sequences in nn.utils.rnn. https://pytorch.org/docs/stable/nn.html...
```
This is because the DDP forward calls `_to_kwargs()`, which calls `_recursive_to()`, which moves the inputs to GPU. However, `_is_namedtuple(packed_sequence)` returns `True`, leading to the branch `return [type(obj)(*args) for args in zip(*map(to_map, obj))]`, which tries to construct a `PackedSequence` directly via `type(obj)(*args)`, leading to the error.

Repro for `_is_namedtuple(packed_sequence)` returning `True`:
```
import random

import torch
import torch.nn.utils.rnn as rnn_utils
from torch.nn.parallel.scatter_gather import _is_namedtuple

def _ordered_sequence(tensor_type):
    seqs = [tensor_type(random.randint(1, 256))
            for _ in range(32)]
    seqs = [s.random_(-128, 128) for s in seqs]
    ordered = sorted(seqs, key=len, reverse=True)
    return ordered

def _padded_sequence(tensor_type):
    ordered = _ordered_sequence(tensor_type)
    lengths = [len(i) for i in ordered]
    padded_tensor = rnn_utils.pad_sequence(ordered)
    return padded_tensor, lengths

padded, lengths = _padded_sequence(torch.Tensor)
packed = rnn_utils.pack_padded_sequence(
    padded, lengths, enforce_sorted=False)
print(type(packed), packed.data.device)
print(_is_namedtuple(packed))
```

Test Plan:
```
python test/distributed/test_c10d_nccl.py -k test_ddp_packed_sequence
```
Without the fix, the added unit test fails with the expected error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86614
Approved by: https://github.com/rohan-varma
2022-10-10 21:50:59 +00:00
b7b5bd47ae [MPS] Implement frac operator (#86625)
Implemented as a combination of self and trunc (frac(x) = x - trunc(x)).
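
A quick sanity check of that identity (a small CPU sketch, not the MPS kernel itself):

```python
import torch

x = torch.tensor([1.75, -0.5, 2.0])
# frac(x) equals x - trunc(x), the combination described above.
print(torch.frac(x))        # tensor([ 0.7500, -0.5000,  0.0000])
print(x - torch.trunc(x))   # same values
```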


Pull Request resolved: https://github.com/pytorch/pytorch/pull/86625
Approved by: https://github.com/kulinseth, https://github.com/albanD
2022-10-10 20:36:22 +00:00
885122b7dc Move PadNd from ATen/native to ATen (#82379)
Summary:
This header is being included from both aten/native and torch/csrc, but
some of our build configurations don't allow direct dependencies from
torch/csrc to aten/native, so put the header in aten where it's always
accessible.

Resolves https://github.com/pytorch/pytorch/issues/81198

Test Plan:
CI.
```
./scripts/build_android.sh
env ANDROID_ABI="x86_64" ANDROID_NDK=".../ndk-bundle" CMAKE_CXX_COMPILER_LAUNCHER=ccache CMAKE_C_COMPILER_LAUNCHER=ccache USE_VULKAN=0 ./scripts/build_android.sh
echo '#include <torch/torch.h>' > test.cpp
g++ -E -I $PWD/build_android/install/include/ -I $PWD/build_android/install/include/torch/csrc/api/include test.cpp >/dev/null
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82379
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-10-10 20:26:57 +00:00
e2a4dfa468 Add correct __all__ for torch.distributed and torch.cuda submodules (#85702)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85702
Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/rohan-varma
2022-10-10 19:15:24 +00:00
d93b1b9c4e Address feedback from previous PR (#86622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86622
Approved by: https://github.com/albanD
2022-10-10 18:53:41 +00:00
d792d75091 [quant][fix] Fix the call to get_executorch_backend_config (#86338)
Summary:
Previously, the call failed because there was an infinite loop in _get_share_qparams_ops_configs.

Test Plan:
python test/test_quantization.py -k test_get_executorch_backend_config

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86338
Approved by: https://github.com/andrewor14
2022-10-10 18:52:26 +00:00
2288a1c806 Added new option any_common_cpu_cuda_one to OpDTypes (#86286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86286
Approved by: https://github.com/lezcano, https://github.com/mruberry
2022-10-10 17:47:11 +00:00
8f2dda5bf2 [CI] Build MacOS M1 binaries without distributed support (#86451)
Partial fix for #86448, which causes the broken code to be exercised in CI. If this demonstrates the break, I'm not sure whether there should be a fix forward of https://github.com/pytorch/pytorch/pull/85781 or a revert.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86451
Approved by: https://github.com/malfet
2022-10-10 17:42:13 +00:00
dcc3ae98b7 [NestedTensor] Add a contiguous checks to get_buffer (#86496)
# Summary
Many NestedTensor ops are implemented using a convenience function named get_buffer. This returns a dense, contiguous tensor that is a view of the underlying storage of the NestedTensor. This function allows NestedTensor ops to piggyback off of the implementations for dense tensors in certain scenarios. This PR adds a TORCH_CHECK() to get_buffer to ensure that the calling NT is in fact contiguous. It also adds an "unsafe" version for a few ops that are designed to handle contiguity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86496
Approved by: https://github.com/albanD, https://github.com/cpuhrsch
2022-10-10 17:37:19 +00:00
ad449b338f [8/N] [Dispatchable Collectives] Update allgather with CPU / CUDA implementations  (#84423)
### Changes
- Updates for the allgather collective

### Context
https://github.com/pytorch/pytorch/issues/86225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84423
Approved by: https://github.com/kwen2501
2022-10-10 17:18:48 +00:00
9eb771583c symintify rand and randint functions and meta suport for randint (#86358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86358
Approved by: https://github.com/ezyang, https://github.com/albanD
2022-10-10 17:07:11 +00:00
67358ee124 MaxPool: correct pooling description (#86559)
In the documentation of `nn.MaxPool2d` and `nn.MaxPool3d`, the argument description of `padding` incorrectly states that zero padding is applied. The remainder of the documentation correctly states that negative infinity padding is applied.

The documentation of `padding` in `nn.MaxPool1d` and `nn.functional.max_pool1d/2d/3d` is correct.
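
A small sketch of the behavioral difference (all-negative input, so zero padding would be visible):

```python
import torch
import torch.nn as nn

# Implicit -inf padding means padded positions can never win the max.
x = torch.full((1, 1, 2, 2), -5.0)
pool = nn.MaxPool2d(kernel_size=2, stride=1, padding=1)
print(pool(x))  # every output value is -5.0; zero padding would have produced 0.0
```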
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86559
Approved by: https://github.com/albanD
2022-10-10 16:57:54 +00:00
16a0fa1204 Enable max.unary_out (#85926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85926
Approved by: https://github.com/bdhirsh
2022-10-10 16:53:33 +00:00
e18d466f35 [test_nn] split lazy_modules from test_nn (#86526)
Ref: #63085

NOTE: We don't need an accompanying XLA PR as these tests run only on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86526
Approved by: https://github.com/albanD
2022-10-10 16:29:56 +00:00
8a1fc5d2f8 [7/N] [Dispatchable Collectives] Update reduce with CPU / CUDA implementations (#83916)
### Changes
- Updates for the reduce collective

### Context
https://github.com/pytorch/pytorch/issues/86225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83916
Approved by: https://github.com/kwen2501
2022-10-10 15:58:37 +00:00
978b46d7c9 Reland 2 of Merge more symbolic meta kernels and symint changes from branch (#86334) (#86488)
symintify split_with_sizes, dropout, fused_fake_obs_quant. meta for padding_2d ops

add meta_bernoulli_

meta kernel for at::gather

get pytorch_struct to pass: meta for scatter_add, fix backward

symintify split ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86488
Approved by: https://github.com/ezyang
2022-10-10 15:54:28 +00:00
55663b7f81 Reland 3 of Symintify getitem and add the required helper functions (#86207) (#86487)
Note that this might not cover every use of the function (we know it doesn't),
but this is enough to get a few models passing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86487
Approved by: https://github.com/ezyang
2022-10-10 15:54:28 +00:00
4a5fdc56ec fix some composite compliance ops for functionalization (#86470)
Confirmed that this fixes https://github.com/pytorch/pytorch/issues/86384

cc @tugsbayasgalan

Functionalization should be included in the "isSubclass" checks that we run, for composite operators that have a different path for composite compliance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86470
Approved by: https://github.com/ezyang, https://github.com/zou3519
2022-10-10 14:27:18 +00:00
5102f0cffc [FSDP][1/N] Retire FlattenParamsWrapper (#86117)
This deprecates `FlattenParamsWrapper`'s usage for "unflattening" the original parameters. After this PR, FPW only serves to register and de-register its `FlatParameter` for the parent `FullyShardedDataParallel` instance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86117
Approved by: https://github.com/zhaojuanmao
2022-10-10 11:38:44 +00:00
bf7c46facf [xla hash update] update the pinned xla hash (#86099)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86099
Approved by: https://github.com/pytorchbot
2022-10-10 10:47:40 +00:00
5844f00bbf [FSDP] Add low_prec prefix to param and reduce dtype varnames (#86512)
This PR renames `param_dtype` and `reduce_dtype` in `HandleConfig` to `low_prec_param_dtype` and `low_prec_reduce_dtype` to emphasize that they are meant to be of the low precision (if not `None`).

(In my mind, mixed precision refers to the paradigm of using both full and low precision together during training. "Reduced" and "low precision" mean the same thing, but I prefer the term "low precision" in the code since it is shorter. A particular dtype can be a low precision dtype or a full precision dtype.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86512
Approved by: https://github.com/zhaojuanmao
2022-10-10 09:33:33 +00:00
cc5de7f1ac [FSDP] Remove utils.py (moved to _utils.py) (#86528)
I messed up my git with an earlier PR, where I did not actually remove `utils.py` when moving it to `_utils.py`. This removes `utils.py`, which is now outdated and unused.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86528
Approved by: https://github.com/H-Huang
2022-10-10 09:31:01 +00:00
c6b7c33885 torchdynamo: add linear eltwise fusion kernel (#85622)
Support fusion of linear with:

- relu
- sigmoid
- tanh
- hardswish
- leaky_relu
- hardtanh
- gelu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85622
Approved by: https://github.com/EikanWang, https://github.com/jansel
2022-10-10 05:47:11 +00:00
ec2d22ece0 [torchdynamo hash update] update the pinned torchdynamo hash (#86567)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned torchdynamo hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86567
Approved by: https://github.com/pytorchbot
2022-10-10 03:26:25 +00:00
753536b7a5 BlasKernel: Improve gemm's inner dot product when a is transposed (#80977)
`gemm_transab_` accumulates the sum in the output, despite the inner
loop being over a single output element. This changes it to accumulate
in a register, which also avoids early truncation for bfloat16.

I've also factored out a generic `sum` function that can be shared
with `gemm_transa_` to handle unrolling and multiple accumulators.
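
A toy illustration of the early-truncation effect (a hedged Python sketch, not the kernel code):

```python
import torch

vals = torch.full((4096,), 1e-2, dtype=torch.bfloat16)

# Rounding into a bfloat16 accumulator at every step (like summing directly
# into the output) stalls once the running sum dwarfs the addend.
acc = torch.tensor(0.0, dtype=torch.bfloat16)
for v in vals:
    acc = acc + v

# Accumulating in float32 (a register) and converting once keeps the sum.
ref = vals.to(torch.float32).sum()
print(acc.item(), ref.item())  # acc stalls near 4.0; the reference is ~41
```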

I have benchmarked addmm for bfloat16 with shapes
(320,600) X (600,320) and for both layouts I see a significant
speedup.

|  layout  | Before (ms) | After (ms) |
|----------|-------------|------------|
| transa   | 71.5        | 31         |
| transab  | 249         | 35         |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80977
Approved by: https://github.com/ngimel
2022-10-09 22:56:29 +00:00
a45fead623 mkl: Use per-operator headers (#75570)
Differential Revision: [D40126703](https://our.internmc.facebook.com/intern/diff/D40126703)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75570
Approved by: https://github.com/malfet
2022-10-09 20:12:55 +00:00
c89d286af6 symintify unbind_backward and tensor_split (#86357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86357
Approved by: https://github.com/albanD
2022-10-09 16:25:55 +00:00
a6c0442cce Add __all__ to torch.{autograd, fx, cuda} submodules (#85343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85343
Approved by: https://github.com/albanD
2022-10-09 14:46:54 +00:00
6aec0d3ddb [BE] Remove remaining cuda-11.3 builds (#86540)
`linux-bionic-cuda11_3-py3_7-clang9-build` is redundant as it is covered by `linux-jammy-cuda11.6-cudnn8-py3.8-clang12`

And migrate the no-per-operator-header build (which mimics internal behavior) from `linux-xenial-cuda11.3` to `linux-bionic-cuda11.7`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86540
Approved by: https://github.com/weiwangmeta, https://github.com/atalman
2022-10-09 14:20:46 +00:00
7134b9bc7b Quantized: Use per-operator headers (#75569)
Differential Revision: [D40126700](https://our.internmc.facebook.com/intern/diff/D40126700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75569
Approved by: https://github.com/malfet
2022-10-09 07:46:13 +00:00
67434c70df [MPS] Fix printTensor() for MPS (#86534)
MPS does not support the double type, so the tensor needs to be cast to CPU first
before it can be cast to double.

Also, do a little bit of BE, by initializing values and marking unused range variables with C10_UNUSED

Fixes https://github.com/pytorch/pytorch/issues/86410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86534
Approved by: https://github.com/weiwangmeta
2022-10-09 06:47:36 +00:00
9998f9100b [vision hash update] update the pinned vision hash (#86490)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86490
Approved by: https://github.com/pytorchbot
2022-10-09 03:30:07 +00:00
92ac84c98a [torchdynamo hash update] update the pinned torchdynamo hash (#86489)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned torchdynamo hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86489
Approved by: https://github.com/pytorchbot
2022-10-09 03:28:37 +00:00
492d1be5d2 QuantizedCPU: Use per-operator headers (#71217)
Differential Revision: [D33949895](https://our.internmc.facebook.com/intern/diff/D33949895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71217
Approved by: https://github.com/malfet
2022-10-09 03:27:50 +00:00
4bfe2a2450 cuDNN/miopen: Use per-operator headers (#71216)
Differential Revision: [D33949898](https://our.internmc.facebook.com/intern/diff/D33949898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71216
Approved by: https://github.com/malfet
2022-10-08 19:37:20 +00:00
33f0e98a49 Re-land*4 "SymIntify cat and narrow" (#86468)
This re-lands https://github.com/pytorch/pytorch/pull/86289 but with more wrappers.

Contains implicit inclusion of <ATen/native/NonSymbolicBC.h> in internal usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86468
Approved by: https://github.com/albanD
2022-10-08 07:17:37 +00:00
8ea2ed0fc7 Revert "Re-enable torchdynamo tests (#86297)"
This reverts commit e61028813007518bd6be0e6482a8742b84c30da7.

Reverted https://github.com/pytorch/pytorch/pull/86297 on behalf of https://github.com/malfet due to Reverting to return trunk back to green, dynamo shard2 started failing shortly after the merge
2022-10-08 05:14:40 +00:00
d3f7c34cb3 Enable aten-aten decomps (#85921)
Invokes aten-aten decomps with re-entrant FakeMode. These decomps are being used in other places, so it's good to unify the path static fake tensor takes / get additional testing etc. There is also an instance where we return different devices with cpu/cuda which this fixes ([batch_norm](https://github.com/pytorch/pytorch/blob/master/torch/_decomp/decompositions.py#L1374))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85921
Approved by: https://github.com/ezyang
2022-10-08 05:12:42 +00:00
af9c6bc851 [FSDP] Add keep_low_precision_grads support when CPU offloading (#86495)
When CPU offloading, FSDP uses `_cpu_grad`, not `_saved_grad_shard`. This adds support for `keep_low_precision_grads` for that case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86495
Approved by: https://github.com/rohan-varma
2022-10-08 03:26:40 +00:00
7ec12a559c Revert "Enable aten-aten decomps (#85921)"
This reverts commit 62e4f51efdf98a3a91d29efa55e5665d5398b464.

Reverted https://github.com/pytorch/pytorch/pull/85921 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. I think it breaks a dynamo test in trunk 62e4f51efd
2022-10-08 01:59:54 +00:00
b0ceb8ea1c [vulkan] Add buffer to buffer copies (#86424)
Differential Revision: [D40112702](https://our.internmc.facebook.com/intern/diff/D40112702/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86424
Approved by: https://github.com/kimishpatel
2022-10-08 01:32:17 +00:00
511d81cd2a [vulkan] Clean up convolution code (#86423)
Differential Revision: [D39553863](https://our.internmc.facebook.com/intern/diff/D39553863/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86423
Approved by: https://github.com/kimishpatel
2022-10-08 01:28:56 +00:00
b645c237bc make g2p ~30% faster on mobile by suppressing a log (#85907)
Summary: Using the tool from D39559248 I was able to make g2p faster on mobile by taking a look at profiles on stella frames. It turned out that the PyTorch interpreter code does some logging that ends up being a pretty big bottleneck.

Differential Revision: D39901455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85907
Approved by: https://github.com/dzdang
2022-10-08 01:25:03 +00:00
bac26155e7 [JIT] Allow freezing modules that contain mutable interfaces (#86039)
This PR allows freezing modules like the one below:
```python
# Ex. 1
        @torch.jit.interface
        class ModuleInterface(torch.nn.Module):
            def forward(self, inp: torch.Tensor) -> torch.Tensor:
                pass

        class ImplementsInterface(torch.nn.Module):
            def __init__(self):
                super(ImplementsInterface, self).__init__()
                self.sum = torch.zeros((2, 2))

            def forward(self, inp: torch.Tensor) -> torch.Tensor:
                self.sum += inp.relu()  # this makes the interface-implementing module mutable
                                        # and previously this would prevent freezing
                return self.sum

        class WrapperModule(torch.nn.Module):
            impl: ModuleInterface

            def __init__(self):
                super().__init__()
                self.impl = ImplementsInterface()

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                return self.impl.forward(x)
```

Previously during freezing, we handle interfaces as shown below:
1. we inline interfaces in any preserved method graphs
2. during `cleanupFrozenModule`, we try to simplify the module data structure (<- this part is unrelated to freezing so far). During this step, if we found that an interface type was mutable, we'd error out, because of the possibility of a module that _swaps out the value of an interface-typed attribute at runtime_.

Below is an example of a module that swaps out the value of an interface-typed attribute at runtime:
```python
# Ex. 2
class MyBadModule(torch.nn.Module):
    impl: MyInterface
    option1: IfaceImpl1
    option2: IfaceImpl2
    ....
    def forward(self, x):
        if x > 0:
            self.impl = self.option1
        else:
            self.impl = self.option2
        ....
```

^ this type of situation cannot be supported by freezing (or at least would be difficult to do correctly) because it greatly complicates the details of handling types and simplifying the module data structure.

But we can still support the first example without _too_ much work:
1. inline the interface code as before
2. check to see if we have any setattrs on interface types; if so, error out
3. otherwise, replace the type of the interface types with the concrete type implementation
4. continue simplifying the module data structure as if we never had any interfaces.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86039
Approved by: https://github.com/eellison
2022-10-08 00:38:11 +00:00
04490e90ea better error message fix (#86422)
Summary:
A user had a problem with fx-scripting and the error message can be improved.

Error was shown as:

RuntimeError: Keys for dictionaries used as an argument cannot contain a Node. Got key: {k}

which is obviously not very helpful.

Test Plan:
Test in a notebook:
{F778667593}

Reviewed By: xunnanxu, SherlockNoMad

Differential Revision: D40157518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86422
Approved by: https://github.com/SherlockNoMad
2022-10-08 00:06:05 +00:00
zaf
3a02873183 [quant][ao_migration] nn.intrinsic.quantized migration to ao (#86172)
All quantization-related modules are being migrated to `torch.ao`. This migrates the `nn.intrinsic.quantized`. Please, see the [tracker](https://github.com/pytorch/pytorch/issues/81667) for the timeline.

```
python test/test_quantization.py -- TestAOMigrationNNIntrinsic
```

Internal:

```
buck2 test @mode/dev-nosan //caffe2/test:quantization -- TestAOMigrationNNIntrinsic
```

Differential Revision: [D39425515](https://our.internmc.facebook.com/intern/diff/D39425515/)

Differential Revision: [D39425515](https://our.internmc.facebook.com/intern/diff/D39425515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86172
Approved by: https://github.com/jerryzh168
2022-10-08 00:01:38 +00:00
91b1bae1df Caching allocator tracing (#86241)
We currently can take snapshots of the state of the allocated cuda memory, but we do not have a way to correlate these snapshots with the actions the allocator that were taken between snapshots. This PR adds a simple fixed-sized buffer that records the major actions that the allocator takes (ALLOC, FREE, SEGMENT_ALLOC, SEGMENT_FREE, OOM, SNAPSHOT) and includes these with the snapshot information. Capturing period snapshots with a big enough trace buffer makes it possible to see how the allocator state changes over time.

We plan to use this functionality to guide how settings in the allocator can be adjusted and eventually have a more robust overall algorithm.

As a component of this functionality, we also add the ability to get a callback when the allocator will throw an OOM, primarily so that snapshots can be taken immediately to see why the program ran out of memory (most programs have some C++ state that would free tensors before the OutOfMemory exception can be caught).

This PR also updates the _memory_viz.py script to pretty-print the trace information and provide a better textual summary of snapshots distinguishing between internal and external fragmentation.
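
For reference, a hedged sketch of the pre-existing snapshot entry point that the new trace records complement (field names follow the current snapshot format):

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    snap = torch.cuda.memory_snapshot()   # list of per-segment dicts
    seg = snap[0]
    print(len(snap), seg["device"], seg["total_size"], seg["allocated_size"])
```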
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86241
Approved by: https://github.com/ngimel
2022-10-07 23:19:54 +00:00
8a3a54e012 Fix index_select decomp (#86469)
For decomposing index_select with a 0-dim tensor, we cannot write `x.unsqueeze(0)[index].squeeze(0).clone()`, as tensor[index] will trigger index.item() if index is a 0-dim tensor, and .item() cannot be symbolically traced with FakeTensor.

We use `torch.ops.aten.index(x.unsqueeze(0), [index]).squeeze(0).clone()` as a workaround.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86469
Approved by: https://github.com/ngimel
2022-10-07 22:59:49 +00:00
a079dad7cf Skip dynamo for all optim test as they are all flaky otherwise (#86482)
Fixes https://github.com/pytorch/pytorch/issues/86433
Fixes https://github.com/pytorch/pytorch/issues/86435
Fixes https://github.com/pytorch/pytorch/issues/86432
Fixes https://github.com/pytorch/pytorch/issues/86389
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86482
Approved by: https://github.com/ezyang
2022-10-07 22:47:48 +00:00
ba3fde6aa0 Add multi-grad hooks (#86260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86260
Approved by: https://github.com/albanD
2022-10-07 21:16:45 +00:00
97e56c176d Try to fix shutdown test in edge cases (#86464)
Fixes https://github.com/pytorch/pytorch/issues/85259
See the issue for debugging details.
tl;dr: when a worker thread is actually used, make sure it is initialized before exiting.
Yes, it is very unlikely it will take >10s to initialize but it is what seems to happen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86464
Approved by: https://github.com/soulitzer, https://github.com/ezyang
2022-10-07 21:09:40 +00:00
62e4f51efd Enable aten-aten decomps (#85921)
Invokes aten-aten decomps with re-entrant FakeMode. These decomps are being used in other places, so it's good to unify the path static fake tensor takes / get additional testing etc. There is also an instance where we return different devices with cpu/cuda which this fixes ([batch_norm](https://github.com/pytorch/pytorch/blob/master/torch/_decomp/decompositions.py#L1374))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85921
Approved by: https://github.com/ezyang
2022-10-07 21:04:39 +00:00
a95889ba7c [FSDP] Add initial summon_full_params(with_grads=True) (#85738)
This adds `summon_full_params(with_grads=True)` for `use_orig_params=True` and `offload_to_cpu=False`. Filling in the `use_orig_params=False` case requires some already-planned refactoring, and the `offload_to_cpu=True` case needs some additional work as well.

Adding this is helpful for debugging `use_orig_params=True` to make sure gradients are being updated correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85738
Approved by: https://github.com/rohan-varma
2022-10-07 21:03:18 +00:00
82229d1e33 [optim] fix: empty grad support for SparseAdam (#86459)
Fixes #82486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86459
Approved by: https://github.com/albanD
2022-10-07 19:24:59 +00:00
66d480d314 Revert "Disable mac m1 jobs (#86463)"
This reverts commit ac632b437489b4c0c2714d5ad37517bb60e09750.

Reverted https://github.com/pytorch/pytorch/pull/86463 on behalf of https://github.com/huydhn due to Queue is decreasing, re-enable the jobs
2022-10-07 18:55:01 +00:00
ac632b4374 Disable mac m1 jobs (#86463)
There is a queue and some runners are not accessible.

This is to mitigate the Sev https://github.com/pytorch/pytorch/issues/86466

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86463
Approved by: https://github.com/clee2000
2022-10-07 18:28:47 +00:00
ac74976a56 [ao] fixing public v private for fuser_method_mappings.py (#86029)
Summary: no significant changes, just added __all__

Test Plan: python test/test_public_bindings.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86029
Approved by: https://github.com/jerryzh168
2022-10-07 18:11:42 +00:00
be682befbc [FSDP] Add use_orig_params (#84911)
**Overview**
This PR adds the option to use the original parameters via `use_orig_params=True` in the FSDP constructor.
- This exposes the original parameters rather than the `FlatParameter`s from `named_parameters()`, which means that the optimizer runs on the original parameters. Hence, users may assign original parameters from the same `FlatParameter` to different parameter groups.
- This enables decoupling the original parameter variables from their storage without changing the variables themselves, which is critical for our upcoming execution-order-based non-recursive wrapping policy.

For more detailed design explanation, refer to the Quip shared internally.

**Follow-Ups**
See 85831 (removing link to avoid spamming the issue whenever I update this PR).

`test_fsdp_use_orig_params.py` adds ~4 min 46 seconds to the TTS on the AWS cluster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84911
Approved by: https://github.com/rohan-varma
2022-10-07 18:07:17 +00:00
b43ae1c411 Add reference counter in FileStore (#85601)
Fixes #67566.

This diff adds a reference counter to the FileStore object. The underlying file is removed only when the reference counter drops to 0.
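
A small usage sketch of the resulting behavior (the file path is just an example):

```python
import torch.distributed as dist

path = "/tmp/example_filestore"      # hypothetical path
store_a = dist.FileStore(path, 1)    # world_size=1 for illustration
store_b = dist.FileStore(path, 1)
store_a.set("key", "value")
print(store_b.get("key"))            # b'value'
del store_a                          # file kept alive: store_b still references it
del store_b                          # last reference dropped: file is removed
```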

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85601
Approved by: https://github.com/H-Huang
2022-10-07 17:59:29 +00:00
zaf
efccb6401c [quant][ao_migration] nn.intrinsic.qat migration to ao (#86171)
All quantization-related modules are being migrated to `torch.ao`. This migrates the `nn.intrinsic.qat`. Please, see the [tracker](https://github.com/pytorch/pytorch/issues/81667) for the timeline.

```
python test/test_quantization.py TestAOMigrationNNIntrinsic
```

Differential Revision: [D39419993](https://our.internmc.facebook.com/intern/diff/D39419993/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39419993/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86171
Approved by: https://github.com/jerryzh168
2022-10-07 17:29:42 +00:00
e610288130 Re-enable torchdynamo tests (#86297)
We temporarily skipped torchdynamo tests due to many failures; now we fix the problems and re-enable the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86297
Approved by: https://github.com/anijain2305
2022-10-07 17:16:40 +00:00
e8d3b7201c [ao] fixing public v private for fuse_modules.py (#86028)
Summary: no significant changes, just added __all__

Test Plan: python test/test_public_bindings.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86028
Approved by: https://github.com/jerryzh168
2022-10-07 17:12:33 +00:00
d29912cc06 [ao] fixing public v private for torch/ao/quantization (#86027)
Summary: no significant changes, just needed to add __all__

Test Plan: python test/test_public_bindings.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86027
Approved by: https://github.com/jerryzh168
2022-10-07 17:12:18 +00:00
65b408074f Revert "Relandx3 "SymIntify cat and narrow" (#86289)"
This reverts commit a00f8489df5586178d7b5f83928bf8049ce32f24.

Reverted https://github.com/pytorch/pytorch/pull/86289 on behalf of https://github.com/malfet due to @seemether  unlanded the rest of the stack and it will fail intern import anyway
2022-10-07 16:29:27 +00:00
5b69b87d5a Revert "Symintify getitem and add the required helper functions (#86207)"
This reverts commit fd5085c445c3f1a4c90e55154cf26fe30f52a0ab.

Reverted https://github.com/pytorch/pytorch/pull/86207 on behalf of https://github.com/seemethere due to  Fails internal tests, see: https://www.internalfb.com/intern/sandcastle/job/22517998926071860/insights
2022-10-07 16:10:30 +00:00
75df4b5e3d Revert "Merge more symbolic meta kernels and symint changes from branch (#86334)"
This reverts commit 08e3999fa494238f8f62346a140da36bd43864e7.

Reverted https://github.com/pytorch/pytorch/pull/86334 on behalf of https://github.com/seemethere due to Trying to revert https://github.com/pytorch/pytorch/pull/86207, this PR causes merge conflicts with the initial revert so will have to revert this as well
2022-10-07 16:03:30 +00:00
b3fdb02fb2 Fix memory leak in _LRScheduler.step() (#85602)
Fixes #85410

This diff removes the cyclic references in `_LRScheduler.step()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85602
Approved by: https://github.com/albanD
2022-10-07 15:55:55 +00:00
0e639ff45c Revert "Cleanup PT-D imports (#85781)"
This reverts commit 9a170b24f64d7cfdd887ff122c241ac6ff85f4c6.

Reverted https://github.com/pytorch/pytorch/pull/85781 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-10-07 14:55:44 +00:00
9b2ea41f48 COO intersection primitives : fusing value selection with value intersection. (#86269)
As per title. This one fuses 3 kernels into 1 with about a 10-20% performance improvement.
This kernel is also useful for union-like operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86269
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2022-10-07 14:50:48 +00:00
e125baf90b [autocast] Clean up registrations using new macros (#86403)
This PR cleans up m.impl(...) calls to use the new KERNEL / KERNEL_CPU
macros. That saves us the trouble of writing out the signatures.

Test Plan:
- code reading
- wait for tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86403
Approved by: https://github.com/ezyang
2022-10-07 14:14:38 +00:00
9b74267eb6 [autocast] Make it easier to register rules (#86402)
On the way to resolving https://github.com/pytorch/pytorch/issues/86294

Previously, there were three macros used to register autocast rules:
- KERNEL
- KERNEL_DIFFERENT_REDISPATCH_SIGNATURE
- KERNEL_CPU

This PR makes the KERNEL and KERNEL_CPU macros less redundant for users.
KERNEL_DIFFERENT_REDISPATCH_SIGNATURE is weird and only used three
times, so I didn't change them.

Concretely, KERNEL(OP, OP_NAME, SIGNATURE, POLICY) is redundant:
- op/op_name are similar, and the signature can be decltype'd.
PR changes it so that instead, one uses either:
- KERNEL(OP, POLICY)
- KERNEL2(OP, OVERLOAD, POLICY)
depending on whether the operator name has an overload.

This PR also gives the same treatment to the KERNEL_CPU macro, which is
used for registering autocast cpu rules: it splits KERNEL_CPU into
KERNEL_CPU(OP, POLICY) AND KERNEL_CPU2(OP, OVERLOAD, POLICY).

I will do some more cleanup of things that are implemented via
`m.impl(...)` in a follow-up PR so that I don't get confused when I need
to rebase.

Test Plan:
- wait for tests (how good are our autocast tests?)
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86402
Approved by: https://github.com/ezyang
2022-10-07 14:14:38 +00:00
55f5e0de8d remove unused arg from impl_func_cum_ops (#86364)
Fixes #86224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86364
Approved by: https://github.com/bdhirsh
2022-10-07 14:13:15 +00:00
a00f8489df Relandx3 "SymIntify cat and narrow" (#86289)
This reverts commit fc94a2115b31dfe7a0d8f28eb4f5ed532c4f0792.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86289
Approved by: https://github.com/wconstab
2022-10-07 14:04:10 +00:00
cc9183eb4c Update distributed.rst backend collective support chart (#86406)
NCCL `scatter` was added by Wanchao in https://github.com/pytorch/pytorch/pull/70029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86406
Approved by: https://github.com/wanchaol
2022-10-07 12:59:09 +00:00
b74ca31bf6 [fix] sum_to_size: MathBits test - don't reuse same input tensor (#86378)
Fixes #85409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86378
Approved by: https://github.com/anjali411
2022-10-07 12:12:03 +00:00
facbddb9ff Override Quantized Backend to use Fbgemm in Qlinear Packed Params Test (#86236)
Summary: After D39934051, we must explicitly call `override_quantized_engine('fbgemm')` for this test to work
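
For context, an equivalent effect can be sketched by swapping the backend engine around the test body (assuming fbgemm is available on the build):

```python
import torch

# Roughly what override_quantized_engine('fbgemm') does for the duration of a test.
prev = torch.backends.quantized.engine
torch.backends.quantized.engine = "fbgemm"
try:
    pass  # run the TestQlinearPackedParams checks here
finally:
    torch.backends.quantized.engine = prev
```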

Test Plan:
```
buck test //caffe2/test:ao -- TestQlinearPackedParams
```

Before:
```
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/5629499663624574
    ✓ ListingSuccess: caffe2/test:ao : 72 tests discovered (32.830)
    ✓ Pass: caffe2/test:ao - test_qlinear_packed_params_qnnpack (ao.sparsity.test_qlinear_packed_params.TestQlinearPackedParams) (25.085)
    ✗ Fail: caffe2/test:ao - test_qlinear_packed_params (ao.sparsity.test_qlinear_packed_params.TestQlinearPackedParams) (26.706)
Test output:
> RuntimeError: Didn't find engine for operation ao::sparse::qlinear_prepack X86
```

After:
```
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/7599824485968786
    ✓ ListingSuccess: caffe2/test:ao : 72 tests discovered (31.082)
    ✓ Pass: caffe2/test:ao - test_qlinear_packed_params_fbgemm (ao.sparsity.test_qlinear_packed_params.TestQlinearPackedParams) (100.409)
    ✓ Pass: caffe2/test:ao - test_qlinear_packed_params_qnnpack (ao.sparsity.test_qlinear_packed_params.TestQlinearPackedParams) (100.544)
Summary
  Pass: 2
  ListingSuccess: 1
```

Differential Revision: D40078176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86236
Approved by: https://github.com/jmdetloff, https://github.com/z-a-f
2022-10-07 11:58:41 +00:00
dbea07b6aa [Profiler] record gradient from nnModule (#86355)
Summary:
- catch .grad tensor info
- update data type and `check_and_store`, etc
- update unit test case

Test Plan: buck run mode/opt //caffe2/test:profiler

Reviewed By: chaekit

Differential Revision: D39711295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86355
Approved by: https://github.com/chaekit
2022-10-07 09:58:50 +00:00
28a0b3fb18 Fix col2im and im2col decompositions (#86426)
I threw in some tests for good measure.

Fixes https://github.com/pytorch/pytorch/issues/86332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86426
Approved by: https://github.com/ngimel
2022-10-07 08:14:06 +00:00
93b2d99158 Improve interpolate() speed for channels_last images and masks (#86361)
This PR improves the speed of `interpolate()`:
-  on images and masks (`num_channels < 4`, `channels_last=True`)
- for the following modes: linear (antialias=False), nearest (int and float), and nearest-exact (int and float)
- for both upsampling and downsampling

The actual speed-up ranges from 1.1X to 110X, but this depends on various factors like number of threads and of course input_size/output_size.  In a typical torchvision ImageNet training job (where num_threads=1 because of DataLoader multi-processing), the following speed-ups should be expected (I ran much more benchmarks than this one, see below for more details):

```
(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=1   1.0X  1.0ms vs 1.0ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   2.1X  1.0ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=1   7X    0.8ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   14X   0.852ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.828ms vs 0.087ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   15X   0.922ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.897ms vs 0.087ms
```

An immediate follow-up to this PR would be to do the same changes for the 3D kernels.
Thanks a ton @fmassa for the help!
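
For reference, a minimal usage sketch of the accelerated path (channels_last, few channels; these are standard `torch.nn.functional.interpolate` arguments):

```python
import torch
import torch.nn.functional as F

img = torch.rand(1, 3, 600, 400).to(memory_format=torch.channels_last)
out = F.interpolate(img, size=(224, 224), mode="bilinear",
                    align_corners=False, antialias=False)

mask = torch.randint(0, 2, (1, 1, 600, 400), dtype=torch.uint8)
mask = mask.to(memory_format=torch.channels_last)
small = F.interpolate(mask, size=(224, 224), mode="nearest")
```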

### Speedup benchmarks:

Results:

<details>

```
----------------------------------------------------------------------------------------------------
(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=1   0.9X  0.9ms vs 1.1ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=1   1.6X  0.9ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=1   1.7X  1.0ms vs 0.5ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=1   8X    0.806ms vs 0.097ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=1   15X   0.848ms vs 0.056ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=1   10X   0.828ms vs 0.084ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=1   16X   0.914ms vs 0.057ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=1   10X   0.900ms vs 0.086ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=2   1.6X  1.1ms vs 0.7ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=2   1.6X  0.6ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=2   1.7X  0.6ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=2   1.7X  0.5ms vs 0.3ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=2   9X    0.800ms vs 0.088ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=2   11X   0.459ms vs 0.043ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=2   7X    0.424ms vs 0.064ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=2   12X   0.503ms vs 0.043ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=2   8X    0.461ms vs 0.059ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=12  3X    1.1ms vs 0.3ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=12  1.6X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=12  1.5X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=12  5X    0.8ms vs 0.2ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=12  10X   0.445ms vs 0.047ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=12  7X    0.432ms vs 0.062ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=12  10X   0.478ms vs 0.046ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=12  7X    0.470ms vs 0.063ms

(1, 3, 64, 64) -> (224, 224)    linear          float32    num_threads=32  3X    1.1ms vs 0.4ms
(1, 3, 64, 64) -> (224, 224)    nearest         float32    num_threads=32  1.8X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=32  1.4X  0.3ms vs 0.2ms
(1, 3, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 1, 64, 64) -> (224, 224)    linear          float32    num_threads=32  11X   0.815ms vs 0.074ms
(1, 1, 64, 64) -> (224, 224)    nearest         float32    num_threads=32  10X   0.443ms vs 0.045ms
(1, 1, 64, 64) -> (224, 224)    nearest         uint8      num_threads=32  7X    0.436ms vs 0.061ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   float32    num_threads=32  10X   0.478ms vs 0.046ms
(1, 1, 64, 64) -> (224, 224)    nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.061ms
----------------------------------------------------------------------------------------------------
(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=1   0.9X  0.9ms vs 1.1ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=1   1.5X  0.9ms vs 0.6ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=1   1.6X  1.0ms vs 0.6ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=1   8X    0.808ms vs 0.099ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=1   15X   0.848ms vs 0.058ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.820ms vs 0.087ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=1   16X   0.909ms vs 0.059ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.898ms vs 0.088ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=2   1.4X  0.9ms vs 0.7ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=2   1.5X  0.5ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=2   1.5X  0.5ms vs 0.4ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=2   1.8X  0.5ms vs 0.3ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=2   9X    0.799ms vs 0.090ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=2   10X   0.459ms vs 0.045ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=2   7X    0.427ms vs 0.059ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=2   11X   0.501ms vs 0.044ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=2   8X    0.460ms vs 0.060ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=12  2.9X  1.0ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=12  1.2X  0.2ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=12  1.1X  0.2ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=12  12X   0.809ms vs 0.068ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=12  11X   0.438ms vs 0.041ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=12  8X    0.432ms vs 0.055ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=12  12X   0.480ms vs 0.041ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=12  8X    0.464ms vs 0.056ms

(1, 3, 128, 128) -> (224, 224)  linear          float32    num_threads=32  3X    1.1ms vs 0.3ms
(1, 3, 128, 128) -> (224, 224)  nearest         float32    num_threads=32  1.3X  0.3ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=32  1.4X  0.3ms vs 0.2ms
(1, 3, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 128, 128) -> (224, 224)  linear          float32    num_threads=32  11X   0.813ms vs 0.075ms
(1, 1, 128, 128) -> (224, 224)  nearest         float32    num_threads=32  10X   0.443ms vs 0.046ms
(1, 1, 128, 128) -> (224, 224)  nearest         uint8      num_threads=32  7X    0.433ms vs 0.061ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   float32    num_threads=32  10X   0.478ms vs 0.046ms
(1, 1, 128, 128) -> (224, 224)  nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.062ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=1   0.9X  4.5ms vs 5.2ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=1   1.5X  4.2ms vs 2.8ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=1   1.8X  4.1ms vs 2.3ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=1   1.6X  4.5ms vs 2.8ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=1   1.9X  4.4ms vs 2.3ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=1   9X    3.8ms vs 0.4ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=1   17X   4.0ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=1   11X   3.9ms vs 0.4ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=1   19X   4.4ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=1   12X   4.3ms vs 0.4ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=2   1.5X  4.5ms vs 3.1ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=2   1.4X  2.3ms vs 1.6ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=2   1.7X  2.1ms vs 1.2ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=2   1.6X  2.5ms vs 1.6ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=2   1.8X  2.2ms vs 1.2ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=2   15X   3.8ms vs 0.3ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=2   15X   2.2ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=2   7X    2.0ms vs 0.3ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=2   16X   2.4ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=2   8X    2.2ms vs 0.3ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=12  8X    5.2ms vs 0.7ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=12  1.3X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=12  1.7X  0.4ms vs 0.2ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=12  1.4X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=12  1.8X  0.4ms vs 0.2ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=12  36X   3.9ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=12  10X   0.526ms vs 0.051ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=12  7X    0.514ms vs 0.069ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=12  11X   0.569ms vs 0.052ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=12  8X    0.557ms vs 0.070ms

(1, 3, 224, 224) -> (600, 400)  linear          float32    num_threads=32  9X    4.5ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest         float32    num_threads=32  0.5X  0.2ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=32  1.0X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (600, 400)  linear          float32    num_threads=32  44X   3.864ms vs 0.087ms
(1, 1, 224, 224) -> (600, 400)  nearest         float32    num_threads=32  10X   0.527ms vs 0.053ms
(1, 1, 224, 224) -> (600, 400)  nearest         uint8      num_threads=32  7X    0.516ms vs 0.070ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   float32    num_threads=32  10X   0.567ms vs 0.055ms
(1, 1, 224, 224) -> (600, 400)  nearest-exact   uint8      num_threads=32  8X    0.558ms vs 0.072ms
----------------------------------------------------------------------------------------------------
(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=1   1.0X  1.9ms vs 1.9ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=1   2.0X  1.8ms vs 0.9ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=1   1.7X  1.8ms vs 1.0ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=1   2.1X  1.9ms vs 0.9ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=1   1.9X  1.9ms vs 1.0ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=1   9X    1.6ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=1   16X   1.7ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=1   10X   1.7ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=1   17X   1.9ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=1   11X   1.8ms vs 0.2ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=2   1.7X  1.9ms vs 1.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=2   2.0X  1.0ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=2   1.7X  0.9ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=2   2.3X  1.1ms vs 0.5ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=2   1.8X  1.0ms vs 0.5ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=2   8X    1.6ms vs 0.2ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=2   14X   0.931ms vs 0.067ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=2   7X    0.9ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=2   15X   1.016ms vs 0.069ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=2   9X    0.9ms vs 0.1ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=12  8X    1.9ms vs 0.3ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=12  1.7X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=12  1.9X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=12  20X   1.630ms vs 0.081ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=12  10X   0.457ms vs 0.044ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=12  7X    0.439ms vs 0.060ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=12  11X   0.485ms vs 0.045ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=12  8X    0.474ms vs 0.061ms

(1, 3, 256, 256) -> (320, 320)  linear          float32    num_threads=32  8X    1.9ms vs 0.3ms
(1, 3, 256, 256) -> (320, 320)  nearest         float32    num_threads=32  2.0X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest         uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=32  1.4X  0.2ms vs 0.2ms
(1, 3, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 1, 256, 256) -> (320, 320)  linear          float32    num_threads=32  21X   1.628ms vs 0.078ms
(1, 1, 256, 256) -> (320, 320)  nearest         float32    num_threads=32  9X    0.453ms vs 0.048ms
(1, 1, 256, 256) -> (320, 320)  nearest         uint8      num_threads=32  7X    0.445ms vs 0.063ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   float32    num_threads=32  11X   0.535ms vs 0.048ms
(1, 1, 256, 256) -> (320, 320)  nearest-exact   uint8      num_threads=32  8X    0.502ms vs 0.063ms
----------------------------------------------------------------------------------------------------
(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=1   1.0X  13.8ms vs 14.0ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=1   1.8X  13.1ms vs 7.4ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=1   1.8X  11.1ms vs 6.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=1   1.9X  13.9ms vs 7.4ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=1   1.9X  11.8ms vs 6.1ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=1   10X   10.2ms vs 1.1ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=1   19X   10.8ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=1   11X   10.4ms vs 0.9ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=1   20X   11.6ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=1   12X   11.4ms vs 0.9ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=2   1.8X  13.7ms vs 7.7ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=2   2.6X  7.3ms vs 2.8ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=2   1.8X  5.6ms vs 3.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=2   1.9X  7.9ms vs 4.1ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=2   1.9X  6.0ms vs 3.1ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=2   18X   10.1ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=2   19X   5.8ms vs 0.3ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=2   10X   5.3ms vs 0.5ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=2   20X   6.3ms vs 0.3ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=2   11X   5.7ms vs 0.5ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=12  8X    13.8ms vs 1.6ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=12  2.9X  1.5ms vs 0.5ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=12  1.7X  1.0ms vs 0.5ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=12  1.5X  1.5ms vs 1.0ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=12  1.8X  1.0ms vs 0.6ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=12  80X   10.1ms vs 0.1ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=12  13X   0.928ms vs 0.072ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=12  8X    0.9ms vs 0.1ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=12  13X   1.001ms vs 0.074ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=12  9X    1.0ms vs 0.1ms

(1, 3, 500, 500) -> (800, 800)  linear          float32    num_threads=32  18X   14.0ms vs 0.8ms
(1, 3, 500, 500) -> (800, 800)  nearest         float32    num_threads=32  1.9X  1.0ms vs 0.6ms
(1, 3, 500, 500) -> (800, 800)  nearest         uint8      num_threads=32  2.9X  0.7ms vs 0.2ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=32  1.7X  0.9ms vs 0.6ms
(1, 3, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=32  1.8X  0.4ms vs 0.2ms
(1, 1, 500, 500) -> (800, 800)  linear          float32    num_threads=32  111X  10.254ms vs 0.092ms
(1, 1, 500, 500) -> (800, 800)  nearest         float32    num_threads=32  14X   0.784ms vs 0.056ms
(1, 1, 500, 500) -> (800, 800)  nearest         uint8      num_threads=32  7X    0.551ms vs 0.075ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   float32    num_threads=32  11X   0.607ms vs 0.057ms
(1, 1, 500, 500) -> (800, 800)  nearest-exact   uint8      num_threads=32  8X    0.596ms vs 0.076ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=1   1.0X  0.084ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=1   1.0X  0.077ms vs 0.078ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=1   1.0X  0.076ms vs 0.076ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=1   1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=1   1.0X  0.081ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=1   1.0X  0.071ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=1   1.0X  0.074ms vs 0.074ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=1   1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=1   1.0X  0.080ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=1   0.9X  0.078ms vs 0.084ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=2   1.0X  0.083ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=2   1.0X  0.076ms vs 0.077ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=2   1.0X  0.075ms vs 0.074ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=2   1.0X  0.082ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=2   1.0X  0.080ms vs 0.083ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=2   1.0X  0.070ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=2   1.0X  0.073ms vs 0.075ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=2   1.0X  0.071ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=2   1.0X  0.079ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=2   1.0X  0.077ms vs 0.079ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=12  1.0X  0.083ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=12  1.0X  0.080ms vs 0.078ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=12  1.0X  0.077ms vs 0.075ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=12  1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=12  1.0X  0.083ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=12  1.0X  0.071ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=12  1.0X  0.076ms vs 0.074ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=12  1.0X  0.073ms vs 0.071ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=12  1.0X  0.080ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=12  1.0X  0.080ms vs 0.078ms

(1, 3, 224, 224) -> (64, 64)    linear          float32    num_threads=32  1.0X  0.084ms vs 0.084ms
(1, 3, 224, 224) -> (64, 64)    nearest         float32    num_threads=32  1.0X  0.078ms vs 0.077ms
(1, 3, 224, 224) -> (64, 64)    nearest         uint8      num_threads=32  1.0X  0.076ms vs 0.076ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=32  1.0X  0.083ms vs 0.083ms
(1, 3, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=32  1.0X  0.081ms vs 0.082ms
(1, 1, 224, 224) -> (64, 64)    linear          float32    num_threads=32  1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest         float32    num_threads=32  1.0X  0.074ms vs 0.075ms
(1, 1, 224, 224) -> (64, 64)    nearest         uint8      num_threads=32  1.0X  0.072ms vs 0.072ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   float32    num_threads=32  1.0X  0.077ms vs 0.080ms
(1, 1, 224, 224) -> (64, 64)    nearest-exact   uint8      num_threads=32  1.0X  0.076ms vs 0.079ms
----------------------------------------------------------------------------------------------------
(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=1   1.0X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=1   1.8X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=1   1.6X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=1   2.0X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=1   1.7X  0.3ms vs 0.2ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=1   6X    0.265ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=1   10X   0.280ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=1   7X    0.273ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=1   11X   0.303ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=1   8X    0.297ms vs 0.038ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=2   1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=2   1.8X  0.163ms vs 0.093ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=2   1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=2   1.9X  0.180ms vs 0.096ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=2   1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=2   6X    0.264ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=2   10X   0.278ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=2   7X    0.270ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=2   11X   0.298ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=2   8X    0.293ms vs 0.037ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=12  1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=12  1.7X  0.158ms vs 0.095ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=12  1.7X  0.170ms vs 0.100ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=12  6X    0.269ms vs 0.043ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=12  11X   0.291ms vs 0.027ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=12  8X    0.281ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=12  11X   0.305ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=12  8X    0.306ms vs 0.038ms

(1, 3, 224, 224) -> (128, 128)  linear          float32    num_threads=32  1.5X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (128, 128)  nearest         float32    num_threads=32  1.6X  0.160ms vs 0.098ms
(1, 3, 224, 224) -> (128, 128)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=32  1.7X  0.171ms vs 0.099ms
(1, 3, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 224, 224) -> (128, 128)  linear          float32    num_threads=32  6X    0.269ms vs 0.044ms
(1, 1, 224, 224) -> (128, 128)  nearest         float32    num_threads=32  10X   0.282ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest         uint8      num_threads=32  7X    0.276ms vs 0.037ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   float32    num_threads=32  11X   0.305ms vs 0.028ms
(1, 1, 224, 224) -> (128, 128)  nearest-exact   uint8      num_threads=32  8X    0.299ms vs 0.038ms
----------------------------------------------------------------------------------------------------
(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=1   1.0X  1.2ms vs 1.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=1   2.0X  1.2ms vs 0.6ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=1   1.7X  1.1ms vs 0.7ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=1   2.1X  1.2ms vs 0.6ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=1   1.9X  1.2ms vs 0.7ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=1   8X    1.1ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=1   15X   1.109ms vs 0.073ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=1   10X   1.1ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=1   16X   1.192ms vs 0.074ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=1   11X   1.2ms vs 0.1ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=2   1.7X  1.2ms vs 0.7ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=2   2.0X  0.6ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=2   1.7X  0.6ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=2   2.2X  0.7ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=2   1.8X  0.6ms vs 0.3ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=2   9X    1.0ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=2   11X   0.598ms vs 0.052ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=2   8X    0.556ms vs 0.072ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=2   12X   0.649ms vs 0.053ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=2   8X    0.598ms vs 0.073ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=12  5X    1.2ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=12  1.3X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=12  1.4X  0.2ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=12  9X    1.0ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=12  12X   0.572ms vs 0.048ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=12  8X    0.560ms vs 0.068ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=12  13X   0.617ms vs 0.049ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=12  9X    0.604ms vs 0.068ms

(1, 3, 320, 320) -> (256, 256)  linear          float32    num_threads=32  5X    1.2ms vs 0.3ms
(1, 3, 320, 320) -> (256, 256)  nearest         float32    num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest         uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 3, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=32  1.4X  0.2ms vs 0.1ms
(1, 1, 320, 320) -> (256, 256)  linear          float32    num_threads=32  13X   1.042ms vs 0.081ms
(1, 1, 320, 320) -> (256, 256)  nearest         float32    num_threads=32  12X   0.586ms vs 0.050ms
(1, 1, 320, 320) -> (256, 256)  nearest         uint8      num_threads=32  8X    0.562ms vs 0.069ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   float32    num_threads=32  12X   0.621ms vs 0.051ms
(1, 1, 320, 320) -> (256, 256)  nearest-exact   uint8      num_threads=32  9X    0.609ms vs 0.070ms
----------------------------------------------------------------------------------------------------
(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=1   1.0X  1.0ms vs 1.0ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   1.9X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   1.7X  0.9ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   2.1X  1.0ms vs 0.5ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   1.8X  0.9ms vs 0.5ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=1   7X    0.8ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=1   14X   0.852ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=1   9X    0.828ms vs 0.087ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=1   15X   0.922ms vs 0.061ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=1   10X   0.897ms vs 0.087ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=2   1.6X  0.9ms vs 0.6ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=2   1.9X  0.5ms vs 0.2ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=2   1.7X  0.4ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=2   2.1X  0.5ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=2   1.8X  0.5ms vs 0.3ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=2   10X   0.808ms vs 0.084ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=2   10X   0.462ms vs 0.046ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=2   7X    0.429ms vs 0.062ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=2   12X   0.504ms vs 0.044ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=2   7X    0.461ms vs 0.063ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=12  4X    1.0ms vs 0.2ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=12  1.7X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=12  1.5X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=12  1.9X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=12  1.6X  0.2ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=12  12X   0.820ms vs 0.067ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=12  11X   0.438ms vs 0.041ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=12  8X    0.431ms vs 0.056ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=12  12X   0.482ms vs 0.041ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=12  8X    0.467ms vs 0.056ms

(1, 3, 600, 400) -> (224, 224)  linear          float32    num_threads=32  4X    1.0ms vs 0.3ms
(1, 3, 600, 400) -> (224, 224)  nearest         float32    num_threads=32  1.7X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest         uint8      num_threads=32  1.5X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=32  1.8X  0.2ms vs 0.1ms
(1, 3, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 600, 400) -> (224, 224)  linear          float32    num_threads=32  12X   0.824ms vs 0.070ms
(1, 1, 600, 400) -> (224, 224)  nearest         float32    num_threads=32  10X   0.443ms vs 0.044ms
(1, 1, 600, 400) -> (224, 224)  nearest         uint8      num_threads=32  7X    0.438ms vs 0.059ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   float32    num_threads=32  11X   0.479ms vs 0.045ms
(1, 1, 600, 400) -> (224, 224)  nearest-exact   uint8      num_threads=32  8X    0.470ms vs 0.059ms
----------------------------------------------------------------------------------------------------
(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=1   1.0X  4.7ms vs 4.7ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=1   2.0X  4.4ms vs 2.2ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=1   1.8X  4.3ms vs 2.5ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=1   2.1X  4.7ms vs 2.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=1   1.9X  4.6ms vs 2.5ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=1   9X    4.0ms vs 0.4ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=1   17X   4.2ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=1   11X   4.1ms vs 0.4ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=1   19X   4.6ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=1   12X   4.5ms vs 0.4ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=2   1.7X  4.7ms vs 2.7ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=2   2.1X  2.4ms vs 1.1ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=2   1.8X  2.2ms vs 1.3ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=2   2.3X  2.6ms vs 1.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=2   1.9X  2.3ms vs 1.3ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=2   15X   4.0ms vs 0.3ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=2   16X   2.3ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=2   9X    2.1ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=2   17X   2.5ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=2   10X   2.3ms vs 0.2ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=12  10X   4.7ms vs 0.5ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=12  1.9X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=12  1.7X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=12  1.9X  0.4ms vs 0.2ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=12  1.8X  0.4ms vs 0.2ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=12  41X   3.969ms vs 0.096ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=12  11X   0.545ms vs 0.051ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=12  8X    0.532ms vs 0.070ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=12  11X   0.590ms vs 0.052ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=12  8X    0.578ms vs 0.071ms

(1, 3, 800, 800) -> (500, 500)  linear          float32    num_threads=32  17X   4.7ms vs 0.3ms
(1, 3, 800, 800) -> (500, 500)  nearest         float32    num_threads=32  1.8X  0.2ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest         uint8      num_threads=32  2.0X  0.3ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=32  1.9X  0.2ms vs 0.1ms
(1, 3, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=32  1.6X  0.2ms vs 0.1ms
(1, 1, 800, 800) -> (500, 500)  linear          float32    num_threads=32  45X   4.028ms vs 0.090ms
(1, 1, 800, 800) -> (500, 500)  nearest         float32    num_threads=32  10X   0.549ms vs 0.053ms
(1, 1, 800, 800) -> (500, 500)  nearest         uint8      num_threads=32  7X    0.536ms vs 0.072ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   float32    num_threads=32  11X   0.592ms vs 0.055ms
(1, 1, 800, 800) -> (500, 500)  nearest-exact   uint8      num_threads=32  8X    0.581ms vs 0.074ms

```
</details>

Code:

<details>

I used this file which is adapted from https://github.com/pytorch/pytorch/blob/master/benchmarks/operator_benchmark/pt/interpolate_test.py

```py
import operator_benchmark as op_bench
import torch

"""Microbenchmarks for interpolate operator."""

class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last=False, mode='linear', dtype=torch.float):

        input_image = torch.randint(0, 256, size=input_size, dtype=dtype, device='cpu',
                                    requires_grad=self.auto_set())
        if channels_last:
            if input_image.ndim == 4:
                input_image = input_image.contiguous(memory_format=torch.channels_last)
            elif input_image.ndim == 5:
                input_image = input_image.contiguous(memory_format=torch.channels_last_3d)
            else:
                raise ValueError(
                    f"Can not set channels_last to the input of {input_image.ndim} dims"
                )

        align_corners = None if "nearest" in mode else False

        if mode == "linear":
            mode = {
                3: 'linear',
                4: 'bilinear',
                5: 'trilinear',
            }[input_image.ndim]

        self.inputs = {
            "input_image": input_image,
            "output_size": output_size,
            "mode": mode,
            "align_corners": align_corners,
        }

        self.set_module_name("interpolate")

    def forward(self, input_image, output_size, mode, align_corners):
        return torch.nn.functional.interpolate(input_image, size=output_size, mode=mode,
                                               align_corners=align_corners)

def make_config():
    sizes = (
        ((224, 224), (64, 64)),
        ((224, 224), (128, 128)),
        ((600, 400), (224, 224)),
        ((320, 320), (256, 256)),
        ((800, 800), (500, 500)),
    )

    attrs = []
    for (HW1, HW2) in sizes:
        attrs.append([(1, 3, *HW1), HW2])  # 3 channels
        attrs.append([(1, 1, *HW1), HW2])  # 1 channel

        attrs.append([(1, 3, *HW2), HW1])  # 3 channels
        attrs.append([(1, 1, *HW2), HW1])  # 1 channel

    config = op_bench.config_list(
        attr_names=["input_size", "output_size"],
        attrs=attrs,
        cross_product_configs={
            'channels_last': [True],
            'mode': ["linear", "nearest", "nearest-exact"],
            'dtype': [torch.float, torch.uint8]
        },
        tags=["short"],
    )

    # Need to remove instances with both torch.int and linear
    # Note: this is naaaasty
    def get_mode(l):
        for d in l:
            if "mode" in d:
                return d["mode"]
    def get_dtype(l):
        for d in l:
            if "dtype" in d:
                return d["dtype"]
    config = [l for l in config if not(get_mode(l) == "linear" and get_dtype(l) == torch.uint8)]
    return config

config = make_config()
op_bench.generate_pt_test(config, InterpolateBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```

with

```
for num_threads in 1 2 12 32; do echo "num_threads=$num_threads" && python -m pt.my_interpolate_test --iterations 1000 --omp_num_threads $num_threads ; done > $out_file
```

and this very ugly helper

```py
import re

# "main" and "new" hold the raw benchmark runner output from the main branch
# and from this PR, respectively.
with open("main") as f:
    main = f.readlines()

with open("new") as f:
    new = f.readlines()

out = []

for main_line, new_line in zip(main, new):
    if main_line.startswith("num_threads="):
        num_threads = int(main_line.split("=")[-1])
    if main_line.startswith("# Input"):
        deets = f"{main_line.strip()}, {num_threads=}"
    if main_line.startswith("Forward"):
        main_time = float(main_line.split()[-1])
        new_time = float(new_line.split()[-1])
        ratio = main_time / new_time
        fmt = ".1f" if ratio < 3 else ".0f"
        improv = f"{ratio:{fmt}}X"
        time_fmt = ",.3f" if new_time < 100 else ",.1f"
        deets = deets.strip().replace("# Input: ", "")
        deets = deets.replace(": ", "=")
        deets = deets.replace("input_size=", "")
        deets = deets.replace(", output_size=", " -> ")
        deets = deets.replace("dtype=torch.", "")
        deets = deets.replace("mode=", "")
        deets = deets.replace("channels_last=True, ", "")
        split = deets.split(",")
        size = ','.join(split[:-3])
        mode, dtype, threads = split[-3:]
        deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}"

        l = f"{deets}  {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms"
        out.append(l)

def key(s):
    # s = ''.join(s.split()[1:]) # remove "N.nX" part
    num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),)

    input_shape, output_shape = re.findall(r"\(.*?\)", s)  # raw string avoids an invalid escape warning
    input_shape = input_shape[1:-1]  # remove parenthesis
    input_HW = tuple(int(x) for x in input_shape.split(",")[-2:])
    input_C = (-int(input_shape.split(",")[1]),)

    output_HW = tuple(int(x) for x in output_shape[1:-1].split(","))
    is_downsample = (output_HW[0] < input_HW[0],)
    if "linear" in s:
        mode = "linear"
    elif "nearest-exact" in s:
        mode = "nearest-exact"
    else:
        assert "nearest" in s
        mode = "nearest"
    mode = (mode,)
    return is_downsample + input_HW + output_HW + num_threads + input_C + mode

for i, l in enumerate(sorted(out, key=key)):
    if i % 10 == 0 and i % 40 != 0:
        print()
    if i % 40 == 0:
        print("-" * 100)
    print(l)

```

</details>

Closes https://github.com/pytorch/pytorch/issues/83840

When this is merged we should be able to remove some hack in vision as well https://github.com/pytorch/vision/pull/6661 (CC @vfdev-5 @datumbox )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86361
Approved by: https://github.com/vfdev-5, https://github.com/datumbox, https://github.com/fmassa
2022-10-07 07:52:36 +00:00
70c6a988d6 Fix the performance issue that the for-loop before ExternalCall could not be parallelized. (#85056)
Currently, NNC only parallelizes the loop statements of the graph outputs. This logic can bypass some loop statements that could be parallelized. Take the example below and suppose the output of `ExternalCall` is also the output of the NNC fusion group. The current [parallel logic](https://github.com/pytorch/pytorch/pull/85056/files#diff-9a11174c26e4b57ab73e819520122bc314467c72962f3a5b79e7400ea3c4bbe5L781-L785) only tries to parallelize the `ExternalCall` and bypasses `stmt1` and `stmt2`.

```c++
stmt1: For:
stmt2:   For:
stmt3: ExternalCall
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85056
Approved by: https://github.com/frank-wei, https://github.com/bertmaher
2022-10-07 07:36:28 +00:00
2110c89443 Revert "Revert "Revert "SymIntify cat and narrow (#86191)"" (#86289)"
This reverts commit e778fbf5197638d6196c5d5acf6f9588a1e83368.

Reverted https://github.com/pytorch/pytorch/pull/86289 on behalf of https://github.com/seemethere due to Fails internal tests see: https://www.internalfb.com/intern/sandcastle/job/27021598552487548/
2022-10-07 05:20:36 +00:00
6c604c9262 [CuDNN v8 API][Quantization]fix alignment function in quantized cuDNN V8 path (#86253)
This bug was in the native cuDNN V8 API integration and was fixed a while ago, but the change was never ported here.

Previously the returned alignment could be twice the actual alignment of the data if the alignment was smaller than 16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86253
Approved by: https://github.com/dzdang
2022-10-07 05:13:37 +00:00
455b873919 Introduce a match filter for SubgraphRewriter (#86430)
This PR introduces an interface for a user-defined function that filters the matches in SubgraphRewriter. The function has the following signature:

callable(match: InternalMatch, original_graph: Graph, pattern_graph: Graph) -> bool

This filter is applied after SubgraphMatcher returns the matches, and before replacement takes place.
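
For illustration, a hedged sketch of what such a filter function could look like; the in-place-relu check is an assumption made up for the example, and how the filter gets registered with the rewriter is omitted here:

```py
import torch
from torch.fx import Graph, Node
from torch.fx.passes.utils.matcher_utils import InternalMatch

def reject_inplace_relu(match: InternalMatch, original_graph: Graph, pattern_graph: Graph) -> bool:
    # Keep the match only if none of the matched original-graph nodes is an in-place relu.
    for node in match.nodes_map.values():
        if isinstance(node, Node) and node.op == "call_function" and node.target is torch.relu_:
            return False
    return True
```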
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86430
Approved by: https://github.com/jerryzh168
2022-10-07 05:09:40 +00:00
b5fd845fdf [torchdynamo hash update] update the pinned torchdynamo hash (#86399)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned torchdynamo hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86399
Approved by: https://github.com/pytorchbot
2022-10-07 04:44:19 +00:00
10aead9adc [MPS] Cache multinomial_with_replacement graph (#86437)
Reuse existing RandomCachedGraph to keep RNG state as part of the graph
Add `CreateCachedGraphAs` convenience wrapper
Addresses https://github.com/pytorch/pytorch/pull/86342#pullrequestreview-1132197848
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86437
Approved by: https://github.com/kulinseth
2022-10-07 04:39:30 +00:00
9ceadcadb2 Fix unfold backward decomp aliasing for 0 dim input (#86428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86428
Approved by: https://github.com/ngimel, https://github.com/ezyang
2022-10-07 03:55:31 +00:00
b14f1d7bb8 Add Skip List for Aten Ops that are fused in nvFuser. (#86101)
This Skip List (tuple) is added under the nvprims context manager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86101
Approved by: https://github.com/jjsjann123, https://github.com/mruberry
2022-10-07 03:55:13 +00:00
c5a4844085 Xformer SDP forward/backward kernel (#86157)
# Summary
Include xformer kernel code and make header updates to successfully build. Need to update the kernel calling code and dispatch system to clean this up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86157
Approved by: https://github.com/cpuhrsch
2022-10-07 03:52:46 +00:00
ca39e3679f [vision hash update] update the pinned vision hash (#86173)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86173
Approved by: https://github.com/pytorchbot
2022-10-07 03:19:31 +00:00
2fec853c87 Fix SubgraphMatcher for case of no anchor found (#86421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86421
Approved by: https://github.com/jerryzh168
2022-10-07 02:05:42 +00:00
b73f0e98d5 Fix cond tests after CI was disabled for a bit (#86321)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86321
Approved by: https://github.com/zou3519
2022-10-07 01:46:51 +00:00
ca69ddb4f7 Fix broadcasting to implicit leading dimensions in torch.where on MPS (#86240)
Fixes #86239

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86240
Approved by: https://github.com/kulinseth
2022-10-07 01:38:57 +00:00
0e30da3f2f [refactor] Renaming ao.sparsity to ao.pruning (#84867)
`Sparsity` as a term doesn't reflect the tools developed by AO. `torch/ao/sparsity` also has utilities for structured pruning, which internally we have always referred to as just "pruning". To avoid any confusion, we renamed `Sparsity` to `Prune`. We will not be introducing backward compatibility, as so far this toolset has been kept under silent development.

This change will reflect the changes in the documentation as well.

**TODO:**
- [ ] Change the tutorials
- [ ] Confirm no bc-breakages
- [ ] Reflect the changes in the trackers and RFC docs

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84867
Approved by: https://github.com/supriyar
2022-10-07 00:58:41 +00:00
9a170b24f6 Cleanup PT-D imports (#85781)
Summary:
The flow logic around torch.dist imports results in a large number of pyre errors (hundreds); it would be preferable to raise on import rather than fail silently.

Con: Some percentage (macOS?) of users may have notebooks that import PT-D, although that share is likely small, since any attempt to call parts of the library would just fail...

TODO: assuming this is OK, will remove the tens to hundreds of unused pyre-ignores that are no longer required.
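
A minimal sketch of the raise-on-import idea, for illustration only; the guard location and the message are assumptions, not the actual patch:

```py
import torch.distributed as dist

# If the distributed backend was not compiled in, fail loudly up front
# instead of letting later attribute lookups fail in confusing ways.
if not dist.is_available():
    raise ImportError("torch.distributed is not available in this build of PyTorch")
```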

Test Plan: existing unit tests

Differential Revision: D39842273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85781
Approved by: https://github.com/mrshenli
2022-10-07 00:29:32 +00:00
a241963837 [nll_loss] Avoid unnecessary type casts (#86086)
follow-up #85395

`AT_DISPATCH_NLL_LOSS_INDEX_TYPES` should not be removed in favor of #59765 and there's a testcase 99ca25e6eb/test/test_nn.py (L16832)

Besides the dispatcher, I wanted to sanity-check `int64_t ignore_index` because `int64_t` can be inappropriate considering that `target` can be `Byte`. However, given that the default value is -100, as in 0a75c42f36/aten/src/ATen/native/native_functions.yaml (L9949), it's not easy to add a check while keeping backward compatibility. Thus I decided not to add a check.

cc @lezcano @t-vi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86086
Approved by: https://github.com/lezcano
2022-10-07 00:10:27 +00:00
2232db7fc1 Replacement is irrelevant for 1-sample multinomial (#86342)
So use fast path, both on CPU and on MPS

Also, remove some spurious copy-n-paste checks from MPS codepath

CUDA already has this optimization, see
dc9c507d24/aten/src/ATen/native/cuda/MultinomialKernel.cu (L355-L356)
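
To make the claim concrete, a tiny sanity check (illustrative only): with a single draw there is nothing to put back, so `replacement` cannot change the sampled distribution.

```py
import torch

weights = torch.tensor([0.1, 0.2, 0.7])
no_rep   = torch.stack([torch.multinomial(weights, 1, replacement=False) for _ in range(10_000)])
with_rep = torch.stack([torch.multinomial(weights, 1, replacement=True)  for _ in range(10_000)])
# Both empirical distributions match the normalized weights up to sampling noise.
print(torch.bincount(no_rep.flatten(), minlength=3) / 10_000)
print(torch.bincount(with_rep.flatten(), minlength=3) / 10_000)
```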

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86342
Approved by: https://github.com/ngimel
2022-10-07 00:08:42 +00:00
5a8b07de75 Declare public dependencies on libshm (#82694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82694
Approved by: https://github.com/malfet
2022-10-07 00:01:25 +00:00
08e3999fa4 Merge more symbolic meta kernels and symint changes from branch (#86334)
* symintify split_with_sizes, dropout, fused_fake_obs_quant; meta for padding_2d ops
* add meta_bernoulli_
* meta kernel for at::gather
* get pytorch_struct to pass: meta for scatter_add, fix backward
* symintify split ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86334
Approved by: https://github.com/ezyang
2022-10-06 23:29:04 +00:00
3af0eafea6 Release 1.13: Bump nightly version 1.13->1.14 (#86296)
Release 1.13:  Bump nightly version 1.13->1.14

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86296
Approved by: https://github.com/seemethere, https://github.com/malfet
2022-10-06 23:26:58 +00:00
5ed75ec1d7 Fix SparseAdam consuming iterator (#86210)
Fixes https://github.com/pytorch/pytorch/issues/86209
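
A generic sketch of the pitfall behind that issue, as far as the title suggests (assumed for illustration, not the actual patch): validating a params *generator* consumes it, so the rest of the constructor then sees an empty iterable; materializing with `list(params)` first avoids this.

```py
def make_optimizer(params):
    # Hypothetical validation pass: this exhausts a generator...
    has_sparse = any(getattr(p, "is_sparse", False) for p in params)
    # ...so the "real" consumer below now sees nothing.
    return list(params), has_sparse

gen = (p for p in [object(), object()])
remaining, _ = make_optimizer(gen)
print(len(remaining))  # 0 -- fix by doing params = list(params) before validating
```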
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86210
Approved by: https://github.com/cpuhrsch
2022-10-06 23:11:25 +00:00
f0977c4658 [FSDP] Doc to explain running submodules (#86343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86343
Approved by: https://github.com/awgu
2022-10-06 23:10:23 +00:00
3db8ddcac1 [FSDP] Fix clip_grad_norm for CPU offload (#86337)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86337
Approved by: https://github.com/awgu
2022-10-06 23:10:23 +00:00
adfd8f3823 [FSDP] assert to runtime error (#86336)
Prefer raising an error over `assert`, which should mostly indicate a developer bug; a user can trigger this error path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86336
Approved by: https://github.com/awgu
2022-10-06 23:10:21 +00:00
7a411952fb CheckpointSequential support non-reentrant (#86331)
Closes https://github.com/pytorch/pytorch/issues/86328

Adds `use_reentrant` argument to `checkpoint_sequential`.
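
A small usage sketch of the new argument; the segment count and module sizes are arbitrary choices for the example:

```py
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16), nn.ReLU())
x = torch.randn(4, 16, requires_grad=True)

# Split the sequence into 2 checkpointed segments and use the non-reentrant path.
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()
```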

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86331
Approved by: https://github.com/zhaojuanmao, https://github.com/albanD
2022-10-06 23:10:18 +00:00
3037f3d710 Docs: fix typo (#86273)
Typo in torch.fx.Interpreter.fetch_attr docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86273
Approved by: https://github.com/kit1980
2022-10-06 22:38:50 +00:00
233d6f195a Revert "Fix memory leak in _LRScheduler.step() (#85602)"
This reverts commit eb32330d6b3709dc8910eb298d8802fbca57b05c.

Reverted https://github.com/pytorch/pytorch/pull/85602 on behalf of https://github.com/albanD due to newly added test is flaky
2022-10-06 22:02:02 +00:00
bf74679884 Fix for binary upload step, use bash shell rather than default sh (#86382)
This fixes the issue during upload:

```
Run # reference ends with an RC suffix
  # reference ends with an RC suffix
  if [[ ${GITHUB_REF_NAME} = *-rc[0-9]* ]]; then
    echo "UPLOAD_CHANNEL=test" >> "$GITHUB_ENV"
  fi
  shell: sh -e {0}
/__w/_temp/f045f5d8-ddb.sh: 2: [[: not found
```

Test failure:
https://github.com/pytorch/pytorch/actions/runs/3199561387/jobs/5225448559

Test success:
https://github.com/pytorch/pytorch/actions/runs/3199573560/jobs/5225480345

Error started when we switched to: continuumio/miniconda3:4.12.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86382
Approved by: https://github.com/weiwangmeta
2022-10-06 21:55:33 +00:00
facf210f9a [ao] fixing public v private for qconfig.py (#86026)
Summary: no changes; just removed the exception for this file, since someone
had already fixed the actual file.

Test Plan: python test/test_public_bindings.py

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86026
Approved by: https://github.com/jerryzh168
2022-10-06 21:42:44 +00:00
7c5e07f87b [kineto] guard global observer init against Edge profiler (#86347)
Summary:
Looks like Sandcastle CI didn't cover any concrete mobile CI (cc: kimishpatel, I'd assume we have a ton of mobile tests on GitHub?). This is failing on Oculus with a failure similar to the Mac one (not sure if this is an ARM thing). Either way, on-demand tracing should not be enabled on these platforms, so disable it completely.

In the future, we should have a runtime check on this for even safer guarding.

Test Plan:
Set up Hollywood via P536072492

## Before
crash on mutex. likely SIOF
```
FORTIFY: pthread_mutex_lock called on a destroyed mutex (0x5d7e298b08)
*** Aborted at 1665017107 (Unix time, try 'date -d 1665017107') ***
*** Signal 6 (SIGABRT) (0xeca) received by PID 3786 (pthread TID 0x785bd1eed0) (linux TID 3786) (maybe from PID 3786, UID 0) (code: -1), stack trace: ***
(error retrieving stack trace)
```

## After
Redacted in the top but the test passes without the crash
P536101962

Differential Revision: D40129840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86347
Approved by: https://github.com/aaronenyeshi
2022-10-06 21:36:15 +00:00
bc919ac796 [torch.ao.quantization] include torch.qint32 for static quant (#86345)
Summary: include `torch.qint32` in `activation_is_statically_quantized` and `get_quant_type` so that fake quantize with `dtype=torch.qint32` won't be skipped

Test Plan: updated `test_custom_module_class`

Differential Revision: D40128178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86345
Approved by: https://github.com/jerryzh168
2022-10-06 20:05:56 +00:00
08780229df Two small improvements to references (#86371)
As per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86371
Approved by: https://github.com/mruberry
2022-10-06 19:31:11 +00:00
795906f207 Add total GPU memory utilization (#86250)
Although we already have per-process GPU memory usage, I'm curious to see what the number for `gpu_utilization.memory` is, per https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html. Also fixing a tiny typo that has been bugging me for a while: `total_gpu_utilizaiton`.
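
For reference, a minimal sketch of reading that struct from Python via the pynvml bindings (assuming the `nvidia-ml-py` package is installed); `.memory` is the memory-utilization percentage referenced above.

```py
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # nvmlUtilization_t: .gpu and .memory percentages
print(f"gpu={util.gpu}% memory={util.memory}%")
pynvml.nvmlShutdown()
```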

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86250
Approved by: https://github.com/ZainRizvi
2022-10-06 18:53:59 +00:00
1059d3b52d Make mergebot message clearer when starting a new merge (#86311)
Modifying how the merge started message appears to make it more readable.
Also removing some deprecated v1 land checks messages

Old:
<img width="917" alt="image" src="https://user-images.githubusercontent.com/4468967/194150650-c9e384a3-d13c-40aa-975d-f43853790603.png">

New:
<img width="933" alt="image" src="https://user-images.githubusercontent.com/4468967/194151507-a5900cd5-5711-4cab-9447-c2cc6ed0d7b5.png">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86311
Approved by: https://github.com/malfet, https://github.com/huydhn
2022-10-06 18:47:07 +00:00
6b295cd046 Enable autograd on Linear with sparse COO weight (#86302)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86302
Approved by: https://github.com/cpuhrsch
2022-10-06 18:39:31 +00:00
8f2c2167d4 Support autograd on sparse_mm in full. (#86301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86301
Approved by: https://github.com/cpuhrsch
2022-10-06 18:39:31 +00:00
88b882cd1c Support sum on a sparse COO tensor. (#86300)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86300
Approved by: https://github.com/cpuhrsch
2022-10-06 18:39:28 +00:00
f104490d63 Support autograd on Linear with sparse compressed weight. (#86137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86137
Approved by: https://github.com/cpuhrsch
2022-10-06 18:39:25 +00:00
fc21cc82fc Enable sparse_dim() and dense_dim() methods for Strided tensors (#86203)
The reason for enabling sparse/dense_dim() for strided tensors is to have more meaningful error messages:
For instance, compare
```
NotImplementedError: Could not run 'aten::sparse_dim' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::sparse_dim' is only available for these backends: [SparseCPU, SparseCUDA, SparseMeta, SparseCsrCPU, SparseCsrCUDA, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradHIP, AutogradXLA, AutogradMPS, AutogradIPU, AutogradXPU, AutogradHPU, AutogradVE, AutogradLazy, AutogradMeta, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, AutogradNestedTensor, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
```
[master] vs
```
RuntimeError: addmm: matrices expected, got 0D tensor
```
[this PR] where the latter message gives a hint of which function is to blame for dealing with unexpected inputs.
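
A small illustration of the two methods; the strided-tensor return values below are assumptions about the post-PR behavior:

```py
import torch

# Sparse COO tensor: 2 sparse dims, 0 dense dims.
i = torch.tensor([[0, 1], [1, 0]])
v = torch.tensor([1.0, 2.0])
s = torch.sparse_coo_tensor(i, v, (2, 2))
print(s.sparse_dim(), s.dense_dim())   # 2 0

# With this PR a strided tensor no longer hits the NotImplementedError above;
# assumed behavior: 0 sparse dims, all dims counted as dense.
d = torch.randn(2, 2)
print(d.sparse_dim(), d.dense_dim())   # 0 2 (assumed)
```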

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86203
Approved by: https://github.com/cpuhrsch
2022-10-06 18:39:22 +00:00
bed1ece9c5 [torchdynamo hash update] update the pinned torchdynamo hash (#86306)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned torchdynamo hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86306
Approved by: https://github.com/pytorchbot
2022-10-06 17:34:29 +00:00
eb32330d6b Fix memory leak in _LRScheduler.step() (#85602)
Fixes #85410

This diff removed the cyclic references in `_LRScheduler.step()`.
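
For context, a minimal sketch of the `weakref` pattern for breaking such a cycle (illustrative of the general idea only, not the exact patch):
```python
import weakref

def patch_step(optimizer):
    # Hold the optimizer only through a weak reference so the wrapper installed
    # on `optimizer.step` does not keep a scheduler <-> optimizer cycle alive.
    opt_ref = weakref.ref(optimizer)
    unbound_step = type(optimizer).step

    def step(*args, **kwargs):
        opt = opt_ref()
        if opt is not None:
            return unbound_step(opt, *args, **kwargs)

    optimizer.step = step
```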
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85602
Approved by: https://github.com/albanD
2022-10-06 17:07:36 +00:00
b8b564c908 Ensure the minimum NVIDIA driver version to be 515.57 for CUDA 11.7 (#86344)
This does 2 things:

* Ensure that `nvidia-driver-latest-dkms` package is removed if it's installed. This allows the installation to go forward without the below error when using the standard installation script from S3:

```
(Answer: Abort installation)
ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.
```

* Not skipping the installation if a driver other than `515.57` exists, to avoid any unexpected behavior when running with a different driver version. This partly addresses the recent issue in https://github.com/pytorch/pytorch/issues/85778, in which `510.60.02` was installed instead (not sure from where) and failed the CUDA 11.7 test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86344
Approved by: https://github.com/atalman, https://github.com/malfet
2022-10-06 16:47:45 +00:00
0c148a4b5f Remove extra bracket, update header definition (#86317)
Summary: Fix compilation error

Test Plan: Unit test

Reviewed By: malfet, mikaylagawarecki

Differential Revision: D40108369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86317
Approved by: https://github.com/malfet
2022-10-06 16:28:05 +00:00
fb9b96593c Use FindCUDAToolkit to find cuda dependencies (#82695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82695
Approved by: https://github.com/malfet
2022-10-06 15:43:39 +00:00
fa799132d8 [MPS] Better error message for slow_conv2d_forward (#86303)
The error `Could not run 'aten::_slow_conv2d_forward' with arguments from the 'MPS' backend.` is very misleading, as this method is usually only invoked when the input is on the CPU but the weights are on the MPS device.
Raise a more user-friendly error in this case.
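
The situation the clearer error targets (sketch; requires a machine with an MPS device):
```python
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3).to("mps")
x = torch.randn(1, 3, 16, 16)  # input accidentally left on the CPU
conv(x)  # now raises a friendlier device-mismatch error instead of the dispatch error above
```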

Add a test to `test_invalid_conv2d` to check for these conditions.

Fixes https://github.com/pytorch/pytorch/issues/77931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86303
Approved by: https://github.com/kulinseth
2022-10-06 15:38:57 +00:00
4d7728890b Inline asIntArrayRef (#86350)
I was benchmarking and this is worth maybe 5% on at::empty, but it's basically
free so we should do it.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86350
Approved by: https://github.com/albanD
2022-10-06 14:55:03 +00:00
cebf08afb2 [Quant] Remove weight from DTypeConfig for non-weighted ops (#86335)
Summary: Weight dtypes should be specified only for weighted
ops like conv and linear. This commit removes weight dtypes
from the DTypeConfigs used in binary ops and fixed qparams ops.

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Reviewers: jerryzh168, vkuzo

Subscribers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86335
Approved by: https://github.com/vkuzo
2022-10-06 13:30:59 +00:00
cdbffa7f66 🦊 [AI Accelerators] Consolidate native_layer_norm for nested tensor (#86295)
Summary: In order to make the layer normalization implementation for nested tensors public, it needs to be generalized to accept a normalized_shape argument instead of assuming it to be the last dimension of the nested_tensor. This commit does that, as well as adding extra unit tests to ensure the implementation is correct.

Test Plan:
All unit tests designed to test different ways of using the function work:

`buck test //caffe2/test:nested -- test_layer_norm`

Differential Revision: D40105207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86295
Approved by: https://github.com/drisspg
2022-10-06 13:10:25 +00:00
85c3b745f6 Conditionally build the TestApp benchmark based on lite interpreter (#86314)
The TestApp benchmark was recently re-added; however, it seems it only builds when PyTorch is built with the lite interpreter. This diff adds a macro to compile out the benchmark when PyTorch is built as full JIT. This should fix our full-JIT simulator nightly builds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86314
Approved by: https://github.com/malfet
2022-10-06 10:08:54 +00:00
936e93058b Delete torch::deploy from pytorch core (#85953)
As we have migrated torch::deploy over to https://github.com/pytorch/multipy, we can now delete it from pytorch core as ongoing development will happen there.

This PR was created due to syncing issues with https://github.com/pytorch/pytorch/pull/85443 which is where the review history can be found.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85953
Approved by: https://github.com/seemethere, https://github.com/malfet
2022-10-06 07:20:16 +00:00
27c3fb0386 [Profiler] trace verbose=false by default (#86263)
Summary:
- Added config option to remove 'Call stack' field from trace file (#84982)
- Change default value to `false`

Test Plan:
- `experimental_config=_ExperimentalConfig(verbose=True)` will add the 'Call stack' field back to the trace file (see the sketch below).
- CI tests
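
A minimal sketch of turning the field back on (assumes the private `_ExperimentalConfig` API referenced above):
```python
import torch
from torch._C._profiler import _ExperimentalConfig
from torch.profiler import profile

with profile(with_stack=True,
             experimental_config=_ExperimentalConfig(verbose=True)) as prof:
    torch.randn(8, 8) @ torch.randn(8, 8)
prof.export_chrome_trace("trace_with_call_stack.json")  # entries include 'Call stack'
```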

Differential Revision: D40092377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86263
Approved by: https://github.com/aaronenyeshi
2022-10-06 06:32:25 +00:00
a117fde86f [Profiler] Apply TensorMetadata for Optimizer and nnModule (#86047)
Summary: Use the `TensorMetadata` struct when saving tensor info from Optimizer and nnModule.

Test Plan: buck run mode/opt //caffe2/test:profiler

Reviewed By: chaekit

Differential Revision: D39682205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86047
Approved by: https://github.com/chaekit, https://github.com/robieta
2022-10-06 06:18:56 +00:00
fd5085c445 Symintify getitem and add the required helper functions (#86207)
Note that this might not cover every use of the function (we know it doesn't), but this is enough to get a few models passing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86207
Approved by: https://github.com/ezyang, https://github.com/Chillee, https://github.com/bdhirsh
2022-10-06 04:46:19 +00:00
0a75c42f36 Workaround MSVC ICE due to constexpr char* template argument (#86288)
Test Plan:
Lease a Windows sandcastle https://www.internalfb.com/intern/wiki/Windows_Platform_Engineering/Leasable_VM_-_User_Guide/
and run:

```
buck build arvr/mode/win/opt //xplat/caffe2:_C_impl
```

Differential Revision: D40109191

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86288
Approved by: https://github.com/albanD, https://github.com/malfet
2022-10-06 04:11:05 +00:00
45f03d6948 Add at::symint:: namespace for ease of templated functions (#86329)
Our prevailing strategy for symbolic shapes in C++ is to only
write the SymInt version of the code, and pay a slight performance
tax from not knowing if it is symbolic or not.  However, there are
some fastpath functions where this tax is unacceptable, and we want
to specialize for the int case.  Sometimes, it is easy to template
the function; but when the function involves Tensors, it is not,
because the functions you may want to call are not templated,
e.g., t.view vs t.view_symint

This PR adds an at::symint:: namespace which contains templated
functions for all functions in PyTorch which you can use in this
way.  To show this works, I refactored sum_to to stop incorrectly
reinterpret casting and instead use a template.  Instead of
t.sizes(), we call at::symint::sizes<T>(t), and so forth.

The template functions are SFINAE'd using a template argument that
is not otherwise used. As such, deduction is impossible. Typically, deduction
is hard anyway, because many of the constructors are ambiguous (this
is why we split foo and foo_symint in the first place). So you must pass
a template argument to these functions.

These functions are codegened into Functions.h so they are subject
to per-operator headers.  This matters most for methods, which likely
didn't include the per-operator header, so you will have to add an
include in that case.  We never generate method variants for these.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86329
Approved by: https://github.com/bdhirsh, https://github.com/voznesenskym
2022-10-06 04:09:17 +00:00
ea21a982f2 Reduce warning suppression by just disabling pytest warnings plugin (#86255)
Fixes https://github.com/pytorch/pytorch/issues/85626

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86255
Approved by: https://github.com/lezcano, https://github.com/albanD
2022-10-06 04:08:50 +00:00
adf5919720 Add option to record C++ backtraces in _record_memory_history (#86145)
I used this to debug https://github.com/pytorch/pytorch/issues/86136, so it is useful. The implementation is not very fast, so it is not enabled by default.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86145
Approved by: https://github.com/albanD, https://github.com/zdevito
2022-10-06 04:07:37 +00:00
97d6b5bbf8 Refactor _cuda_recordMemoryHistory to use pybind11 (#86139)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86139
Approved by: https://github.com/albanD
2022-10-06 04:07:37 +00:00
d04889323e Add Context Manager for Disabling Multithreading in Backwards, use in aot autograd (#86245)
We were running into a few issues with running multithreaded backwards in aot_autograd, such as https://github.com/pytorch/pytorch/issues/86136, and `FakeTensorMode` getting into a weird state as a result of functions not executing completely sequentially. The multithreaded backwards is lost in translation when we trace out the backwards anyway, and it adds a lot of additional complexity.
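
A hedged sketch of using the context manager (the `torch.autograd.set_multithreading_enabled` name is assumed from the current API; usage here is illustrative):
```python
import torch

x = torch.randn(4, requires_grad=True)
with torch.autograd.set_multithreading_enabled(False):
    # The backward pass below runs single-threaded within this scope.
    x.sin().sum().backward()
print(x.grad)
```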

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86245
Approved by: https://github.com/albanD, https://github.com/yf225
2022-10-06 03:27:42 +00:00
237316aa1d PNP: early FX numeric suite tool to quantize each layer N times (#80521)
Summary:

This PR is an early prototype of a tool to quantize each layer of a model
N times, with N qconfigs each. We follow the design agreed upon in
https://fburl.com/gdoc/e1gaq3ih .

Current API:

```
m = M().eval()
example_input = (torch.randn(2, 2),)
qconfig_mappings = [
    QConfigMapping().set_global(torch.quantization.default_qconfig),
    QConfigMapping().set_global(torch.quantization.default_dynamic_qconfig),
]
backend_config = get_native_backend_config()

msp = prepare_n_shadows_model(
    m, example_input, qconfig_mappings, backend_config)

for _ in range(2):
    msp(*example_input)

msq = convert_n_shadows_model(msp)
msq(*example_input)

results = extract_results_n_shadows_model(msq)
print_comparisons_n_shadows_model(results)

// example output

subgraph_idx    ref_node_name      best_idx        1        2
--------------  ---------------  ----------  -------  -------
subgraph_0      fc1                       2  42.0834  42.6279
subgraph_1      fc2                       2  43.7259  50.0593
```

Test plan:

```
python test/test_quantization.py -k test_n_shadows
```

Differential Revision: [D37650332](https://our.internmc.facebook.com/intern/diff/D37650332)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80521
Approved by: https://github.com/jerryzh168, https://github.com/andrewor14
2022-10-06 02:30:45 +00:00
b233d83471 make torch.histc ignore NaNs on CPU (#85870)
Summary: CUDA `torch.histc` already ignores NaNs.

Test Plan: unittest added

Differential Revision: D39911272

fix https://github.com/pytorch/pytorch/issues/85853
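
A small illustration of the new CPU behavior (assuming this change; expected output shown as a comment):
```python
import torch

t = torch.tensor([0.5, float("nan"), 2.5])
# The NaN entry is skipped on CPU, matching the existing CUDA behavior.
print(torch.histc(t, bins=2, min=0.0, max=3.0))  # tensor([1., 1.])
```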

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85870
Approved by: https://github.com/ngimel
2022-10-06 01:09:00 +00:00
ddec1eea05 [Static Runtime] Block linalg_svdvals codegen & run codegen script (#85983)
Summary:
The test is causing issues:
```
terminate called after throwing an instance of 'std::runtime_error'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
    graph(%A: Tensor, %driver: str?):
        %bias: None = prim::Constant()
        %ret = aten::linalg_svdvals(%A, %driver)
               ~~~~ <--- HERE
        %cloned = aten::clone(%ret, %bias)
        return (%cloned)
RuntimeError: torch.linalg.svd: keyword argument `driver=` is only supported on CUDA inputs with cuSOLVER backend.
```

Just block the op and re-run the codegen script to remove everything and update the generated ops.

Test Plan: Existing tests

Differential Revision: D39973860

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85983
Approved by: https://github.com/xuzhao9, https://github.com/tenpercent
2022-10-06 01:07:40 +00:00
bebd162249 Fix doc of DDP (#86244) (#86256)
[ghstack-poisoned]

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86256
Approved by: https://github.com/rohan-varma
2022-10-06 00:48:56 +00:00
020f2b2c0b add myself for dynamic shapes PR review (#86292)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86292
Approved by: https://github.com/albanD
2022-10-06 00:34:34 +00:00
dc9c507d24 add nominal support for int32 indices in index/index_put ops (#86309)
Currently the index_select/index_add decompositions decompose to `index` or `index_put` ops. The problem with this is that `index_select` and `index_add` accept int32 indices while `index` doesn't, which leads to an error in the meta function for those decompositions. This PR adds non-performant support for int32 indices to the `index` operations to allow the decompositions to go through.
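
For illustration, the kind of call the decompositions now need to handle (`index_select` accepts int32 indices, so its decomposition to `index` must tolerate them too):
```python
import torch

x = torch.arange(6.0).reshape(2, 3)
idx = torch.tensor([0, 2], dtype=torch.int32)  # int32, not int64
print(torch.index_select(x, 1, idx))
# tensor([[0., 2.],
#         [3., 5.]])
```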

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86309
Approved by: https://github.com/lezcano
2022-10-05 23:59:16 +00:00
e8b0bea677 Rename fromIntArrayRef to fromIntArrayRefSlow, audit call sites (#86235)
Some of them are known to be non-negative; I've revised them accordingly.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86235
Approved by: https://github.com/albanD
2022-10-05 23:11:01 +00:00
168ba066e3 Revert "Symintify getitem and add the required helper functions (#86207)"
This reverts commit 17addb307ee9a4d12ad6918e90358a9a47a4f12b.

Reverted https://github.com/pytorch/pytorch/pull/86207 on behalf of https://github.com/malfet due to Broke lint, by double-registering `meta_index_put`, but no CI was run during the outage
2022-10-05 22:42:56 +00:00
be4e43c7d0 Remove DataParallel remnants from DDP doc (#86221)
As @aazzolini pointed out, the docstring is incorrect and probably a vestige of DP / single-process multi-device mode in DDP.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86221
Approved by: https://github.com/aazzolini
2022-10-05 22:30:02 +00:00
9e1a431220 Mark ctc_loss with dynamic_output_shape (#86293)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86293
Approved by: https://github.com/eellison
2022-10-05 22:26:50 +00:00
0e5a27fb8d Fix horribly double truncation bug in Scalar (#86304)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86304
Approved by: https://github.com/albanD
2022-10-05 22:24:17 +00:00
73777d8a2b [ao] fixing public v private for quantization_mappings.py (#86025)
Summary: no significant changes, just added __all__

Test Plan: python test/test_public_bindings.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86025
Approved by: https://github.com/jerryzh168
2022-10-05 22:12:03 +00:00
28a5cd9480 [ao] fixing public v private for quantize_jit.py (#86024)
Summary: just needed to add __all__

Test Plan: python test/test_public_bindings.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86024
Approved by: https://github.com/jerryzh168
2022-10-05 22:11:43 +00:00
17addb307e Symintify getitem and add the required helper functions (#86207)
Note that this might not cover every use of the function (we know it doesn't), but this is enough to get a few models passing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86207
Approved by: https://github.com/ezyang
2022-10-05 21:19:00 +00:00
b8895df8db Fix black binary again for debug python (#86275)
The `--no-binary` flag was not ported when moving from black only to ufmt.
This adds it back.

This is to work around the fact that the black binary hard-crashes when running with a debug Python build, so it needs to be compiled from source.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86275
Approved by: https://github.com/bdhirsh, https://github.com/malfet
2022-10-05 21:08:40 +00:00
e778fbf519 Revert "Revert "SymIntify cat and narrow (#86191)"" (#86289)
This reverts commit fc94a2115b31dfe7a0d8f28eb4f5ed532c4f0792.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86289
Approved by: https://github.com/wconstab
2022-10-05 20:51:28 +00:00
089a64e99e Install c10d headers with absolute path (#86257)
https://github.com/pytorch/pytorch/pull/85780 updated all c10d headers in pytorch to use absolute paths, following the other distributed components. However, the headers were still copied to `${TORCH_INSTALL_INCLUDE_DIR}/torch`, so external extensions still had to reference the c10d headers as `<c10d/*.h>`, making the usage inconsistent (the only exception was c10d/exception.h, which was copied to `${TORCH_INSTALL_INCLUDE_DIR}/torch/csrc/distributed/c10d`).

This patch fixes the installation step to copy all c10d headers to `${TORCH_INSTALL_INCLUDE_DIR}/torch/csrc/distributed/c10d`, so that external extensions can consistently reference c10d headers by their absolute path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86257
Approved by: https://github.com/kumpera
2022-10-05 20:02:05 +00:00
b67e022833 Fix ref / decomposition index_add (#86266)
The decomposition of `index_add` was using `slice(None)`, when it should
use just `None`.

The reference for index_add was also wrong, as `x[idx] += t` does not
use an atomic add, so it does not work when several entries of `idx` point to the same location.

This PR adds extra reference inputs to help test for this.
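
An illustration of the pitfall the fixed reference avoids (outputs are what eager PyTorch produces):
```python
import torch

idx = torch.tensor([0, 0, 2])
t = torch.ones(3)

y = torch.zeros(3)
y[idx] += t              # duplicated index 0 is written only once
print(y)                 # tensor([1., 0., 1.])

z = torch.zeros(3)
z.index_add_(0, idx, t)  # true accumulation
print(z)                 # tensor([2., 0., 1.])
```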

Fixes https://github.com/pytorch/torchdynamo/issues/1356
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86266
Approved by: https://github.com/ngimel
2022-10-05 19:59:15 +00:00
14db44ad72 [ao] fixing public v private for quantize.py (#86023)
Summary: just needed to add __all__

Test Plan: python test/test_public_bindings.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86023
Approved by: https://github.com/jerryzh168
2022-10-05 19:40:42 +00:00
c21caff876 [ao] correctly set public v private for fake_quantize.py (#86022)
Summary: the biggest issue was that the constructors for the fake_quantize classes use custom partials that live in the observer module, so the module for these needed to be set correctly in the constructor classmethod.

Test Plan: python test/test_public_bindings.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86022
Approved by: https://github.com/jerryzh168
2022-10-05 19:30:50 +00:00
3b1ec7511e Optimize is_symbolic test and some refactor (#86230)
Our SymInt rep can be represented more efficiently as just a greater-than test, but the compiler doesn't seem to figure it out. Help it out.

There is also some refactoring to simplify the code and add more debugging.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86230
Approved by: https://github.com/albanD
2022-10-05 19:01:36 +00:00
8c6d352bcf Log a new "timer expired" event to Scuba in file_based_local_timer (#85861)
Summary: The "kill worker process" event was logged to Scuba only when the worker process was actually reaped. We want to add a new "timer expired" event, regardless of whether the worker process will be reaped or not. This will help collect data before we enable the JustKnob to kill the worker process on timeout.

Test Plan:
### Unit Test
```
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test
```
```
Test Session: https://www.internalfb.com/intern/testinfra/testrun/7318349508929624
RE: reSessionID-ea464c43-54e7-44f2-942b-14ea8aa98c74  Up: 10.5 KiB  Down: 1.1 MiB
Jobs completed: 100. Time elapsed: 3206.9s. Cache hits: 91%. Commands: 11 (cached: 10, remote: 1, local: 0)
Tests finished: Pass 55. Fail 0. Fatal 0. Skip 0. 0 builds failed
```
--------
```
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/fb:local_agent_fb_internal_test
```
```
Test Session: https://www.internalfb.com/intern/testinfra/testrun/6473924579130483
RE: reSessionID-231a47b7-a43d-4c0f-9f73-64713ffcbbd3  Up: 5.7 MiB  Down: 1.9 GiB
Jobs completed: 182156. Time elapsed: 282.4s. Cache hits: 99%. Commands: 72112 (cached: 72107, remote: 1, local: 4)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. 0 builds failed
```

Differential Revision: D39903376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85861
Approved by: https://github.com/d4l3k
2022-10-05 18:23:53 +00:00
fc94a2115b Revert "SymIntify cat and narrow (#86191)"
This reverts commit 63d8d4f6ec5c973ad7b8669cd39ee9b550e5f55b.

Reverted https://github.com/pytorch/pytorch/pull/86191 on behalf of https://github.com/seemethere due to Fails internal tests, see [D40106464](https://www.internalfb.com/diff/D40106464)
2022-10-05 17:19:55 +00:00
3ec71fce79 Improve make_tensor performance for float and complex types (#85473)
For floating types, `make_tensor` calls `rand` and then does a linear
interpolation from `low` to `high`. This instead calls `uniform_(low,
high)` to cut out the interpolation step.

For complex types, `make_tensor` does the `rand` + interpolation step
twice and calls `torch.complex(real, imag)` at the end. This instead
uses `view_as_real` and `uniform_(low, high)` to fuse it all into one
operation.
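
Roughly, the idea is the following (a hedged sketch, not the exact `make_tensor` code):
```python
import torch

shape, low, high = (4096,), -3.0, 3.0

# Old approach for floats: two steps (rand + linear interpolation).
old = torch.rand(shape) * (high - low) + low

# New approach for floats: a single in-place uniform_ call.
new = torch.empty(shape).uniform_(low, high)

# New approach for complex: fill real and imaginary parts through view_as_real.
c = torch.empty(shape, dtype=torch.complex64)
torch.view_as_real(c).uniform_(low, high)
```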

My benchmarks show significant speedups in all cases for float32 and
complex64.

| Device | dtype     | Size  | Master (us) | This PR (us) | Speedup |
|--------|-----------|-------|-------------|--------------|---------|
| CPU    | float32   | 8     | 19.4        | 6.34         | 3.1     |
|        |           | 4096  | 36.8        | 21.3         | 1.7     |
|        |           | 2**24 | 167,000     | 80,500       | 2.1     |
|        | complex32 | 8     | 37.0        | 7.57         | 4.9     |
|        |           | 4096  | 73.1        | 37.6         | 1.9     |
|        |           | 2**24 | 409,000     | 161,000      | 2.5     |
| CUDA   | float32   | 8     | 40.4        | 11.7         | 3.5     |
|        |           | 4096  | 38.7        | 11.7         | 3.3     |
|        |           | 2**24 | 2,300       | 238          | 9.7     |
|        | complex32 | 8     | 78.7        | 14           | 5.6     |
|        |           | 4096  | 82.7        | 13.8         | 6.0     |
|        |           | 2**24 | 5,520       | 489          | 11.3    |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85473
Approved by: https://github.com/mruberry
2022-10-05 17:05:20 +00:00
7f607e8cb5 [torchdynamo hash update] update the pinned torchdynamo hash (#85774)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned torchdynamo hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85774
Approved by: https://github.com/pytorchbot, https://github.com/malfet
2022-10-05 17:02:33 +00:00
97d2e1df55 [MPS] Fix GELU for torch.half (#86218)
Also, make sure it raises catchable errors if invoked with integral types.

Otherwise, it used to fail with the following fatal error when invoked for `torch.half`, and with similar errors when invoked for integral types:
```
loc("mps_multiply"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/4883e71d-37bd-11ed-b0ef-b25c5e9b9057/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":228:0)): error: input types 'tensor<2xf16>' and 'tensor<1xf32>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
```

Modified `test_gelu_simple` to check both forward and backward gradients for GELU.
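
Illustrative usage on a machine with the MPS backend:
```python
import torch

x = torch.randn(4, device="mps", dtype=torch.half)
y = torch.nn.functional.gelu(x)  # previously hit the fatal MPSGraph error above
print(y.dtype)  # torch.float16
```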
2022-10-05 09:09:17 -07:00
63d8d4f6ec SymIntify cat and narrow (#86191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86191
Approved by: https://github.com/ezyang
2022-10-05 14:46:55 +00:00
0e03dc5f1e Remove softmax from recomputable ops (#86268)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86268
Approved by: https://github.com/ezyang
2022-10-05 14:16:53 +00:00
c609768896 Add refs for torch.unfold and a decomposition for its backward. (#85629)
It's not clear to me what the difference is between `unfold` and `unfold_copy`, as the latter is codegen'd.

I also took this chance to clean up the implementation of unfold and its reference.
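
For reference, what `Tensor.unfold` computes (illustrative):
```python
import torch

t = torch.arange(6.0)
# Sliding windows of size 3 with step 2 along dim 0.
print(t.unfold(0, 3, 2))
# tensor([[0., 1., 2.],
#         [2., 3., 4.]])
```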
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85629
Approved by: https://github.com/mruberry
2022-10-05 12:15:49 +00:00
5067 changed files with 485375 additions and 190307 deletions

View File

@ -1,4 +1,4 @@
build --cxxopt=--std=c++14 build --cxxopt=--std=c++17
build --copt=-I. build --copt=-I.
# Bazel does not support including its cc_library targets as system # Bazel does not support including its cc_library targets as system
# headers. We work around this for generated code # headers. We work around this for generated code

View File

@ -28,7 +28,7 @@ fi
# /usr/local/caffe2 is where the cpp bits are installed to in cmake-only # /usr/local/caffe2 is where the cpp bits are installed to in cmake-only
# builds. In +python builds the cpp tests are copied to /usr/local/caffe2 so # builds. In +python builds the cpp tests are copied to /usr/local/caffe2 so
# that the test code in .jenkins/test.sh is the same # that the test code in .ci/test.sh is the same
INSTALL_PREFIX="/usr/local/caffe2" INSTALL_PREFIX="/usr/local/caffe2"
mkdir -p "$gtest_reports_dir" || true mkdir -p "$gtest_reports_dir" || true

View File

@ -149,6 +149,9 @@ export DNNL_MAX_CPU_ISA=AVX2
# Should still run even in the absence of SHARD_NUMBER # Should still run even in the absence of SHARD_NUMBER
if [[ "${SHARD_NUMBER:-1}" == "1" ]]; then if [[ "${SHARD_NUMBER:-1}" == "1" ]]; then
# TODO(sdym@meta.com) remove this when the linked issue resolved.
# py is temporary until https://github.com/Teemu/pytest-sugar/issues/241 is fixed
pip install --user py==1.11.0
pip install --user pytest-sugar pip install --user pytest-sugar
# NB: Warnings are disabled because they make it harder to see what # NB: Warnings are disabled because they make it harder to see what
# the actual erroring test is # the actual erroring test is
@ -167,17 +170,3 @@ if [[ "${SHARD_NUMBER:-1}" == "1" ]]; then
"$caffe2_pypath/python" \ "$caffe2_pypath/python" \
"${EXTRA_TESTS[@]}" "${EXTRA_TESTS[@]}"
fi fi
##############
# ONNX tests #
##############
if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then
pip install -q --user --no-use-pep517 "git+https://github.com/pytorch/vision.git@$(cat .github/ci_commit_pins/vision.txt)"
pip install -q --user ninja flatbuffers==2.0 numpy==1.21.5 onnxruntime==1.12.1 beartype==0.10.4
# numba requires numpy <= 1.20, onnxruntime requires numpy >= 1.21.
# We don't actually need it for our tests, but it's imported if it's present, so uninstall.
pip uninstall -q --yes numba
# JIT C++ extensions require ninja, so put it into PATH.
export PATH="/var/lib/jenkins/.local/bin:$PATH"
"$ROOT_DIR/scripts/onnx/test.sh"
fi

View File

@ -33,7 +33,7 @@ function extract_all_from_image_name() {
if [ "x${name}" = xpy ]; then if [ "x${name}" = xpy ]; then
vername=ANACONDA_PYTHON_VERSION vername=ANACONDA_PYTHON_VERSION
fi fi
# skip non-conforming fields such as "pytorch", "linux" or "xenial" without version string # skip non-conforming fields such as "pytorch", "linux" or "bionic" without version string
if [ -n "${name}" ]; then if [ -n "${name}" ]; then
extract_version_from_image_name "${name}" "${vername}" extract_version_from_image_name "${name}" "${vername}"
fi fi
@ -46,11 +46,7 @@ if [[ "$image" == *xla* ]]; then
exit 0 exit 0
fi fi
if [[ "$image" == *-xenial* ]]; then if [[ "$image" == *-bionic* ]]; then
UBUNTU_VERSION=16.04
elif [[ "$image" == *-artful* ]]; then
UBUNTU_VERSION=17.10
elif [[ "$image" == *-bionic* ]]; then
UBUNTU_VERSION=18.04 UBUNTU_VERSION=18.04
elif [[ "$image" == *-focal* ]]; then elif [[ "$image" == *-focal* ]]; then
UBUNTU_VERSION=20.04 UBUNTU_VERSION=20.04
@ -77,69 +73,21 @@ if [[ "$image" == *cuda* && "$UBUNTU_VERSION" != "22.04" ]]; then
DOCKERFILE="${OS}-cuda/Dockerfile" DOCKERFILE="${OS}-cuda/Dockerfile"
elif [[ "$image" == *rocm* ]]; then elif [[ "$image" == *rocm* ]]; then
DOCKERFILE="${OS}-rocm/Dockerfile" DOCKERFILE="${OS}-rocm/Dockerfile"
elif [[ "$image" == *linter* ]]; then
# Use a separate Dockerfile for linter to keep a small image size
DOCKERFILE="linter/Dockerfile"
fi fi
if [[ "$image" == *xenial* ]] || [[ "$image" == *bionic* ]]; then # CMake 3.18 is needed to support CUDA17 language variant
CMAKE_VERSION=3.13.5 CMAKE_VERSION=3.18.5
fi
TRAVIS_DL_URL_PREFIX="https://s3.amazonaws.com/travis-python-archives/binaries/ubuntu/14.04/x86_64"
_UCX_COMMIT=31e74cac7bee0ef66bef2af72e7d86d9c282e5ab _UCX_COMMIT=31e74cac7bee0ef66bef2af72e7d86d9c282e5ab
_UCC_COMMIT=12944da33f911daf505d9bbc51411233d0ed85e1 _UCC_COMMIT=1c7a7127186e7836f73aafbd7697bbc274a77eee
# It's annoying to rename jobs every time you want to rewrite a # It's annoying to rename jobs every time you want to rewrite a
# configuration, so we hardcode everything here rather than do it # configuration, so we hardcode everything here rather than do it
# from scratch # from scratch
case "$image" in case "$image" in
pytorch-linux-xenial-py3.8)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=7
# Do not install PROTOBUF, DB, and VISION as a test
;;
pytorch-linux-xenial-py3.7-gcc7.2)
ANACONDA_PYTHON_VERSION=3.7
GCC_VERSION=7
# Do not install PROTOBUF, DB, and VISION as a test
;;
pytorch-linux-xenial-py3.7-gcc7)
ANACONDA_PYTHON_VERSION=3.7
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7)
CUDA_VERSION=10.2
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.7
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-xenial-cuda11.3-cudnn8-py3-gcc7)
CUDA_VERSION=11.3.0 # Deviating from major.minor to conform to nvidia's Docker image names
CUDNN_VERSION=8
TENSORRT_VERSION=8.0.1.6
ANACONDA_PYTHON_VERSION=3.7
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-bionic-cuda11.3-cudnn8-py3-clang9)
CUDA_VERSION=11.3.0 # Deviating from major.minor to conform to nvidia's Docker image names
CUDNN_VERSION=8
TENSORRT_VERSION=8.0.1.6
ANACONDA_PYTHON_VERSION=3.7
CLANG_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
;;
pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7) pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7)
CUDA_VERSION=11.6.2 CUDA_VERSION=11.6.2
CUDNN_VERSION=8 CUDNN_VERSION=8
@ -151,6 +99,7 @@ case "$image" in
KATEX=yes KATEX=yes
UCX_COMMIT=${_UCX_COMMIT} UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT} UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
;; ;;
pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7) pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7)
CUDA_VERSION=11.7.0 CUDA_VERSION=11.7.0
@ -163,45 +112,40 @@ case "$image" in
KATEX=yes KATEX=yes
UCX_COMMIT=${_UCX_COMMIT} UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT} UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
;; ;;
pytorch-linux-xenial-py3-clang5-asan) pytorch-linux-bionic-cuda11.8-cudnn8-py3-gcc7)
ANACONDA_PYTHON_VERSION=3.7 CUDA_VERSION=11.8.0
CLANG_VERSION=5.0 CUDNN_VERSION=8
PROTOBUF=yes ANACONDA_PYTHON_VERSION=3.10
DB=yes GCC_VERSION=7
VISION=yes
;;
pytorch-linux-xenial-py3-clang7-asan)
ANACONDA_PYTHON_VERSION=3.7
CLANG_VERSION=7
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
;; ;;
pytorch-linux-focal-py3-clang7-asan) pytorch-linux-focal-py3-clang7-asan)
ANACONDA_PYTHON_VERSION=3.7 ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-xenial-py3-clang7-onnx)
ANACONDA_PYTHON_VERSION=3.7
CLANG_VERSION=7 CLANG_VERSION=7
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
CONDA_CMAKE=yes
;; ;;
pytorch-linux-focal-py3-clang10-onnx) pytorch-linux-focal-py3-clang10-onnx)
ANACONDA_PYTHON_VERSION=3.7 ANACONDA_PYTHON_VERSION=3.8
CLANG_VERSION=10 CLANG_VERSION=10
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
CONDA_CMAKE=yes
;; ;;
pytorch-linux-xenial-py3-clang5-android-ndk-r19c) pytorch-linux-focal-py3-clang7-android-ndk-r19c)
ANACONDA_PYTHON_VERSION=3.7 ANACONDA_PYTHON_VERSION=3.7
CLANG_VERSION=5.0 CLANG_VERSION=7
LLVMDEV=yes LLVMDEV=yes
PROTOBUF=yes PROTOBUF=yes
ANDROID=yes ANDROID=yes
@ -209,21 +153,25 @@ case "$image" in
GRADLE_VERSION=6.8.3 GRADLE_VERSION=6.8.3
NINJA_VERSION=1.9.0 NINJA_VERSION=1.9.0
;; ;;
pytorch-linux-xenial-py3.7-clang7) pytorch-linux-bionic-py3.8-clang9)
ANACONDA_PYTHON_VERSION=3.7 ANACONDA_PYTHON_VERSION=3.8
CLANG_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-bionic-py3.7-clang9)
ANACONDA_PYTHON_VERSION=3.7
CLANG_VERSION=9 CLANG_VERSION=9
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
VULKAN_SDK_VERSION=1.2.162.1 VULKAN_SDK_VERSION=1.2.162.1
SWIFTSHADER=yes SWIFTSHADER=yes
CONDA_CMAKE=yes
;;
pytorch-linux-bionic-py3.11-clang9)
ANACONDA_PYTHON_VERSION=3.11
CLANG_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
VULKAN_SDK_VERSION=1.2.162.1
SWIFTSHADER=yes
CONDA_CMAKE=yes
;; ;;
pytorch-linux-bionic-py3.8-gcc9) pytorch-linux-bionic-py3.8-gcc9)
ANACONDA_PYTHON_VERSION=3.8 ANACONDA_PYTHON_VERSION=3.8
@ -231,49 +179,36 @@ case "$image" in
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
CONDA_CMAKE=yes
;; ;;
pytorch-linux-bionic-cuda10.2-cudnn7-py3.7-clang9) pytorch-linux-focal-rocm-n-1-py3)
CUDA_VERSION=10.2 ANACONDA_PYTHON_VERSION=3.8
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.7
CLANG_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-bionic-cuda10.2-cudnn7-py3.9-gcc7)
CUDA_VERSION=10.2
CUDNN_VERSION=7
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-focal-rocm5.1-py3.7)
ANACONDA_PYTHON_VERSION=3.7
GCC_VERSION=9 GCC_VERSION=9
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
ROCM_VERSION=5.1.1 ROCM_VERSION=5.3
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
;; ;;
pytorch-linux-focal-rocm5.2-py3.7) pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.7 ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9 GCC_VERSION=9
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
ROCM_VERSION=5.2 ROCM_VERSION=5.4.2
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
;; ;;
pytorch-linux-focal-py3.7-gcc7) pytorch-linux-focal-py3.8-gcc7)
ANACONDA_PYTHON_VERSION=3.7 ANACONDA_PYTHON_VERSION=3.8
CMAKE_VERSION=3.16.9 # Required for precompiled header support
GCC_VERSION=7 GCC_VERSION=7
PROTOBUF=yes PROTOBUF=yes
DB=yes DB=yes
VISION=yes VISION=yes
KATEX=yes KATEX=yes
CONDA_CMAKE=yes
;; ;;
pytorch-linux-jammy-cuda11.6-cudnn8-py3.8-clang12) pytorch-linux-jammy-cuda11.6-cudnn8-py3.8-clang12)
ANACONDA_PYTHON_VERSION=3.8 ANACONDA_PYTHON_VERSION=3.8
@ -293,6 +228,22 @@ case "$image" in
DB=yes DB=yes
VISION=yes VISION=yes
;; ;;
pytorch-linux-jammy-cuda11.8-cudnn8-py3.8-clang12)
ANACONDA_PYTHON_VERSION=3.8
CUDA_VERSION=11.8
CUDNN_VERSION=8
CLANG_VERSION=12
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-focal-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
# would be to upgrade mypy to 1.0.0 with Python 3.11
ANACONDA_PYTHON_VERSION=3.9
CONDA_CMAKE=yes
;;
*) *)
# Catch-all for builds that are not hardcoded. # Catch-all for builds that are not hardcoded.
PROTOBUF=yes PROTOBUF=yes
@ -308,6 +259,10 @@ case "$image" in
fi fi
if [[ "$image" == *rocm* ]]; then if [[ "$image" == *rocm* ]]; then
extract_version_from_image_name rocm ROCM_VERSION extract_version_from_image_name rocm ROCM_VERSION
NINJA_VERSION=1.9.0
fi
if [[ "$image" == *centos7* ]]; then
NINJA_VERSION=1.10.2
fi fi
if [[ "$image" == *gcc* ]]; then if [[ "$image" == *gcc* ]]; then
extract_version_from_image_name gcc GCC_VERSION extract_version_from_image_name gcc GCC_VERSION
@ -327,12 +282,6 @@ case "$image" in
;; ;;
esac esac
# Set Jenkins UID and GID if running Jenkins
if [ -n "${JENKINS:-}" ]; then
JENKINS_UID=$(id -u jenkins)
JENKINS_GID=$(id -g jenkins)
fi
tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]') tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
#when using cudnn version 8 install it separately from cuda #when using cudnn version 8 install it separately from cuda
@ -349,17 +298,12 @@ fi
docker build \ docker build \
--no-cache \ --no-cache \
--progress=plain \ --progress=plain \
--build-arg "TRAVIS_DL_URL_PREFIX=${TRAVIS_DL_URL_PREFIX}" \
--build-arg "BUILD_ENVIRONMENT=${image}" \ --build-arg "BUILD_ENVIRONMENT=${image}" \
--build-arg "PROTOBUF=${PROTOBUF:-}" \ --build-arg "PROTOBUF=${PROTOBUF:-}" \
--build-arg "THRIFT=${THRIFT:-}" \ --build-arg "THRIFT=${THRIFT:-}" \
--build-arg "LLVMDEV=${LLVMDEV:-}" \ --build-arg "LLVMDEV=${LLVMDEV:-}" \
--build-arg "DB=${DB:-}" \ --build-arg "DB=${DB:-}" \
--build-arg "VISION=${VISION:-}" \ --build-arg "VISION=${VISION:-}" \
--build-arg "EC2=${EC2:-}" \
--build-arg "JENKINS=${JENKINS:-}" \
--build-arg "JENKINS_UID=${JENKINS_UID:-}" \
--build-arg "JENKINS_GID=${JENKINS_GID:-}" \
--build-arg "UBUNTU_VERSION=${UBUNTU_VERSION}" \ --build-arg "UBUNTU_VERSION=${UBUNTU_VERSION}" \
--build-arg "CENTOS_VERSION=${CENTOS_VERSION}" \ --build-arg "CENTOS_VERSION=${CENTOS_VERSION}" \
--build-arg "DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}" \ --build-arg "DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}" \
@ -383,6 +327,7 @@ docker build \
--build-arg "IMAGE_NAME=${IMAGE_NAME}" \ --build-arg "IMAGE_NAME=${IMAGE_NAME}" \
--build-arg "UCX_COMMIT=${UCX_COMMIT}" \ --build-arg "UCX_COMMIT=${UCX_COMMIT}" \
--build-arg "UCC_COMMIT=${UCC_COMMIT}" \ --build-arg "UCC_COMMIT=${UCC_COMMIT}" \
--build-arg "CONDA_CMAKE=${CONDA_CMAKE}" \
-f $(dirname ${DOCKERFILE})/Dockerfile \ -f $(dirname ${DOCKERFILE})/Dockerfile \
-t "$tmp_tag" \ -t "$tmp_tag" \
"$@" \ "$@" \

View File

@ -18,7 +18,6 @@ tag="${DOCKER_TAG}"
registry="308535385114.dkr.ecr.us-east-1.amazonaws.com" registry="308535385114.dkr.ecr.us-east-1.amazonaws.com"
image="${registry}/pytorch/${IMAGE_NAME}" image="${registry}/pytorch/${IMAGE_NAME}"
ghcr_image="ghcr.io/pytorch/ci-image"
login() { login() {
aws ecr get-authorization-token --region us-east-1 --output text --query 'authorizationData[].authorizationToken' | aws ecr get-authorization-token --region us-east-1 --output text --query 'authorizationData[].authorizationToken' |
@ -36,9 +35,6 @@ if [[ -z "${GITHUB_ACTIONS}" ]]; then
trap "docker logout ${registry}" EXIT trap "docker logout ${registry}" EXIT
fi fi
# export EC2=1
# export JENKINS=1
# Try to pull the previous image (perhaps we can reuse some layers) # Try to pull the previous image (perhaps we can reuse some layers)
# if [ -n "${last_tag}" ]; then # if [ -n "${last_tag}" ]; then
# docker pull "${image}:${last_tag}" || true # docker pull "${image}:${last_tag}" || true
@ -55,13 +51,6 @@ if [ "${DOCKER_SKIP_PUSH:-true}" = "false" ]; then
if ! docker manifest inspect "${image}:${tag}" >/dev/null 2>/dev/null; then if ! docker manifest inspect "${image}:${tag}" >/dev/null 2>/dev/null; then
docker push "${image}:${tag}" docker push "${image}:${tag}"
fi fi
if [ "${PUSH_GHCR_IMAGE:-}" = "true" ]; then
# Push docker image to the ghcr.io
echo $GHCR_PAT | docker login ghcr.io -u pytorch --password-stdin
docker tag "${image}:${tag}" "${ghcr_image}:${IMAGE_NAME}-${tag}"
docker push "${ghcr_image}:${IMAGE_NAME}-${tag}"
fi
fi fi
if [ -z "${DOCKER_SKIP_S3_UPLOAD:-}" ]; then if [ -z "${DOCKER_SKIP_S3_UPLOAD:-}" ]; then

View File

@ -11,14 +11,15 @@ ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
# Install required packages to build Caffe2 # Install required packages to build Caffe2
# Install common dependencies (so that this step can be cached separately) # Install common dependencies (so that this step can be cached separately)
ARG EC2
COPY ./common/install_base.sh install_base.sh COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh RUN bash ./install_base.sh && rm install_base.sh
# Update CentOS git version # Update CentOS git version
RUN yum -y remove git RUN yum -y remove git
RUN yum -y remove git-* RUN yum -y remove git-*
RUN yum -y install https://packages.endpoint.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm RUN yum -y install https://packages.endpoint.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm || \
(yum -y install https://packages.endpointdev.com/rhel/7/os/x86_64/endpoint-repo-1.9-1.x86_64.rpm && \
sed -i "s/packages.endpoint/packages.endpointdev/" /etc/yum.repos.d/endpoint.repo)
RUN yum install -y git RUN yum install -y git
# Install devtoolset # Install devtoolset
@ -38,12 +39,14 @@ COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest) # Install conda and other packages (e.g., numpy, pytest)
ENV PATH /opt/conda/bin:$PATH
ARG ANACONDA_PYTHON_VERSION ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh COPY ./common/install_conda.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh COPY ./common/common_utils.sh common_utils.sh
RUN rm /opt/conda/requirements-ci.txt RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# (optional) Install protobuf for ONNX # (optional) Install protobuf for ONNX
ARG PROTOBUF ARG PROTOBUF

View File

@ -0,0 +1,32 @@
#!/bin/bash
# Work around bug where devtoolset replaces sudo and breaks it.
if [ -n "$DEVTOOLSET_VERSION" ]; then
export SUDO=/bin/sudo
else
export SUDO=sudo
fi
as_jenkins() {
# NB: unsetting the environment variables works around a conda bug
# https://github.com/conda/conda/issues/6576
# NB: Pass on PATH and LD_LIBRARY_PATH to sudo invocation
# NB: This must be run from a directory that jenkins has access to,
# works around https://github.com/conda/conda-package-handling/pull/34
$SUDO -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env "PATH=$PATH" "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" $*
}
conda_install() {
# Ensure that the install command don't upgrade/downgrade Python
# This should be called as
# conda_install pkg1 pkg2 ... [-c channel]
as_jenkins conda install -q -n py_$ANACONDA_PYTHON_VERSION -y python="$ANACONDA_PYTHON_VERSION" $*
}
conda_run() {
as_jenkins conda run -n py_$ANACONDA_PYTHON_VERSION --no-capture-output $*
}
pip_install() {
as_jenkins conda run -n py_$ANACONDA_PYTHON_VERSION pip install --progress-bar off $*
}

View File

@ -68,7 +68,10 @@ install_ubuntu() {
sudo \ sudo \
vim \ vim \
jq \ jq \
libtool libtool \
vim \
unzip \
gdb
# Should resolve issues related to various apt package repository cert issues # Should resolve issues related to various apt package repository cert issues
# see: https://github.com/pytorch/pytorch/issues/65931 # see: https://github.com/pytorch/pytorch/issues/65931
@ -126,7 +129,9 @@ install_centos() {
opencv-devel \ opencv-devel \
sudo \ sudo \
wget \ wget \
vim vim \
unzip \
gdb
# Cleanup # Cleanup
yum clean all yum clean all
@ -152,7 +157,7 @@ esac
# Install Valgrind separately since the apt-get version is too old. # Install Valgrind separately since the apt-get version is too old.
mkdir valgrind_build && cd valgrind_build mkdir valgrind_build && cd valgrind_build
VALGRIND_VERSION=3.16.1 VALGRIND_VERSION=3.20.0
wget https://ossci-linux.s3.amazonaws.com/valgrind-${VALGRIND_VERSION}.tar.bz2 wget https://ossci-linux.s3.amazonaws.com/valgrind-${VALGRIND_VERSION}.tar.bz2
tar -xjf valgrind-${VALGRIND_VERSION}.tar.bz2 tar -xjf valgrind-${VALGRIND_VERSION}.tar.bz2
cd valgrind-${VALGRIND_VERSION} cd valgrind-${VALGRIND_VERSION}

View File

@ -5,7 +5,19 @@ set -ex
[ -n "$CMAKE_VERSION" ] [ -n "$CMAKE_VERSION" ]
# Remove system cmake install so it won't get used instead # Remove system cmake install so it won't get used instead
apt-get remove cmake -y ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
apt-get remove cmake -y
;;
centos)
yum remove cmake -y
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
# Turn 3.6.3 into v3.6 # Turn 3.6.3 into v3.6
path=$(echo "${CMAKE_VERSION}" | sed -e 's/\([0-9].[0-9]\+\).*/v\1/') path=$(echo "${CMAKE_VERSION}" | sed -e 's/\([0-9].[0-9]\+\).*/v\1/')

View File

@ -24,26 +24,12 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
mkdir -p /opt/conda mkdir -p /opt/conda
chown jenkins:jenkins /opt/conda chown jenkins:jenkins /opt/conda
# Work around bug where devtoolset replaces sudo and breaks it. source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
if [ -n "$DEVTOOLSET_VERSION" ]; then
SUDO=/bin/sudo
else
SUDO=sudo
fi
as_jenkins() {
# NB: unsetting the environment variables works around a conda bug
# https://github.com/conda/conda/issues/6576
# NB: Pass on PATH and LD_LIBRARY_PATH to sudo invocation
# NB: This must be run from a directory that jenkins has access to,
# works around https://github.com/conda/conda-package-handling/pull/34
$SUDO -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env "PATH=$PATH" "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" $*
}
pushd /tmp pushd /tmp
wget -q "${BASE_URL}/${CONDA_FILE}" wget -q "${BASE_URL}/${CONDA_FILE}"
chmod +x "${CONDA_FILE}" # NB: Manually invoke bash per https://github.com/conda/conda/issues/10431
as_jenkins ./"${CONDA_FILE}" -b -f -p "/opt/conda" as_jenkins bash "${CONDA_FILE}" -b -f -p "/opt/conda"
popd popd
# NB: Don't do this, rely on the rpath to get it right # NB: Don't do this, rely on the rpath to get it right
@ -61,24 +47,15 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# as_jenkins conda update -y -n base conda # as_jenkins conda update -y -n base conda
# Install correct Python version # Install correct Python version
as_jenkins conda install -y python="$ANACONDA_PYTHON_VERSION" as_jenkins conda create -n py_$ANACONDA_PYTHON_VERSION -y python="$ANACONDA_PYTHON_VERSION"
conda_install() {
# Ensure that the install command don't upgrade/downgrade Python
# This should be called as
# conda_install pkg1 pkg2 ... [-c channel]
as_jenkins conda install -q -y python="$ANACONDA_PYTHON_VERSION" $*
}
pip_install() {
as_jenkins pip install --progress-bar off $*
}
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README # Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
# DO NOT install cmake here as it would install a version newer than 3.13, but CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
# we want to pin to version 3.13. if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ]; then
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2022.0.1 mkl-include=2022.0.1 setuptools cffi future six" # Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
if [ "$ANACONDA_PYTHON_VERSION" = "3.10" ]; then # TODO: Stop using `-c malfet`
conda_install numpy=1.23.5 ${CONDA_COMMON_DEPS} llvmdev=8.0.0 -c malfet
elif [ "$ANACONDA_PYTHON_VERSION" = "3.10" ]; then
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source # Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS} llvmdev=8.0.0 conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS} llvmdev=8.0.0
elif [ "$ANACONDA_PYTHON_VERSION" = "3.9" ]; then elif [ "$ANACONDA_PYTHON_VERSION" = "3.9" ]; then
@ -88,8 +65,16 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source # Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
conda_install numpy=1.18.5 ${CONDA_COMMON_DEPS} llvmdev=8.0.0 conda_install numpy=1.18.5 ${CONDA_COMMON_DEPS} llvmdev=8.0.0
else else
# Install `typing_extensions` for 3.7 # Install `typing-extensions` for 3.7
conda_install numpy=1.18.5 ${CONDA_COMMON_DEPS} typing_extensions conda_install numpy=1.18.5 ${CONDA_COMMON_DEPS} typing-extensions
fi
# Use conda cmake in some cases. Conda cmake will be newer than our supported
# min version (3.5 for xenial and 3.10 for bionic), so we only do it in those
# following builds that we know should use conda. Specifically, Ubuntu bionic
# and focal cannot find conda mkl with stock cmake, so we need a cmake from conda
if [ -n "${CONDA_CMAKE}" ]; then
conda_install cmake
fi fi
# Magma package names are concatenation of CUDA major and minor ignoring revision # Magma package names are concatenation of CUDA major and minor ignoring revision
@ -98,9 +83,6 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
conda_install magma-cuda$(TMP=${CUDA_VERSION/./};echo ${TMP%.*[0-9]}) -c pytorch conda_install magma-cuda$(TMP=${CUDA_VERSION/./};echo ${TMP%.*[0-9]}) -c pytorch
fi fi
# TODO: This isn't working atm
conda_install nnpack -c killeent
# Install some other packages, including those needed for Python test reporting # Install some other packages, including those needed for Python test reporting
pip_install -r /opt/conda/requirements-ci.txt pip_install -r /opt/conda/requirements-ci.txt

View File

@ -6,9 +6,12 @@ if [[ ${CUDNN_VERSION} == 8 ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive" CUDNN_NAME="cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive"
if [[ ${CUDA_VERSION:0:4} == "11.7" ]]; then if [[ ${CUDA_VERSION:0:4} == "11.7" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.5.0.96_cuda11-archive" CUDNN_NAME="cudnn-linux-x86_64-8.5.0.96_cuda11-archive"
curl -OLs https://ossci-linux.s3.amazonaws.com/${CUDNN_NAME}.tar.xz curl --retry 3 -OLs https://ossci-linux.s3.amazonaws.com/${CUDNN_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.7.0.84_cuda11-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/${CUDNN_NAME}.tar.xz
else else
curl -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/${CUDNN_NAME}.tar.xz curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/${CUDNN_NAME}.tar.xz
fi fi
tar xf ${CUDNN_NAME}.tar.xz tar xf ${CUDNN_NAME}.tar.xz

View File

@ -7,10 +7,10 @@ if [ -n "$KATEX" ]; then
# Ignore error if gpg-agent doesn't exist (for Ubuntu 16.04) # Ignore error if gpg-agent doesn't exist (for Ubuntu 16.04)
apt-get install -y gpg-agent || : apt-get install -y gpg-agent || :
curl -sL https://deb.nodesource.com/setup_12.x | sudo -E bash - curl --retry 3 -sL https://deb.nodesource.com/setup_12.x | sudo -E bash -
sudo apt-get install -y nodejs sudo apt-get install -y nodejs
curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add - curl --retry 3 -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add -
echo "deb https://dl.yarnpkg.com/debian/ stable main" | sudo tee /etc/apt/sources.list.d/yarn.list echo "deb https://dl.yarnpkg.com/debian/ stable main" | sudo tee /etc/apt/sources.list.d/yarn.list
apt-get update apt-get update

View File

@ -0,0 +1,29 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
if [ -n "${UBUNTU_VERSION}" ]; then
apt update
apt-get install -y clang doxygen git graphviz nodejs npm libtinfo5
fi
# Do shallow clone of PyTorch so that we can init lintrunner in Docker build context
git clone https://github.com/pytorch/pytorch.git --depth 1
chown -R jenkins pytorch
pushd pytorch
# Install all linter dependencies
pip_install -r requirements.txt
conda_run lintrunner init
# Cache .lintbin directory as part of the Docker image
cp -r .lintbin /tmp
popd
# Node dependencies required by toc linter job
npm install -g markdown-toc
# Cleaning up
rm -rf pytorch

View File

@ -12,7 +12,7 @@ install_protobuf_317() {
# g++: error: ./../lib64/crti.o: No such file or directory # g++: error: ./../lib64/crti.o: No such file or directory
ln -s /usr/lib64 "$pb_dir/lib64" ln -s /usr/lib64 "$pb_dir/lib64"
curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" --retry 3
tar -xvz -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz tar -xvz -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
# -j6 to balance memory usage and speed. # -j6 to balance memory usage and speed.
# naked `-j` seems to use too much memory. # naked `-j` seems to use too much memory.

View File

@ -29,7 +29,12 @@ install_ubuntu() {
if [[ $(ver $ROCM_VERSION) -ge $(ver 4.5) ]]; then if [[ $(ver $ROCM_VERSION) -ge $(ver 4.5) ]]; then
# Add amdgpu repository # Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'` UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
local amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/ubuntu" local amdgpu_baseurl
if [[ $(ver $ROCM_VERSION) -ge $(ver 5.3) ]]; then
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu"
else
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/ubuntu"
fi
echo "deb [arch=amd64] ${amdgpu_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list echo "deb [arch=amd64] ${amdgpu_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
fi fi
@ -38,6 +43,10 @@ install_ubuntu() {
ROCM_REPO="xenial" ROCM_REPO="xenial"
fi fi
if [[ $(ver $ROCM_VERSION) -ge $(ver 5.3) ]]; then
ROCM_REPO="${UBUNTU_VERSION_NAME}"
fi
# Add rocm repository # Add rocm repository
wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add - wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -
local rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}" local rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}"
@ -78,7 +87,16 @@ install_centos() {
if [[ $(ver $ROCM_VERSION) -ge $(ver 4.5) ]]; then if [[ $(ver $ROCM_VERSION) -ge $(ver 4.5) ]]; then
# Add amdgpu repository # Add amdgpu repository
local amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/rhel/7.9/main/x86_64" local amdgpu_baseurl
if [[ $OS_VERSION == 9 ]]; then
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/rhel/9.0/main/x86_64"
else
if [[ $(ver $ROCM_VERSION) -ge $(ver 5.3) ]]; then
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/rhel/7.9/main/x86_64"
else
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${AMDGPU_VERSIONS[$ROCM_VERSION]}/rhel/7.9/main/x86_64"
fi
fi
echo "[AMDGPU]" > /etc/yum.repos.d/amdgpu.repo echo "[AMDGPU]" > /etc/yum.repos.d/amdgpu.repo
echo "name=AMDGPU" >> /etc/yum.repos.d/amdgpu.repo echo "name=AMDGPU" >> /etc/yum.repos.d/amdgpu.repo
echo "baseurl=${amdgpu_baseurl}" >> /etc/yum.repos.d/amdgpu.repo echo "baseurl=${amdgpu_baseurl}" >> /etc/yum.repos.d/amdgpu.repo

View File

@ -23,7 +23,7 @@ done
# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition # hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition
sed -i 's/^FOPENMP/#FOPENMP/g' make.inc sed -i 's/^FOPENMP/#FOPENMP/g' make.inc
make -f make.gen.hipMAGMA -j $(nproc) make -f make.gen.hipMAGMA -j $(nproc)
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION
make testing/testing_dgemm -j $(nproc) MKLROOT=/opt/conda make testing/testing_dgemm -j $(nproc) MKLROOT=/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION
popd popd
mv magma /opt/rocm mv magma /opt/rocm


@ -22,5 +22,12 @@ chown jenkins:jenkins /usr/local
# TODO: Maybe we shouldn't # TODO: Maybe we shouldn't
echo 'jenkins ALL=(ALL) NOPASSWD:ALL' > /etc/sudoers.d/jenkins echo 'jenkins ALL=(ALL) NOPASSWD:ALL' > /etc/sudoers.d/jenkins
# Work around bug where devtoolset replaces sudo and breaks it.
if [ -n "$DEVTOOLSET_VERSION" ]; then
SUDO=/bin/sudo
else
SUDO=sudo
fi
# Test that sudo works # Test that sudo works
sudo -u jenkins sudo -v $SUDO -u jenkins $SUDO -v


@ -0,0 +1,34 @@
ARG UBUNTU_VERSION
FROM ubuntu:${UBUNTU_VERSION}
ARG UBUNTU_VERSION
ENV DEBIAN_FRONTEND noninteractive
# Install common dependencies (so that this step can be cached separately)
COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Install user
COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# Note that Docker build forbids copying file outside the build context
COPY ./common/install_linter.sh install_linter.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_linter.sh
RUN rm install_linter.sh common_utils.sh
USER jenkins
CMD ["bash"]


@ -36,11 +36,6 @@ flatbuffers==2.0
#Pinned versions: 2.0 #Pinned versions: 2.0
#test that import: #test that import:
#future #this breaks linux-bionic-rocm4.5-py3.7
#Description: compatibility layer between python 2 and python 3
#Pinned versions:
#test that import:
hypothesis==5.35.1 hypothesis==5.35.1
# Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136 # Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
#Description: advanced library for generating parametrized tests #Description: advanced library for generating parametrized tests
@ -52,7 +47,7 @@ junitparser==2.1.1
#Pinned versions: 2.1.1 #Pinned versions: 2.1.1
#test that import: #test that import:
librosa>=0.6.2 librosa>=0.6.2 ; python_version < "3.11"
#Description: A python package for music and audio analysis #Description: A python package for music and audio analysis
#Pinned versions: >=0.6.2 #Pinned versions: >=0.6.2
#test that import: test_spectral_ops.py #test that import: test_spectral_ops.py
@ -159,8 +154,13 @@ pytest-shard
#Pinned versions: #Pinned versions:
#test that import: #test that import:
pytest-flakefinder==1.1.0
#Description: plugin for rerunning tests a fixed number of times in pytest
#Pinned versions: 1.1.0
#test that import:
pytest-rerunfailures pytest-rerunfailures
#Description: plugin for rerunning tests in pytest #Description: plugin for rerunning failure tests in pytest
#Pinned versions: #Pinned versions:
#test that import: #test that import:
@ -174,9 +174,9 @@ pytest-rerunfailures
#Pinned versions: #Pinned versions:
#test that import: #test that import:
xdoctest==1.0.2 xdoctest==1.1.0
#Description: runs doctests in pytest #Description: runs doctests in pytest
#Pinned versions: 1.0.2 #Pinned versions: 1.1.0
#test that import: #test that import:
pygments==2.12.0 pygments==2.12.0
@ -211,6 +211,7 @@ scikit-image
scipy==1.6.3 ; python_version < "3.10" scipy==1.6.3 ; python_version < "3.10"
scipy==1.8.1 ; python_version == "3.10" scipy==1.8.1 ; python_version == "3.10"
scipy==1.9.3 ; python_version == "3.11"
# Pin SciPy because of failing distribution tests (see #60347) # Pin SciPy because of failing distribution tests (see #60347)
#Description: scientific python #Description: scientific python
#Pinned versions: 1.6.3 #Pinned versions: 1.6.3
@ -242,3 +243,18 @@ unittest-xml-reporting<=3.2.0,>=2.0.0
#Description: saves unit test results to xml #Description: saves unit test results to xml
#Pinned versions: #Pinned versions:
#test that import: #test that import:
lintrunner==0.9.2
#Description: all about linters
#Pinned versions: 0.9.2
#test that import:
rockset==1.0.3
#Description: queries Rockset
#Pinned versions: 1.0.3
#test that import:
ghstack==0.7.1
#Description: ghstack tool
#Pinned versions: 0.7.1
#test that import:


@ -10,7 +10,6 @@ ARG CUDA_VERSION
ENV DEBIAN_FRONTEND noninteractive ENV DEBIAN_FRONTEND noninteractive
# Install common dependencies (so that this step can be cached separately) # Install common dependencies (so that this step can be cached separately)
ARG EC2
COPY ./common/install_base.sh install_base.sh COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh RUN bash ./install_base.sh && rm install_base.sh
@ -24,12 +23,14 @@ COPY ./common/install_docs_reqs.sh install_docs_reqs.sh
RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest) # Install conda and other packages (e.g., numpy, pytest)
ENV PATH /opt/conda/bin:$PATH
ARG ANACONDA_PYTHON_VERSION ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
ARG CONDA_CMAKE
COPY requirements-ci.txt /opt/conda/requirements-ci.txt COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh COPY ./common/install_conda.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh COPY ./common/common_utils.sh common_utils.sh
RUN rm /opt/conda/requirements-ci.txt RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# Install gcc # Install gcc
ARG GCC_VERSION ARG GCC_VERSION


@ -11,7 +11,6 @@ ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH} ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
# Install common dependencies (so that this step can be cached separately) # Install common dependencies (so that this step can be cached separately)
ARG EC2
COPY ./common/install_base.sh install_base.sh COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh RUN bash ./install_base.sh && rm install_base.sh
@ -26,12 +25,14 @@ COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh RUN bash ./install_user.sh && rm install_user.sh
# Install conda and other packages (e.g., numpy, pytest) # Install conda and other packages (e.g., numpy, pytest)
ENV PATH /opt/conda/bin:$PATH
ARG ANACONDA_PYTHON_VERSION ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh COPY ./common/install_conda.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh COPY ./common/common_utils.sh common_utils.sh
RUN rm /opt/conda/requirements-ci.txt RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# Install gcc # Install gcc
ARG GCC_VERSION ARG GCC_VERSION


@ -9,7 +9,6 @@ ENV DEBIAN_FRONTEND noninteractive
ARG CLANG_VERSION ARG CLANG_VERSION
# Install common dependencies (so that this step can be cached separately) # Install common dependencies (so that this step can be cached separately)
ARG EC2
COPY ./common/install_base.sh install_base.sh COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh RUN bash ./install_base.sh && rm install_base.sh
@ -35,12 +34,14 @@ COPY ./common/install_docs_reqs.sh install_docs_reqs.sh
RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest) # Install conda and other packages (e.g., numpy, pytest)
ENV PATH /opt/conda/bin:$PATH
ARG ANACONDA_PYTHON_VERSION ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh COPY ./common/install_conda.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh COPY ./common/common_utils.sh common_utils.sh
RUN rm /opt/conda/requirements-ci.txt RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# Install gcc # Install gcc
ARG GCC_VERSION ARG GCC_VERSION
@ -136,10 +137,6 @@ RUN rm install_openssl.sh
# Install ccache/sccache (do this last, so we get priority in PATH) # Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH ENV PATH /opt/cache/bin:$PATH
# See https://github.com/pytorch/pytorch/issues/82174
# TODO(sdym@fb.com):
# check if this is needed after full off Xenial migration
ENV CARGO_NET_GIT_FETCH_WITH_CLI true
RUN bash ./install_cache.sh && rm install_cache.sh RUN bash ./install_cache.sh && rm install_cache.sh
# Add jni.h for java host build # Add jni.h for java host build

.ci/onnx/README.md (new file, 14 lines)

@ -0,0 +1,14 @@
# Jenkins
The scripts in this directory are the entrypoint for testing the ONNX exporter.
The environment variable `BUILD_ENVIRONMENT` is expected to be set to
the build environment you intend to test. It is a hint for the build
and test scripts to configure Caffe2 a certain way and include/exclude
tests. For Docker images, it equals the name of the image itself. For
example: `py2-cuda9.0-cudnn7-ubuntu16.04`. The Docker images that are
built on Jenkins and are used in triggered builds already have this
environment variable set in their manifest. Also see
`./docker/jenkins/*/Dockerfile` and search for `BUILD_ENVIRONMENT`.
Our Jenkins installation is located at https://ci.pytorch.org/jenkins/.
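As a hedged illustration only (the image name is the example given above, and `.ci/onnx/test.sh` is the test entrypoint that lives next to this README), a local run could look like:

    export BUILD_ENVIRONMENT=py2-cuda9.0-cudnn7-ubuntu16.04
    bash .ci/onnx/test.sh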

.ci/onnx/common.sh (new file, 19 lines)

@ -0,0 +1,19 @@
set -ex
LOCAL_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
ROOT_DIR=$(cd "$LOCAL_DIR"/../.. && pwd)
TEST_DIR="$ROOT_DIR/test"
pytest_reports_dir="${TEST_DIR}/test-reports/python"
# Figure out which Python to use
PYTHON="$(which python)"
if [[ "${BUILD_ENVIRONMENT}" =~ py((2|3)\.?[0-9]?\.?[0-9]?) ]]; then
PYTHON=$(which "python${BASH_REMATCH[1]}")
fi
if [[ "${BUILD_ENVIRONMENT}" == *rocm* ]]; then
# HIP_PLATFORM is auto-detected by hipcc; unset to avoid build errors
unset HIP_PLATFORM
fi
mkdir -p "$pytest_reports_dir" || true

.ci/onnx/test.sh (new executable file, 74 lines)

@ -0,0 +1,74 @@
#!/bin/bash
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
if [[ ${BUILD_ENVIRONMENT} == *onnx* ]]; then
pip install click mock tabulate networkx==2.0
pip -q install --user "file:///var/lib/jenkins/workspace/third_party/onnx#egg=onnx"
fi
# Skip tests in environments where they are not built/applicable
if [[ "${BUILD_ENVIRONMENT}" == *-android* ]]; then
echo 'Skipping tests'
exit 0
fi
if [[ "${BUILD_ENVIRONMENT}" == *-rocm* ]]; then
# temporary to locate some kernel issues on the CI nodes
export HSAKMT_DEBUG_LEVEL=4
fi
# These additional packages are needed for circleci ROCm builds.
if [[ $BUILD_ENVIRONMENT == *rocm* ]]; then
# Need networkx 2.0 because bellman_ford was moved in 2.1. Scikit-image by
# default installs the most recent networkx version, so we install this lower
# version explicitly before scikit-image pulls it in as a dependency
pip install networkx==2.0
# click - onnx
pip install --progress-bar off click protobuf tabulate virtualenv mock typing-extensions
fi
################################################################################
# Python tests #
################################################################################
if [[ "$BUILD_ENVIRONMENT" == *cmake* ]]; then
exit 0
fi
# If pip is installed as root, we must use sudo.
# CircleCI docker images could install conda as jenkins user, or use the OS's python package.
PIP=$(which pip)
PIP_USER=$(stat --format '%U' $PIP)
CURRENT_USER=$(id -u -n)
if [[ "$PIP_USER" = root && "$CURRENT_USER" != root ]]; then
MAYBE_SUDO=sudo
fi
# Uninstall pre-installed hypothesis and coverage to use an older version as newer
# versions remove the timeout parameter from settings which ideep/conv_transpose_test.py uses
$MAYBE_SUDO pip -q uninstall -y hypothesis
$MAYBE_SUDO pip -q uninstall -y coverage
# "pip install hypothesis==3.44.6" from official server is unreliable on
# CircleCI, so we host a copy on S3 instead
$MAYBE_SUDO pip -q install attrs==18.1.0 -f https://s3.amazonaws.com/ossci-linux/wheels/attrs-18.1.0-py2.py3-none-any.whl
$MAYBE_SUDO pip -q install coverage==4.5.1 -f https://s3.amazonaws.com/ossci-linux/wheels/coverage-4.5.1-cp36-cp36m-macosx_10_12_x86_64.whl
$MAYBE_SUDO pip -q install hypothesis==4.57.1
##############
# ONNX tests #
##############
if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then
pip install -q --user --no-use-pep517 "git+https://github.com/pytorch/vision.git@$(cat .github/ci_commit_pins/vision.txt)"
pip install -q --user transformers==4.25.1
pip install -q --user ninja flatbuffers==2.0 numpy==1.22.4 onnxruntime==1.14.0 beartype==0.10.4
# TODO: change this when onnx 1.13.1 is released.
pip install --no-use-pep517 'onnx @ git+https://github.com/onnx/onnx@e192ba01e438d22ca2dedd7956e28e3551626c91'
# TODO: change this when onnx-script is on testPypi
pip install 'onnx-script @ git+https://github.com/microsoft/onnx-script@a71e35bcd72537bf7572536ee57250a0c0488bf6'
# numba requires numpy <= 1.20, onnxruntime requires numpy >= 1.21.
# We don't actually need it for our tests, but it's imported if it's present, so uninstall.
pip uninstall -q --yes numba
# JIT C++ extensions require ninja, so put it into PATH.
export PATH="/var/lib/jenkins/.local/bin:$PATH"
"$ROOT_DIR/scripts/onnx/test.sh"
fi


@ -10,7 +10,7 @@ it is very easy to run these tests yourself:
``registry.pytorch.org/pytorch/pytorch-$BUILD_ENVIRONMENT:$DOCKER_VERSION``, ``registry.pytorch.org/pytorch/pytorch-$BUILD_ENVIRONMENT:$DOCKER_VERSION``,
where ``$BUILD_ENVIRONMENT`` is one of the build environments where ``$BUILD_ENVIRONMENT`` is one of the build environments
enumerated in enumerated in
[pytorch-dockerfiles](https://github.com/pytorch/pytorch/blob/master/.circleci/docker/build.sh). The dockerfile used by jenkins can be found under the `.circle` [directory](https://github.com/pytorch/pytorch/blob/master/.circleci/docker) [pytorch-dockerfiles](https://github.com/pytorch/pytorch/blob/master/.ci/docker/build.sh). The dockerfile used by jenkins can be found under the `.ci` [directory](https://github.com/pytorch/pytorch/blob/master/.ci/docker)
2. Run ``docker run -it -u jenkins $DOCKER_IMAGE``, clone PyTorch and
   run one of the scripts in this directory (a hedged example follows below).
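For instance, a hedged sketch of that flow with an illustrative tag (the real ``$DOCKER_IMAGE`` is assembled from ``$BUILD_ENVIRONMENT`` and ``$DOCKER_VERSION`` as described above, so the exact name below is an assumption):

    DOCKER_IMAGE=registry.pytorch.org/pytorch/pytorch-linux-focal-py3.7-gcc7:latest   # illustrative tag
    docker run -it -u jenkins "$DOCKER_IMAGE"
    # then, inside the container, clone PyTorch and run one of the scripts in this directory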


@ -26,7 +26,7 @@ CC="clang" CXX="clang++" LDSHARED="clang --shared" \
CFLAGS="-fsanitize=address -fsanitize=undefined -fno-sanitize-recover=all -fsanitize-address-use-after-scope -shared-libasan" \ CFLAGS="-fsanitize=address -fsanitize=undefined -fno-sanitize-recover=all -fsanitize-address-use-after-scope -shared-libasan" \
USE_ASAN=1 USE_CUDA=0 USE_MKLDNN=0 \ USE_ASAN=1 USE_CUDA=0 USE_MKLDNN=0 \
python setup.py bdist_wheel python setup.py bdist_wheel
python -mpip install "$(echo dist/*.whl)[opt-einsum]" pip_install_whl "$(echo dist/*.whl)"
# Test building via the sdist source tarball # Test building via the sdist source tarball
python setup.py sdist python setup.py sdist

.ci/pytorch/build-tsan.sh (new executable file, 29 lines)

@ -0,0 +1,29 @@
#!/bin/bash
# Required environment variable: $BUILD_ENVIRONMENT
# (This is set by default in the Docker images we build, so you don't
# need to set it yourself.)
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
echo "Clang version:"
clang --version
python tools/stats/export_test_times.py
if [ -n "$(which conda)" ]; then
export CMAKE_PREFIX_PATH=/opt/conda
fi
CC="clang" CXX="clang++" LDSHARED="clang --shared" \
CFLAGS="-fsanitize=thread" \
USE_TSAN=1 USE_CUDA=0 USE_MKLDNN=0 \
python setup.py bdist_wheel
pip_install_whl "$(echo dist/*.whl)"
print_sccache_stats
assert_git_not_dirty


@ -15,14 +15,12 @@ if [[ "$BUILD_ENVIRONMENT" == *-clang7-asan* ]]; then
exec "$(dirname "${BASH_SOURCE[0]}")/build-asan.sh" "$@" exec "$(dirname "${BASH_SOURCE[0]}")/build-asan.sh" "$@"
fi fi
if [[ "$BUILD_ENVIRONMENT" == *-mobile-*build* ]]; then if [[ "$BUILD_ENVIRONMENT" == *-clang7-tsan* ]]; then
exec "$(dirname "${BASH_SOURCE[0]}")/build-mobile.sh" "$@" exec "$(dirname "${BASH_SOURCE[0]}")/build-tsan.sh" "$@"
fi fi
if [[ "$BUILD_ENVIRONMENT" == *deploy* ]]; then if [[ "$BUILD_ENVIRONMENT" == *-mobile-*build* ]]; then
# Enabling DEPLOY build (embedded torch python interpreter, experimental) exec "$(dirname "${BASH_SOURCE[0]}")/build-mobile.sh" "$@"
# only on one config for now, can expand later
export USE_DEPLOY=ON
fi fi
echo "Python version:" echo "Python version:"
@ -43,8 +41,6 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
fi fi
if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then
# enable split torch_cuda build option in CMake
export BUILD_SPLIT_CUDA=ON
if [[ "$BUILD_ENVIRONMENT" != *cuda11.3* && "$BUILD_ENVIRONMENT" != *clang* ]]; then if [[ "$BUILD_ENVIRONMENT" != *cuda11.3* && "$BUILD_ENVIRONMENT" != *clang* ]]; then
# TODO: there is a linking issue when building with UCC using clang,
# disable it for now, to be fixed later.
@ -53,7 +49,8 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then
fi fi
fi fi
if [[ ${BUILD_ENVIRONMENT} == *"caffe2"* || ${BUILD_ENVIRONMENT} == *"onnx"* ]]; then if [[ ${BUILD_ENVIRONMENT} == *"caffe2"* ]]; then
echo "Caffe2 build is ON"
export BUILD_CAFFE2=ON export BUILD_CAFFE2=ON
fi fi
@ -64,9 +61,6 @@ elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export ATEN_THREADING=NATIVE export ATEN_THREADING=NATIVE
fi fi
# TODO: Don't run this...
pip_install -r requirements.txt || true
# Enable LLVM dependency for TensorExpr testing # Enable LLVM dependency for TensorExpr testing
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
export USE_LLVM=/opt/rocm/llvm export USE_LLVM=/opt/rocm/llvm
@ -76,13 +70,11 @@ else
export LLVM_DIR=/opt/llvm/lib/cmake/llvm export LLVM_DIR=/opt/llvm/lib/cmake/llvm
fi fi
# TODO: Don't install this here
if ! which conda; then if ! which conda; then
# In ROCm CIs, we are doing cross compilation on build machines with # In ROCm CIs, we are doing cross compilation on build machines with
# intel cpu and later run tests on machines with amd cpu. # intel cpu and later run tests on machines with amd cpu.
# Also leave out two builds to make sure non-mkldnn builds still work. # Also leave out two builds to make sure non-mkldnn builds still work.
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
pip_install mkl mkl-devel
export USE_MKLDNN=1 export USE_MKLDNN=1
else else
export USE_MKLDNN=0 export USE_MKLDNN=0
@ -191,17 +183,8 @@ if [[ "${BUILD_ENVIRONMENT}" == *linux-focal-py3.7-gcc7-build* ]]; then
export USE_GLOO_WITH_OPENSSL=ON export USE_GLOO_WITH_OPENSSL=ON
fi fi
# TODO: Remove after xenial->focal migration if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
if [[ "${BUILD_ENVIRONMENT}" == pytorch-linux-xenial-py3* ]]; then export BUILD_STATIC_RUNTIME_BENCHMARK=ON
if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
fi
if [[ "${BUILD_ENVIRONMENT}" == pytorch-linux-focal-py3* ]]; then
if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
fi fi
if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
@ -209,9 +192,14 @@ if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
get_bazel get_bazel
tools/bazel build --config=no-tty //... # Leave 1 CPU free and use only up to 80% of memory to reduce the chance of crashing
# the runner
BAZEL_MEM_LIMIT="--local_ram_resources=HOST_RAM*.8"
BAZEL_CPU_LIMIT="--local_cpu_resources=HOST_CPUS-1"
tools/bazel build --config=no-tty "${BAZEL_MEM_LIMIT}" "${BAZEL_CPU_LIMIT}" //...
# Build torch, the Python module, and tests for CPU-only # Build torch, the Python module, and tests for CPU-only
tools/bazel build --config=no-tty --config=cpu-only :torch :_C.so :all_tests tools/bazel build --config=no-tty "${BAZEL_MEM_LIMIT}" "${BAZEL_CPU_LIMIT}" --config=cpu-only :torch :_C.so :all_tests
else else
# check that setup.py would fail with bad arguments # check that setup.py would fail with bad arguments
@ -232,7 +220,7 @@ else
else else
python setup.py bdist_wheel python setup.py bdist_wheel
fi fi
python -mpip install "$(echo dist/*.whl)[opt-einsum]" pip_install_whl "$(echo dist/*.whl)"
# TODO: I'm not sure why, but somehow we lose verbose commands # TODO: I'm not sure why, but somehow we lose verbose commands
set -x set -x
@ -304,6 +292,13 @@ else
else else
# Test no-Python build # Test no-Python build
echo "Building libtorch" echo "Building libtorch"
# This is an attempt to mitigate flaky libtorch build OOM errors. By default, the build parallelization
# is set to the number of CPUs minus 2. So, let's try a more conservative value here. A 4xlarge has
# 16 CPUs
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
# NB: Install outside of source directory (at the same level as the root # NB: Install outside of source directory (at the same level as the root
# pytorch folder) so that it doesn't get cleaned away prior to docker push. # pytorch folder) so that it doesn't get cleaned away prior to docker push.
BUILD_LIBTORCH_PY=$PWD/tools/build_libtorch.py BUILD_LIBTORCH_PY=$PWD/tools/build_libtorch.py


@ -3,8 +3,8 @@
# This script can also be used to test whether your diff changes any codegen output. # This script can also be used to test whether your diff changes any codegen output.
# #
# Run it before and after your change: # Run it before and after your change:
# .jenkins/pytorch/codegen-test.sh <baseline_output_dir> # .ci/pytorch/codegen-test.sh <baseline_output_dir>
# .jenkins/pytorch/codegen-test.sh <test_output_dir> # .ci/pytorch/codegen-test.sh <test_output_dir>
# #
# Then run diff to compare the generated files: # Then run diff to compare the generated files:
# diff -Naur <baseline_output_dir> <test_output_dir> # diff -Naur <baseline_output_dir> <test_output_dir>


@ -0,0 +1,58 @@
#!/bin/bash
# Required environment variables:
# $BUILD_ENVIRONMENT (should be set by your Docker image)
if [[ "$BUILD_ENVIRONMENT" != *win-* ]]; then
# Save the absolute path in case later we chdir (as occurs in the gpu perf test)
script_dir="$( cd "$(dirname "${BASH_SOURCE[0]}")" || exit ; pwd -P )"
if which sccache > /dev/null; then
# Save sccache logs to file
sccache --stop-server > /dev/null 2>&1 || true
rm -f ~/sccache_error.log || true
function sccache_epilogue() {
echo "::group::Sccache Compilation Log"
echo '=================== sccache compilation log ==================='
python "$script_dir/print_sccache_log.py" ~/sccache_error.log 2>/dev/null || true
echo '=========== If your build fails, please take a look at the log above for possible reasons ==========='
sccache --show-stats
sccache --stop-server || true
echo "::endgroup::"
}
# Register the function here so that the error log can be printed even when
# sccache fails to start, i.e. timeout error
trap_add sccache_epilogue EXIT
if [[ -n "${SKIP_SCCACHE_INITIALIZATION:-}" ]]; then
# sccache --start-server seems to hang forever on self-hosted runners for GHA,
# so let's just go ahead and skip the --start-server altogether since it seems
# as though sccache still gets used even when the sccache server isn't started
# explicitly
echo "Skipping sccache server initialization, setting environment variables"
export SCCACHE_IDLE_TIMEOUT=1200
export SCCACHE_ERROR_LOG=~/sccache_error.log
export RUST_LOG=sccache::server=error
elif [[ "${BUILD_ENVIRONMENT}" == *rocm* ]]; then
SCCACHE_ERROR_LOG=~/sccache_error.log SCCACHE_IDLE_TIMEOUT=0 sccache --start-server
else
# increasing SCCACHE_IDLE_TIMEOUT so that extension_backend_test.cpp can build after this PR:
# https://github.com/pytorch/pytorch/pull/16645
SCCACHE_ERROR_LOG=~/sccache_error.log SCCACHE_IDLE_TIMEOUT=1200 RUST_LOG=sccache::server=error sccache --start-server
fi
# Report sccache stats for easier debugging
sccache --zero-stats
fi
if which ccache > /dev/null; then
# Report ccache stats for easier debugging
ccache --zero-stats
ccache --show-stats
function ccache_epilogue() {
ccache --show-stats
}
trap_add ccache_epilogue EXIT
fi
fi

.ci/pytorch/common.sh (new file, 28 lines)

@ -0,0 +1,28 @@
#!/bin/bash
# Common setup for all Jenkins scripts
# shellcheck source=./common_utils.sh
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
set -ex
# Required environment variables:
# $BUILD_ENVIRONMENT (should be set by your Docker image)
# Figure out which Python to use for ROCm
if [[ "${BUILD_ENVIRONMENT}" == *rocm* ]]; then
# HIP_PLATFORM is auto-detected by hipcc; unset to avoid build errors
unset HIP_PLATFORM
export PYTORCH_TEST_WITH_ROCM=1
# temporary to locate some kernel issues on the CI nodes
export HSAKMT_DEBUG_LEVEL=4
# improve rccl performance for distributed tests
export HSA_FORCE_FINE_GRAIN_PCIE=1
fi
# TODO: Re-enable libtorch testing for MacOS, see https://github.com/pytorch/pytorch/issues/62598
# shellcheck disable=SC2034
BUILD_TEST_LIBTORCH=0
retry () {
"$@" || (sleep 1 && "$@") || (sleep 2 && "$@")
}


@ -9,6 +9,10 @@ log() { printf '%s\n' "$*"; }
error() { log "ERROR: $*" >&2; } error() { log "ERROR: $*" >&2; }
fatal() { error "$@"; exit 1; } fatal() { error "$@"; exit 1; }
retry () {
"$@" || (sleep 10 && "$@") || (sleep 20 && "$@") || (sleep 40 && "$@")
}
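# Usage sketch (hedged): prefix any flaky command, e.g.
#   retry curl https://ossci-linux.s3.amazonaws.com/bazel-4.2.1-linux-x86_64 -o tools/bazel
# The command is retried up to three more times with increasing sleeps (10s, 20s, 40s).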
# compositional trap taken from https://stackoverflow.com/a/7287873/23845 # compositional trap taken from https://stackoverflow.com/a/7287873/23845
# appends a command to a trap # appends a command to a trap
# #
@ -49,6 +53,12 @@ function assert_git_not_dirty() {
fi fi
} }
function pip_install_whl() {
# This is used to install PyTorch and other build artifacts wheel locally
# without using any network connection
python3 -mpip install --no-index --no-deps "$@"
}
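# Usage sketch (hedged; the wheel path comes from the bdist_wheel step in the build scripts):
#   python setup.py bdist_wheel
#   pip_install_whl "$(echo dist/*.whl)"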
function pip_install() { function pip_install() {
# retry 3 times # retry 3 times
# old versions of pip don't have the "--progress-bar" flag # old versions of pip don't have the "--progress-bar" flag
@ -72,12 +82,12 @@ function get_exit_code() {
function get_bazel() { function get_bazel() {
if [[ $(uname) == "Darwin" ]]; then if [[ $(uname) == "Darwin" ]]; then
# download bazel version # download bazel version
curl https://github.com/bazelbuild/bazel/releases/download/4.2.1/bazel-4.2.1-darwin-x86_64 -Lo tools/bazel retry curl https://github.com/bazelbuild/bazel/releases/download/4.2.1/bazel-4.2.1-darwin-x86_64 -Lo tools/bazel
# verify content # verify content
echo '74d93848f0c9d592e341e48341c53c87e3cb304a54a2a1ee9cff3df422f0b23c tools/bazel' | shasum -a 256 -c >/dev/null echo '74d93848f0c9d592e341e48341c53c87e3cb304a54a2a1ee9cff3df422f0b23c tools/bazel' | shasum -a 256 -c >/dev/null
else else
# download bazel version # download bazel version
curl https://ossci-linux.s3.amazonaws.com/bazel-4.2.1-linux-x86_64 -o tools/bazel retry curl https://ossci-linux.s3.amazonaws.com/bazel-4.2.1-linux-x86_64 -o tools/bazel
# verify content # verify content
echo '1a4f3a3ce292307bceeb44f459883859c793436d564b95319aacb8af1f20557c tools/bazel' | shasum -a 256 -c >/dev/null echo '1a4f3a3ce292307bceeb44f459883859c793436d564b95319aacb8af1f20557c tools/bazel' | shasum -a 256 -c >/dev/null
fi fi
@ -95,25 +105,21 @@ function get_pinned_commit() {
cat .github/ci_commit_pins/"${1}".txt cat .github/ci_commit_pins/"${1}".txt
} }
function install_torchtext() {
local commit
commit=$(get_pinned_commit text)
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/text.git@${commit}"
}
function install_torchvision() { function install_torchvision() {
local commit local commit
commit=$(get_pinned_commit vision) commit=$(get_pinned_commit vision)
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/vision.git@${commit}" pip_install --no-use-pep517 --user "git+https://github.com/pytorch/vision.git@${commit}"
} }
function checkout_install_torchvision() {
local commit
commit=$(get_pinned_commit vision)
git clone https://github.com/pytorch/vision
pushd vision
git checkout "${commit}"
time python setup.py install
popd
}
function clone_pytorch_xla() { function clone_pytorch_xla() {
if [[ ! -d ./xla ]]; then if [[ ! -d ./xla ]]; then
git clone --recursive --quiet https://github.com/pytorch/xla.git git clone --recursive -b r2.0 --quiet https://github.com/pytorch/xla.git
pushd xla pushd xla
# pin the xla hash so that we don't get broken by changes to xla # pin the xla hash so that we don't get broken by changes to xla
git checkout "$(cat ../.github/ci_commit_pins/xla.txt)" git checkout "$(cat ../.github/ci_commit_pins/xla.txt)"
@ -123,24 +129,103 @@ function clone_pytorch_xla() {
fi fi
} }
function install_torchdynamo() { function install_filelock() {
local commit pip_install filelock
commit=$(get_pinned_commit torchdynamo)
pip_install --user "git+https://github.com/pytorch/torchdynamo.git@${commit}"
} }
function checkout_install_torchdynamo() { function install_triton() {
local commit local commit
commit=$(get_pinned_commit torchdynamo) commit=$(get_pinned_commit triton)
local short_hash
short_hash=$(echo "${commit}"|cut -c -10)
local index_url
index_url=https://download.pytorch.org/whl/nightly/cpu
if [[ "${TEST_CONFIG}" == *rocm* ]]; then
echo "skipping triton due to rocm"
elif pip install "pytorch-triton==2.0.0+${short_hash}" --index-url "${index_url}"; then
echo "Using prebuilt version ${short_hash}"
else
if [[ "${BUILD_ENVIRONMENT}" == *gcc7* ]]; then
# Triton needs gcc-9 to build
sudo apt-get install -y g++-9
CXX=g++-9 pip_install --user "git+https://github.com/openai/triton@${commit}#subdirectory=python"
elif [[ "${BUILD_ENVIRONMENT}" == *clang* ]]; then
# Triton needs <filesystem>, which surprisingly is not available with the clang-9 toolchain
sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test
sudo apt-get install -y g++-9
CXX=g++-9 pip_install --user "git+https://github.com/openai/triton@${commit}#subdirectory=python"
else
pip_install --user "git+https://github.com/openai/triton@${commit}#subdirectory=python"
fi
pip_install --user jinja2
fi
}
function setup_torchdeploy_deps(){
conda install -y -n "py_${ANACONDA_PYTHON_VERSION}" "libpython-static=${ANACONDA_PYTHON_VERSION}"
local CC
local CXX
CC="$(which gcc)"
CXX="$(which g++)"
export CC
export CXX
pip install --upgrade pip
}
function checkout_install_torchdeploy() {
local commit
commit=$(get_pinned_commit multipy)
setup_torchdeploy_deps
pushd .. pushd ..
git clone https://github.com/pytorch/torchdynamo git clone --recurse-submodules https://github.com/pytorch/multipy.git
pushd torchdynamo pushd multipy
git checkout "${commit}" git checkout "${commit}"
time python setup.py develop python multipy/runtime/example/generate_examples.py
pip install -e . --install-option="--cudatests"
popd popd
popd popd
} }
function test_torch_deploy(){
pushd ..
pushd multipy
./multipy/runtime/build/test_deploy
./multipy/runtime/build/test_deploy_gpu
popd
popd
}
function install_huggingface() {
local commit
commit=$(get_pinned_commit huggingface)
pip_install pandas
pip_install scipy
pip_install "git+https://github.com/huggingface/transformers.git@${commit}#egg=transformers"
}
function install_timm() {
local commit
commit=$(get_pinned_commit timm)
pip_install pandas
pip_install scipy
pip_install "git+https://github.com/rwightman/pytorch-image-models@${commit}"
}
function checkout_install_torchbench() {
git clone https://github.com/pytorch/benchmark torchbench
pushd torchbench
git checkout no_torchaudio
if [ "$1" ]; then
python install.py --continue_on_fail models "$@"
else
# Occasionally the installation may fail on one model but it is ok to continue
# to install and test other models
python install.py --continue_on_fail
fi
popd
}
function test_functorch() { function test_functorch() {
python test/run_test.py --functorch --verbose python test/run_test.py --functorch --verbose
} }


@ -35,11 +35,13 @@ fi
cross_compile_arm64() { cross_compile_arm64() {
# Cross compilation for arm64 # Cross compilation for arm64
USE_DISTRIBUTED=1 CMAKE_OSX_ARCHITECTURES=arm64 MACOSX_DEPLOYMENT_TARGET=11.0 USE_MKLDNN=OFF USE_QNNPACK=OFF WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel # Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests
# that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448
USE_DISTRIBUTED=0 CMAKE_OSX_ARCHITECTURES=arm64 MACOSX_DEPLOYMENT_TARGET=11.0 USE_MKLDNN=OFF USE_QNNPACK=OFF WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel
} }
compile_x86_64() { compile_x86_64() {
USE_DISTRIBUTED=1 WERROR=1 python setup.py bdist_wheel USE_DISTRIBUTED=0 WERROR=1 python setup.py bdist_wheel
} }
build_lite_interpreter() { build_lite_interpreter() {

.ci/pytorch/macos-common.sh (new executable file, 14 lines)

@ -0,0 +1,14 @@
#!/bin/bash
# Common prelude for macos-build.sh and macos-test.sh
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
sysctl -a | grep machdep.cpu
# These are required for both the build job and the test job.
# In the latter to test cpp extensions.
export MACOSX_DEPLOYMENT_TARGET=10.9
export CXX=clang++
export CC=clang


@ -4,40 +4,9 @@
# shellcheck source=./macos-common.sh # shellcheck source=./macos-common.sh
source "$(dirname "${BASH_SOURCE[0]}")/macos-common.sh" source "$(dirname "${BASH_SOURCE[0]}")/macos-common.sh"
conda install -y six if [[ -n "$CONDA_ENV" ]]; then
if [[ ${BUILD_ENVIRONMENT} = *arm64* ]]; then # Use binaries under conda environment
pip install hypothesis "expecttest==0.1.3" "librosa>=0.6.2" "numba==0.56.0" psutil "scipy==1.9.0" export PATH="$CONDA_ENV/bin":$PATH
else
pip install hypothesis "expecttest==0.1.3" "librosa>=0.6.2" "numba<=0.49.1" psutil "scipy==1.6.3"
fi
# TODO move this to docker
# Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014
pip install "unittest-xml-reporting<=3.2.0,>=2.0.0" \
pytest \
pytest-xdist \
pytest-shard \
pytest-rerunfailures \
"xdoctest==1.0.2" \
"pygments==2.12.0" \
"opt-einsum>=3.3"
if [ -z "${CI}" ]; then
rm -rf "${WORKSPACE_DIR}"/miniconda3/lib/python3.6/site-packages/torch*
fi
export CMAKE_PREFIX_PATH=${WORKSPACE_DIR}/miniconda3/
# Test PyTorch
if [ -z "${CI}" ]; then
export DEVELOPER_DIR=/Applications/Xcode9.app/Contents/Developer
fi
# Download torch binaries in the test jobs
if [ -z "${CI}" ]; then
rm -rf "${WORKSPACE_DIR}"/miniconda3/lib/python3.6/site-packages/torch*
aws s3 cp s3://ossci-macos-build/pytorch/"${IMAGE_COMMIT_TAG}".7z "${IMAGE_COMMIT_TAG}".7z
7z x "${IMAGE_COMMIT_TAG}".7z -o"${WORKSPACE_DIR}/miniconda3/lib/python3.6/site-packages"
fi fi
# Test that OpenMP is enabled for non-arm64 build # Test that OpenMP is enabled for non-arm64 build
@ -113,13 +82,34 @@ test_libtorch() {
fi fi
} }
print_cmake_info() {
CMAKE_EXEC=$(which cmake)
echo "$CMAKE_EXEC"
CONDA_INSTALLATION_DIR=$(dirname "$CMAKE_EXEC")
# Print all libraries under cmake rpath for debugging
ls -la "$CONDA_INSTALLATION_DIR/../lib"
export CMAKE_EXEC
# Explicitly add conda env lib folder to cmake rpath to address the flaky issue
# where cmake dependencies couldn't be found. This seems to point to how conda
# links $CMAKE_EXEC to its package cache when cloning a new environment
install_name_tool -add_rpath @executable_path/../lib "${CMAKE_EXEC}" || true
# Adding the rpath will invalidate the cmake signature, so sign it again here
# to trust the executable; otherwise it crashes with EXC_BAD_ACCESS (SIGKILL (Code Signature Invalid))
# and an exit code of 137
codesign -f -s - "${CMAKE_EXEC}" || true
}
test_custom_backend() { test_custom_backend() {
print_cmake_info
echo "Testing custom backends" echo "Testing custom backends"
pushd test/custom_backend pushd test/custom_backend
rm -rf build && mkdir build rm -rf build && mkdir build
pushd build pushd build
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')" SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
CMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" cmake .. CMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" "${CMAKE_EXEC}" ..
make VERBOSE=1 make VERBOSE=1
popd popd
@ -134,13 +124,15 @@ test_custom_backend() {
} }
test_custom_script_ops() { test_custom_script_ops() {
print_cmake_info
echo "Testing custom script operators" echo "Testing custom script operators"
pushd test/custom_operator pushd test/custom_operator
# Build the custom operator library. # Build the custom operator library.
rm -rf build && mkdir build rm -rf build && mkdir build
pushd build pushd build
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')" SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
CMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" cmake .. CMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" "${CMAKE_EXEC}" ..
make VERBOSE=1 make VERBOSE=1
popd popd
@ -154,13 +146,15 @@ test_custom_script_ops() {
} }
test_jit_hooks() { test_jit_hooks() {
print_cmake_info
echo "Testing jit hooks in cpp" echo "Testing jit hooks in cpp"
pushd test/jit_hooks pushd test/jit_hooks
# Build the custom operator library. # Build the custom operator library.
rm -rf build && mkdir build rm -rf build && mkdir build
pushd build pushd build
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')" SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
CMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" cmake .. CMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" "${CMAKE_EXEC}" ..
make VERBOSE=1 make VERBOSE=1
popd popd
@ -172,12 +166,6 @@ test_jit_hooks() {
assert_git_not_dirty assert_git_not_dirty
} }
test_dynamo() {
pushd ../torchdynamo
pytest test
popd
}
if [[ "${TEST_CONFIG}" == *functorch* ]]; then if [[ "${TEST_CONFIG}" == *functorch* ]]; then
test_functorch test_functorch
elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then
@ -190,11 +178,9 @@ elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then
test_custom_backend test_custom_backend
fi fi
else else
checkout_install_torchdynamo
test_python_all test_python_all
test_libtorch test_libtorch
test_custom_script_ops test_custom_script_ops
test_jit_hooks test_jit_hooks
test_custom_backend test_custom_backend
test_dynamo
fi fi


@ -8,11 +8,6 @@
source "$(dirname "${BASH_SOURCE[0]}")/common.sh" source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
echo "Testing pytorch" echo "Testing pytorch"
if [ -n "${CI}" ]; then
# TODO move this to docker
# Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014
pip_install "unittest-xml-reporting<=3.2.0,>=2.0.0"
fi
# Disabling tests to see if they solve timeout issues; see https://github.com/pytorch/pytorch/issues/70015 # Disabling tests to see if they solve timeout issues; see https://github.com/pytorch/pytorch/issues/70015
# python tools/download_mnist.py --quiet -d test/cpp/api/mnist # python tools/download_mnist.py --quiet -d test/cpp/api/mnist
@ -28,8 +23,8 @@ time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_a
# FSDP tests # FSDP tests
for f in test/distributed/fsdp/*.py ; do time python test/run_test.py --verbose -i "${f#*/}" ; done for f in test/distributed/fsdp/*.py ; do time python test/run_test.py --verbose -i "${f#*/}" ; done
# ShardedTensor tests # ShardedTensor tests
time python test/run_test.py --verbose -i distributed/_shard/checkpoint/test_checkpoint time python test/run_test.py --verbose -i distributed/checkpoint/test_checkpoint
time python test/run_test.py --verbose -i distributed/_shard/checkpoint/test_file_system_checkpoint time python test/run_test.py --verbose -i distributed/checkpoint/test_file_system_checkpoint
time python test/run_test.py --verbose -i distributed/_shard/sharding_spec/test_sharding_spec time python test/run_test.py --verbose -i distributed/_shard/sharding_spec/test_sharding_spec
time python test/run_test.py --verbose -i distributed/_shard/sharding_plan/test_sharding_plan time python test/run_test.py --verbose -i distributed/_shard/sharding_plan/test_sharding_plan
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_megatron_prototype time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_megatron_prototype
@ -50,4 +45,5 @@ time python test/run_test.py --verbose -i distributed/_shard/test_partial_tensor
time python test/run_test.py --verbose -i distributed/_shard/test_replicated_tensor time python test/run_test.py --verbose -i distributed/_shard/test_replicated_tensor
# Other tests # Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx time python test/run_test.py --verbose -i test_cuda_primary_ctx
time python test/run_test.py --verbose -i test_optim -- -k optimizers_with_varying_tensors
assert_git_not_dirty assert_git_not_dirty


@ -62,7 +62,7 @@ if z_value >= 3:
raise Exception('''\n raise Exception('''\n
z-value >= 3, there is high chance of perf regression.\n z-value >= 3, there is high chance of perf regression.\n
To reproduce this regression, run To reproduce this regression, run
`cd .jenkins/pytorch/perf_test/ && bash {}.sh` on your local machine `cd .ci/pytorch/perf_test/ && bash {}.sh` on your local machine
and compare the runtime before/after your code change. and compare the runtime before/after your code change.
'''.format(test_name)) '''.format(test_name))
else: else:


@ -19,7 +19,7 @@ test_cpu_speed_torch () {
fi fi
if ! python perf-tests/modules/test_cpu_torch.py "${ARGS[@]}"; then if ! python perf-tests/modules/test_cpu_torch.py "${ARGS[@]}"; then
echo "To reproduce this regression, run \`cd .jenkins/pytorch/perf_test/ && bash ${FUNCNAME[0]}.sh\` on your local machine and compare the runtime before/after your code change." echo "To reproduce this regression, run \`cd .ci/pytorch/perf_test/ && bash ${FUNCNAME[0]}.sh\` on your local machine and compare the runtime before/after your code change."
exit 1 exit 1
fi fi
} }


@ -19,7 +19,7 @@ test_cpu_speed_torch_tensor () {
fi fi
if ! python perf-tests/modules/test_cpu_torch_tensor.py "${ARGS[@]}"; then if ! python perf-tests/modules/test_cpu_torch_tensor.py "${ARGS[@]}"; then
echo "To reproduce this regression, run \`cd .jenkins/pytorch/perf_test/ && bash ${FUNCNAME[0]}.sh\` on your local machine and compare the runtime before/after your code change." echo "To reproduce this regression, run \`cd .ci/pytorch/perf_test/ && bash ${FUNCNAME[0]}.sh\` on your local machine and compare the runtime before/after your code change."
exit 1 exit 1
fi fi
} }


@ -2,10 +2,10 @@
SCRIPT_PARENT_DIR=$(dirname "${BASH_SOURCE[0]}") SCRIPT_PARENT_DIR=$(dirname "${BASH_SOURCE[0]}")
# shellcheck source=.jenkins/pytorch/common.sh # shellcheck source=.ci/pytorch/common.sh
source "$SCRIPT_PARENT_DIR/common.sh" source "$SCRIPT_PARENT_DIR/common.sh"
cd .jenkins/pytorch/perf_test cd .ci/pytorch/perf_test
echo "Running CPU perf test for PyTorch..." echo "Running CPU perf test for PyTorch..."


@ -3,7 +3,7 @@
# shellcheck source=./common.sh # shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh" source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
pushd .jenkins/pytorch/perf_test pushd .ci/pytorch/perf_test
echo "Running GPU perf test for PyTorch..." echo "Running GPU perf test for PyTorch..."


@ -6,6 +6,9 @@
set -ex set -ex
echo "Environment variables:"
env
TORCH_INSTALL_DIR=$(python -c "import site; print(site.getsitepackages()[0])")/torch TORCH_INSTALL_DIR=$(python -c "import site; print(site.getsitepackages()[0])")/torch
TORCH_BIN_DIR="$TORCH_INSTALL_DIR"/bin TORCH_BIN_DIR="$TORCH_INSTALL_DIR"/bin
TORCH_LIB_DIR="$TORCH_INSTALL_DIR"/lib TORCH_LIB_DIR="$TORCH_INSTALL_DIR"/lib
@ -16,6 +19,7 @@ BUILD_RENAMED_DIR="build_renamed"
BUILD_BIN_DIR="$BUILD_DIR"/bin BUILD_BIN_DIR="$BUILD_DIR"/bin
export VALGRIND=ON export VALGRIND=ON
export TORCH_INDUCTOR_INSTALL_GXX=ON
if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then
# clang9 appears to miscompile code involving c10::optional<c10::SymInt>, # clang9 appears to miscompile code involving c10::optional<c10::SymInt>,
# such that valgrind complains along these lines: # such that valgrind complains along these lines:
@ -97,10 +101,6 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* || "$BUILD_ENVIRONMENT" == *rocm* ]]; then
export PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" export PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda"
fi fi
if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then
export BUILD_SPLIT_CUDA=ON
fi
if [[ "$TEST_CONFIG" == *crossref* ]]; then if [[ "$TEST_CONFIG" == *crossref* ]]; then
export PYTORCH_TEST_WITH_CROSSREF=1 export PYTORCH_TEST_WITH_CROSSREF=1
fi fi
@ -109,12 +109,8 @@ if [[ "$TEST_CONFIG" == *dynamo* ]]; then
export PYTORCH_TEST_WITH_DYNAMO=1 export PYTORCH_TEST_WITH_DYNAMO=1
fi fi
# TODO: this condition is never true, need to fix this. if [[ "$TEST_CONFIG" == *inductor* ]]; then
if [[ -n "$PR_NUMBER" ]] && [[ -z "$CI_MASTER" || "$CI_MASTER" == "false" ]]; then export PYTORCH_TEST_WITH_INDUCTOR=1
# skip expensive checks when on PR and CI_MASTER flag is not set
export PYTORCH_TEST_SKIP_CUDA_MEM_LEAK_CHECK=1
else
export PYTORCH_TEST_SKIP_CUDA_MEM_LEAK_CHECK=0
fi fi
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
@ -125,7 +121,7 @@ fi
if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
# JIT C++ extensions require ninja. # JIT C++ extensions require ninja.
pip_install --user ninja pip_install --user "ninja==1.10.2"
# ninja is installed in $HOME/.local/bin, e.g., /var/lib/jenkins/.local/bin for CI user jenkins # ninja is installed in $HOME/.local/bin, e.g., /var/lib/jenkins/.local/bin for CI user jenkins
# but this script should be runnable by any user, including root # but this script should be runnable by any user, including root
export PATH="$HOME/.local/bin:$PATH" export PATH="$HOME/.local/bin:$PATH"
@ -135,9 +131,8 @@ fi
# if you're not careful. Check this if you made some changes and the # if you're not careful. Check this if you made some changes and the
# ASAN test is not working # ASAN test is not working
if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
# Suppress vptr violations arising from multiple copies of pybind11
export ASAN_OPTIONS=detect_leaks=0:symbolize=1:detect_stack_use_after_return=1:strict_init_order=true:detect_odr_violation=0 export ASAN_OPTIONS=detect_leaks=0:symbolize=1:detect_stack_use_after_return=1:strict_init_order=true:detect_odr_violation=0
export UBSAN_OPTIONS=print_stacktrace=1:suppressions=$PWD/ubsan.supp export UBSAN_OPTIONS=print_stacktrace=1
export PYTORCH_TEST_WITH_ASAN=1 export PYTORCH_TEST_WITH_ASAN=1
export PYTORCH_TEST_WITH_UBSAN=1 export PYTORCH_TEST_WITH_UBSAN=1
# TODO: Figure out how to avoid hard-coding these paths # TODO: Figure out how to avoid hard-coding these paths
@ -180,12 +175,17 @@ if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
ulimit -s 81920 ulimit -s 81920
(cd test && python -c "import torch; print(torch.__version__, torch.version.git_version)") (cd test && python -c "import torch; print(torch.__version__, torch.version.git_version)")
echo "The next three invocations are expected to crash; if they don't that means ASAN/UBSAN is misconfigured" echo "The next four invocations are expected to crash; if they don't that means ASAN/UBSAN is misconfigured"
(cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_csrc_asan(3)") (cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_csrc_asan(3)")
(cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_csrc_ubsan(0)") (cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_csrc_ubsan(0)")
(cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_vptr_ubsan()")
(cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_aten_asan(3)") (cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_aten_asan(3)")
fi fi
if [[ "$BUILD_ENVIRONMENT" == *-tsan* ]]; then
export PYTORCH_TEST_WITH_TSAN=1
fi
if [[ $TEST_CONFIG == 'nogpu_NO_AVX2' ]]; then if [[ $TEST_CONFIG == 'nogpu_NO_AVX2' ]]; then
export ATEN_CPU_CAPABILITY=default export ATEN_CPU_CAPABILITY=default
elif [[ $TEST_CONFIG == 'nogpu_AVX512' ]]; then elif [[ $TEST_CONFIG == 'nogpu_AVX512' ]]; then
@ -219,6 +219,7 @@ test_dynamo_shard() {
echo "NUM_TEST_SHARDS must be defined to run a Python test shard" echo "NUM_TEST_SHARDS must be defined to run a Python test shard"
exit 1 exit 1
fi fi
python tools/dynamo/verify_dynamo.py
# Temporarily disable test_fx for dynamo pending the investigation on TTS # Temporarily disable test_fx for dynamo pending the investigation on TTS
# regression in https://github.com/pytorch/torchdynamo/issues/784 # regression in https://github.com/pytorch/torchdynamo/issues/784
time python test/run_test.py \ time python test/run_test.py \
@ -239,12 +240,172 @@ test_dynamo_shard() {
test_python_dispatch \ test_python_dispatch \
test_fx \ test_fx \
test_package \ test_package \
test_vmap \ test_legacy_vmap \
--shard "$1" "$NUM_TEST_SHARDS" \ --shard "$1" "$NUM_TEST_SHARDS" \
--verbose --verbose
assert_git_not_dirty assert_git_not_dirty
} }
test_inductor_distributed() {
# This runs on both single-GPU and multi-GPU instances. It should be smart about skipping tests that aren't supported
# if the required number of GPUs isn't available
PYTORCH_TEST_WITH_INDUCTOR=0 python test/run_test.py --include distributed/test_dynamo_distributed --verbose
assert_git_not_dirty
}
test_inductor() {
python tools/dynamo/verify_dynamo.py
python test/run_test.py --include test_modules test_ops test_ops_gradients test_torch --verbose
PYTORCH_TEST_WITH_INDUCTOR=0 python test/run_test.py --include inductor/test_torchinductor inductor/test_torchinductor_opinfo --verbose
}
test_single_dynamo_benchmark() {
# Usage: test_single_dynamo_benchmark inductor_inference huggingface 0 --args-for-script
# Using the test-reports directory under the test folder allows the CI to automatically pick up
# the test reports and upload them to S3. The full path is needed here, otherwise the script
# will complain about the file not being found later on
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
local name="$1"
shift
local suite="$1"
shift
# shard id is mandatory, even if it is not passed
local shard_id="$1"
shift
local partition_flags=()
if [[ -n "$NUM_TEST_SHARDS" && -n "$shard_id" ]]; then
partition_flags=( --total-partitions 2 --partition-id "$shard_id" )
fi
# Feel free to remove --device cuda if you ever need to
# test CPU as well in CI
python "benchmarks/dynamo/$suite.py" \
--ci --accuracy --timing --explain --device cuda \
"$@" "${partition_flags[@]}" \
--output "$TEST_REPORTS_DIR/${name}_${suite}.csv"
python benchmarks/dynamo/check_csv.py \
-f "$TEST_REPORTS_DIR/${name}_${suite}.csv"
}
test_aot_eager_benchmark() {
# Usage: test_aot_eager_benchmark huggingface 0
local exit_status=0
# Check inference with --float32
test_single_dynamo_benchmark "aot_eager_inference" "$@" --backend aot_eager || exit_status=$?
# Check training with --amp
test_single_dynamo_benchmark "aot_eager_training" "$@" --backend aot_eager --training --amp || exit_status=$?
if [[ $exit_status -ne 0 ]]; then
echo "Some benchmarks failed; scroll up for details"
fi
return $exit_status
}
test_inductor_benchmark() {
# Usage: test_inductor_benchmark huggingface 0
# Check inference with --float32
test_single_dynamo_benchmark "inductor_inference" "$@" --inductor
# Check training with --amp
test_single_dynamo_benchmark "inductor_training" "$@" --inductor --training --amp
# Check inference with --dynamic-shapes
test_single_dynamo_benchmark "dynamic_inductor-inference" "$@" --inductor --dynamic-shapes
}
test_inductor_benchmark_perf() {
# Using the test-reports directory under the test folder allows the CI to automatically pick up
# the test reports and upload them to S3. The full path is needed here, otherwise the script
# will complain about the file not being found later on
TEST_REPORTS_DIR=$(pwd)/test/test-reports
PARTITION_FLAGS=""
if [[ -n "$NUM_TEST_SHARDS" && -n "$2" ]]; then
PARTITION_FLAGS="--total-partitions 2 --partition-id $2"
fi
mkdir -p "$TEST_REPORTS_DIR"
# Check training with --amp
# Not checking accuracy for perf test for now
# shellcheck disable=SC2086
if [[ "$1" == *smoketest* ]]; then
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \
--batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \
--output "$TEST_REPORTS_DIR"/inductor_training_$1.csv
# the reference speedup value is hardcoded in check_hf_bert_perf_csv.py
# this value needs to be actively maintained to make this check useful
python benchmarks/dynamo/check_hf_bert_perf_csv.py -f "$TEST_REPORTS_DIR"/inductor_training_$1.csv
# Check memory compression ratio for a few models
for test in hf_Albert timm_efficientdet timm_vision_transformer; do
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --amp --training \
--disable-cudagraphs --batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" \
--only $test --output "$TEST_REPORTS_DIR"/inductor_training_$1_$test.csv
cat "$TEST_REPORTS_DIR"/inductor_training_$1_$test.csv
python benchmarks/dynamo/check_memory_compression_ratio.py --actual \
"$TEST_REPORTS_DIR"/inductor_training_$1_$test.csv \
--expected benchmarks/dynamo/expected_ci_perf_inductor_torchbench.csv
done
else
python benchmarks/dynamo/$1.py --ci --training --performance --disable-cudagraphs\
--device cuda --inductor --amp $PARTITION_FLAGS --output "$TEST_REPORTS_DIR"/inductor_training_$1.csv
fi
}
# No sharding for the periodic job, we don't care if latency is bad
test_aot_eager_all() {
local exit_status=0
PYTHONPATH=$(pwd)/torchbench test_aot_eager_benchmark torchbench "" "$@" || exit_status=$?
test_aot_eager_benchmark huggingface "" "$@" || exit_status=$?
test_aot_eager_benchmark timm_models "" "$@" || exit_status=$?
if [[ $exit_status -ne 0 ]]; then
echo "Some benchmarks failed; scroll up for details"
fi
return $exit_status
}
test_inductor_huggingface() {
test_inductor_benchmark huggingface ""
}
test_inductor_huggingface_perf() {
test_inductor_benchmark_perf huggingface
}
test_inductor_timm_shard() {
if [[ -z "$NUM_TEST_SHARDS" ]]; then
echo "NUM_TEST_SHARDS must be defined to run a Python test shard"
exit 1
fi
test_inductor_benchmark timm_models "$1"
}
test_inductor_timm_perf_shard() {
if [[ -z "$NUM_TEST_SHARDS" ]]; then
echo "NUM_TEST_SHARDS must be defined to run a Python test shard"
exit 1
fi
test_inductor_benchmark_perf timm_models "$1"
}
test_inductor_torchbench() {
PYTHONPATH=$(pwd)/torchbench test_inductor_benchmark torchbench ""
}
test_inductor_torchbench_perf() {
PYTHONPATH=$(pwd)/torchbench test_inductor_benchmark_perf torchbench
}
test_inductor_torchbench_smoketest_perf(){
PYTHONPATH=$(pwd)/torchbench test_inductor_benchmark_perf smoketest
}
test_python_gloo_with_tls() {
source "$(dirname "${BASH_SOURCE[0]}")/run_glootls_test.sh"
assert_git_not_dirty
@@ -323,20 +484,25 @@ test_libtorch() {
ln -sf "$TORCH_LIB_DIR"/libshm* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libtorch* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libtbb* "$TORCH_BIN_DIR"
+ln -sf "$TORCH_LIB_DIR"/libnvfuser* "$TORCH_BIN_DIR"
# Start background download
python tools/download_mnist.py --quiet -d test/cpp/api/mnist &
# Make test_reports directory
# NB: the ending test_libtorch must match the current function name for the current
-# test reporting process (in print_test_stats.py) to function as expected.
+# test reporting process to function as expected.
TEST_REPORTS_DIR=test/test-reports/cpp-unittest/test_libtorch
mkdir -p $TEST_REPORTS_DIR
-# Run JIT cpp tests
-python test/cpp/jit/tests_setup.py setup
+if [[ "$BUILD_ENVIRONMENT" != *-tsan* ]]; then
+# Run JIT cpp tests
+python test/cpp/jit/tests_setup.py setup
+fi
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
"$TORCH_BIN_DIR"/test_jit --gtest_output=xml:$TEST_REPORTS_DIR/test_jit.xml
+"$TORCH_BIN_DIR"/nvfuser_tests --gtest_output=xml:$TEST_REPORTS_DIR/nvfuser_tests.xml
else
"$TORCH_BIN_DIR"/test_jit --gtest_filter='-*CUDA' --gtest_output=xml:$TEST_REPORTS_DIR/test_jit.xml
fi
@@ -348,19 +514,19 @@ test_libtorch() {
"$TORCH_BIN_DIR"/test_lazy --gtest_output=xml:$TEST_REPORTS_DIR/test_lazy.xml
fi
-python test/cpp/jit/tests_setup.py shutdown
+if [[ "$BUILD_ENVIRONMENT" != *-tsan* ]]; then
+python test/cpp/jit/tests_setup.py shutdown
+fi
# Wait for background download to finish
wait
# Exclude IMethodTest that relies on torch::deploy, which will instead be ran in test_deploy.
OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="test/cpp/api/mnist" "$TORCH_BIN_DIR"/test_api --gtest_filter='-IMethodTest.*' --gtest_output=xml:$TEST_REPORTS_DIR/test_api.xml
"$TORCH_BIN_DIR"/test_tensorexpr --gtest_output=xml:$TEST_REPORTS_DIR/test_tensorexpr.xml
-# TODO: this condition is never (BUILD_ENVIRONMENT doesn't start with pytorch-), need to fix this.
-if [[ "${BUILD_ENVIRONMENT}" == pytorch-linux-xenial-py3* ]]; then
-if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* && "${BUILD_ENVIRONMENT}" != *asan* ]]; then
-# TODO: Consider to run static_runtime_test from $TORCH_BIN_DIR (may need modify build script)
-"$BUILD_BIN_DIR"/static_runtime_test --gtest_output=xml:$TEST_REPORTS_DIR/static_runtime_test.xml
-fi
+if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* && "${BUILD_ENVIRONMENT}" != *asan* ]]; then
+# TODO: Consider to run static_runtime_test from $TORCH_BIN_DIR (may need modify build script)
+"$BUILD_BIN_DIR"/static_runtime_test --gtest_output=xml:$TEST_REPORTS_DIR/static_runtime_test.xml
fi
assert_git_not_dirty
fi
@@ -373,7 +539,7 @@ test_aot_compilation() {
# Make test_reports directory
# NB: the ending test_libtorch must match the current function name for the current
-# test reporting process (in print_test_stats.py) to function as expected.
+# test reporting process to function as expected.
TEST_REPORTS_DIR=test/test-reports/cpp-unittest/test_aot_compilation
mkdir -p $TEST_REPORTS_DIR
if [ -f "$TORCH_BIN_DIR"/test_mobile_nnc ]; then "$TORCH_BIN_DIR"/test_mobile_nnc --gtest_output=xml:$TEST_REPORTS_DIR/test_mobile_nnc.xml; fi
@@ -387,7 +553,7 @@ test_vulkan() {
ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_TEST_DIR"
export VK_ICD_FILENAMES=/var/lib/jenkins/swiftshader/swiftshader/build/Linux/vk_swiftshader_icd.json
# NB: the ending test_vulkan must match the current function name for the current
-# test reporting process (in print_test_stats.py) to function as expected.
+# test reporting process to function as expected.
TEST_REPORTS_DIR=test/test-reports/cpp-vulkan/test_vulkan
mkdir -p $TEST_REPORTS_DIR
LD_LIBRARY_PATH=/var/lib/jenkins/swiftshader/swiftshader/build/Linux/ "$TORCH_TEST_DIR"/vulkan_api_test --gtest_output=xml:$TEST_REPORTS_DIR/vulkan_test.xml
@@ -404,7 +570,7 @@ test_distributed() {
ln -sf "$TORCH_LIB_DIR"/libtorch* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR"
# NB: the ending test_distributed must match the current function name for the current
-# test reporting process (in print_test_stats.py) to function as expected.
+# test reporting process to function as expected.
TEST_REPORTS_DIR=test/test-reports/cpp-distributed/test_distributed
mkdir -p $TEST_REPORTS_DIR
"$TORCH_BIN_DIR"/FileStoreTest --gtest_output=xml:$TEST_REPORTS_DIR/FileStoreTest.xml
@@ -428,7 +594,7 @@ test_rpc() {
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
echo "Testing RPC C++ tests"
# NB: the ending test_rpc must match the current function name for the current
-# test reporting process (in print_test_stats.py) to function as expected.
+# test reporting process to function as expected.
ln -sf "$TORCH_LIB_DIR"/libtorch* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libtbb* "$TORCH_BIN_DIR"
@@ -557,6 +723,7 @@ test_forward_backward_compatibility() {
# build torch at the base commit to generate a base function schema for comparison
git reset --hard "${SHA_TO_COMPARE}"
+git submodule sync && git submodule update --init --recursive
echo "::group::Installing Torch From Base Commit"
pip install -r requirements.txt
# shellcheck source=./common-build.sh
@@ -570,6 +737,7 @@ test_forward_backward_compatibility() {
python dump_all_function_schemas.py --filename nightly_schemas.txt
git reset --hard "${SHA1}"
+git submodule sync && git submodule update --init --recursive
# FC: verify new model can be load with old code.
if ! python ../load_torchscript_model.py /tmp/model_new.pt; then
echo "FC check failed: new model cannot be load in old code"
@@ -649,39 +817,26 @@ test_vec256() {
fi
}
-test_dynamo() {
-pushd ../torchdynamo
-pytest test
-popd
+test_docs_test() {
+.ci/pytorch/docs-test.sh
}
-test_torch_deploy() {
-python torch/csrc/deploy/example/generate_examples.py
-ln -sf "$TORCH_LIB_DIR"/libtorch* "$TORCH_BIN_DIR"
-ln -sf "$TORCH_LIB_DIR"/libshm* "$TORCH_BIN_DIR"
-ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR"
-"$TORCH_BIN_DIR"/test_deploy
-"$TORCH_BIN_DIR"/test_deploy_gpu
+test_executorch() {
+# Test torchgen generated code for Executorch.
+echo "Testing Executorch op registration"
+"$BUILD_BIN_DIR"/test_edge_op_registration
assert_git_not_dirty
}
-test_docs_test() {
-.jenkins/pytorch/docs-test.sh
-}
-if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
+if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* || "${BUILD_ENVIRONMENT}" == *-tsan* ]]; then
(cd test && python -c "import torch; print(torch.__config__.show())")
(cd test && python -c "import torch; print(torch.__config__.parallel_info())")
fi
-if [[ "${TEST_CONFIG}" == *deploy* ]]; then
-install_torchdynamo
-test_torch_deploy
-elif [[ "${TEST_CONFIG}" == *backward* ]]; then
+if [[ "${TEST_CONFIG}" == *backward* ]]; then
test_forward_backward_compatibility
# Do NOT add tests after bc check tests, see its comment.
elif [[ "${TEST_CONFIG}" == *xla* ]]; then
install_torchvision
-install_torchdynamo
build_xla
test_xla
elif [[ "$TEST_CONFIG" == 'jit_legacy' ]]; then
@@ -690,32 +845,126 @@ elif [[ "${BUILD_ENVIRONMENT}" == *libtorch* ]]; then
# TODO: run some C++ tests
echo "no-op at the moment"
elif [[ "$TEST_CONFIG" == distributed ]]; then
-install_torchdynamo
+install_filelock
+install_triton
test_distributed
# Only run RPC C++ tests on the first shard
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_rpc
fi
elif [[ "$TEST_CONFIG" == deploy ]]; then
checkout_install_torchdeploy
test_torch_deploy
elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
install_filelock
install_triton
install_huggingface
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_without_numpy
install_torchvision
-install_torchdynamo
+install_triton
test_dynamo_shard 1
test_aten
elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then
install_torchvision
-checkout_install_torchdynamo
+install_filelock
+install_triton
test_dynamo_shard 2
-test_dynamo
+elif [[ "${TEST_CONFIG}" == *aot_eager_all* ]]; then
install_torchtext
install_torchvision
install_filelock
checkout_install_torchbench
install_huggingface
install_timm
if [[ "${TEST_CONFIG}" == *dynamic* ]]; then
# NB: This code path is currently dead because dynamic shapes takes
# too long to run unsharded
test_aot_eager_all --dynamic-shapes
else
test_aot_eager_all
fi
elif [[ "${TEST_CONFIG}" == *aot_eager_huggingface* ]]; then
install_torchvision
install_filelock
install_huggingface
if [[ "${TEST_CONFIG}" == *dynamic* ]]; then
test_aot_eager_benchmark huggingface "" --dynamic-shapes
else
test_aot_eager_benchmark huggingface ""
fi
elif [[ "${TEST_CONFIG}" == *aot_eager_timm* && $NUM_TEST_SHARDS -gt 1 ]]; then
install_torchvision
install_filelock
install_timm
id=$((SHARD_NUMBER-1))
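# SHARD_NUMBER is 1-based, but the benchmark partition id is 0-based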
if [[ "${TEST_CONFIG}" == *dynamic* ]]; then
test_aot_eager_benchmark timm_models "$id" --dynamic-shapes
else
test_aot_eager_benchmark timm_models "$id"
fi
elif [[ "${TEST_CONFIG}" == *aot_eager_torchbench* ]]; then
install_torchtext
install_torchvision
install_filelock
checkout_install_torchbench
if [[ "${TEST_CONFIG}" == *dynamic* ]]; then
PYTHONPATH=$(pwd)/torchbench test_aot_eager_benchmark torchbench "" --dynamic-shapes
else
PYTHONPATH=$(pwd)/torchbench test_aot_eager_benchmark torchbench ""
fi
elif [[ "${TEST_CONFIG}" == *inductor_huggingface* ]]; then
install_torchvision
install_filelock
install_triton
install_huggingface
if [[ "${TEST_CONFIG}" == *inductor_huggingface_perf* ]]; then
test_inductor_huggingface_perf
else
test_inductor_huggingface
fi
elif [[ "${TEST_CONFIG}" == *inductor_timm* && $NUM_TEST_SHARDS -gt 1 ]]; then
install_torchvision
install_filelock
install_triton
install_timm
id=$((SHARD_NUMBER-1))
if [[ "${TEST_CONFIG}" == *inductor_timm_perf* && $NUM_TEST_SHARDS -gt 1 ]]; then
test_inductor_timm_perf_shard $id
else
test_inductor_timm_shard $id
fi
elif [[ "${TEST_CONFIG}" == *inductor_torchbench* ]]; then
install_torchtext
install_torchvision
install_filelock
install_triton
if [[ "${TEST_CONFIG}" == *inductor_torchbench_perf* ]]; then
checkout_install_torchbench
test_inductor_torchbench_perf
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then
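# Check out only the torchbench models used by the smoke test and the memory-compression check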
checkout_install_torchbench hf_Bert hf_Albert timm_efficientdet timm_vision_transformer
test_inductor_torchbench_smoketest_perf
else
checkout_install_torchbench
test_inductor_torchbench
fi
elif [[ "${TEST_CONFIG}" == *inductor* && "${SHARD_NUMBER}" == 1 ]]; then
install_torchvision
install_filelock
install_triton
test_inductor
test_inductor_distributed
elif [[ "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_without_numpy
install_torchvision
-install_torchdynamo
+install_triton
test_python_shard 1
test_aten
elif [[ "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then
install_torchvision
-checkout_install_torchdynamo
+install_triton
test_python_shard 2
test_libtorch
test_aot_compilation
@@ -724,7 +973,8 @@ elif [[ "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_torch_function_benchmark
elif [[ "${SHARD_NUMBER}" -gt 2 ]]; then
# Handle arbitrary number of shards
-install_torchdynamo
+install_torchvision
+install_triton
test_python_shard "$SHARD_NUMBER"
elif [[ "${BUILD_ENVIRONMENT}" == *vulkan* ]]; then
test_vulkan
@@ -732,13 +982,17 @@ elif [[ "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
test_bazel
elif [[ "${BUILD_ENVIRONMENT}" == *-mobile-lightweight-dispatch* ]]; then
test_libtorch
elif [[ "${BUILD_ENVIRONMENT}" == *-tsan* ]]; then
# TODO: TSAN check is currently failing with 415 data race warnings. This will
# be addressed later, the first PR can be merged first to setup the CI jobs
test_libtorch || true
elif [[ "${TEST_CONFIG}" = docs_test ]]; then
test_docs_test
elif [[ "${TEST_CONFIG}" == *functorch* ]]; then
test_functorch
else
install_torchvision
-install_torchdynamo
+install_triton
install_monkeytype
test_python
test_aten
@@ -749,4 +1003,5 @@ else
test_custom_backend
test_torch_function_benchmark
test_benchmarks
+test_executorch
fi


@@ -41,12 +41,12 @@ fi
export SCRIPT_HELPERS_DIR=$SCRIPT_PARENT_DIR/win-test-helpers
set +ex
-grep -E -R 'PyLong_(From|As)(Unsigned|)Long\(' --exclude=python_numbers.h torch/
+grep -E -R 'PyLong_(From|As)(Unsigned|)Long\(' --exclude=python_numbers.h --exclude=eval_frame.c torch/
PYLONG_API_CHECK=$?
if [[ $PYLONG_API_CHECK == 0 ]]; then
echo "Usage of PyLong_{From,As}{Unsigned}Long API may lead to overflow errors on Windows"
echo "because \`sizeof(long) == 4\` and \`sizeof(unsigned long) == 4\`."
-echo "Please include \"torch/csrc/python_numbers.h\" and use the correspoding APIs instead."
+echo "Please include \"torch/csrc/utils/python_numbers.h\" and use the correspoding APIs instead."
echo "PyLong_FromLong -> THPUtils_packInt32 / THPUtils_packInt64"
echo "PyLong_AsLong -> THPUtils_unpackInt (32-bit) / THPUtils_unpackLong (64-bit)"
echo "PyLong_FromUnsignedLong -> THPUtils_packUInt32 / THPUtils_packUInt64"


@@ -35,11 +35,6 @@ call %INSTALLER_DIR%\activate_miniconda3.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
-:: Install ninja and other deps
-if "%REBUILD%"=="" ( pip install -q "ninja==1.10.0.post1" dataclasses typing_extensions "expecttest==0.1.3" )
-if errorlevel 1 exit /b
-if not errorlevel 0 exit /b
:: Override VS env here
pushd .
if "%VC_VERSION%" == "" (
@@ -85,10 +80,8 @@ set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%
set DISTUTILS_USE_SDK=1
set PATH=%TMP_DIR_WIN%\bin;%PATH%
-:: Target only our CI GPU machine's CUDA arch to speed up the build, we can overwrite with env var
-:: default on circleci is Tesla T4 which has capability of 7.5, ref: https://developer.nvidia.com/cuda-gpus
-:: jenkins has M40, which is 5.2
-if "%TORCH_CUDA_ARCH_LIST%" == "" set TORCH_CUDA_ARCH_LIST=5.2
+:: The latest Windows CUDA test is running on AWS G5 runner with A10G GPU
+if "%TORCH_CUDA_ARCH_LIST%" == "" set TORCH_CUDA_ARCH_LIST=8.6
:: The default sccache idle timeout is 600, which is too short and leads to intermittent build errors.
set SCCACHE_IDLE_TIMEOUT=0
@@ -135,16 +128,22 @@ if "%REBUILD%" == "" (
if not errorlevel 0 exit /b
)
)
:: tests if BUILD_ENVIRONMENT contains cuda11 as a substring
if not x%BUILD_ENVIRONMENT:cuda11=%==x%BUILD_ENVIRONMENT% (
set BUILD_SPLIT_CUDA=ON
)
-python setup.py bdist_wheel && sccache --show-stats && python -c "import os, glob; os.system('python -mpip install ' + glob.glob('dist/*.whl')[0] + '[opt-einsum]')" (
+python setup.py bdist_wheel
+if errorlevel 1 exit /b
+if not errorlevel 0 exit /b
+sccache --show-stats
+python -c "import os, glob; os.system('python -mpip install --no-index --no-deps ' + glob.glob('dist/*.whl')[0])"
+(
if "%BUILD_ENVIRONMENT%"=="" (
echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3` in Command Prompt before running Git Bash.
) else (
-7z a %TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torch %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torchgen %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\caffe2 %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\functorch && copy /Y "%TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z" "%PYTORCH_FINAL_PACKAGE_DIR%\"
+if "%USE_CUDA%"=="1" (
+7z a %TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torch %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torchgen %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\functorch %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\nvfuser && copy /Y "%TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z" "%PYTORCH_FINAL_PACKAGE_DIR%\"
+) else (
+7z a %TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torch %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torchgen %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\functorch && copy /Y "%TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z" "%PYTORCH_FINAL_PACKAGE_DIR%\"
+)
if errorlevel 1 exit /b
if not errorlevel 0 exit /b


@@ -6,10 +6,6 @@ if not errorlevel 0 (
exit /b
)
-echo "Installing test dependencies"
-pip install networkx
-if errorlevel 1 exit /b
echo "Test functorch"
pushd test
python run_test.py --functorch --shard "%SHARD_NUMBER%" "%NUM_TEST_SHARDS%" --verbose


@@ -13,7 +13,7 @@ if not exist %CONDA_PARENT_DIR%\Miniconda3 (
)
if "%INSTALL_FRESH_CONDA%"=="1" (
-curl --retry 3 -k https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe --output %TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe
+curl --retry 3 --retry-all-errors -k https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe --output %TMP_DIR_WIN%\Miniconda3-latest-Windows-x86_64.exe
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
@@ -24,13 +24,3 @@ if "%INSTALL_FRESH_CONDA%"=="1" (
:: Activate conda so that we can use its commands, i.e. conda, python, pip
call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3
-if "%INSTALL_FRESH_CONDA%"=="1" (
-call conda install -y -q numpy"<1.23" cffi pyyaml boto3 libuv
-if errorlevel 1 exit /b
-if not errorlevel 0 exit /b
-call conda install -y -q -c conda-forge cmake=3.22.3
-if errorlevel 1 exit /b
-if not errorlevel 0 exit /b
-)


@@ -24,7 +24,7 @@ if "%CUDA_SUFFIX%" == "" (
if "%REBUILD%"=="" (
if "%BUILD_ENVIRONMENT%"=="" (
-curl --retry 3 -k https://s3.amazonaws.com/ossci-windows/magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --output %TMP_DIR_WIN%\magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z
+curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --output %TMP_DIR_WIN%\magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z
) else (
aws s3 cp s3://ossci-windows/magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z %TMP_DIR_WIN%\magma_2.5.4_%CUDA_SUFFIX%_%BUILD_TYPE%.7z --quiet
)


@@ -1,6 +1,6 @@
if "%REBUILD%"=="" (
if "%BUILD_ENVIRONMENT%"=="" (
-curl --retry 3 -k https://s3.amazonaws.com/ossci-windows/mkl_2020.2.254.7z --output %TMP_DIR_WIN%\mkl.7z
+curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/mkl_2020.2.254.7z --output %TMP_DIR_WIN%\mkl.7z
) else (
aws s3 cp s3://ossci-windows/mkl_2020.2.254.7z %TMP_DIR_WIN%\mkl.7z --quiet
)


@@ -7,8 +7,8 @@ if "%REBUILD%"=="" (
del %TMP_DIR_WIN%\bin\sccache.exe || ver > nul
del %TMP_DIR_WIN%\bin\sccache-cl.exe || ver > nul
if "%BUILD_ENVIRONMENT%"=="" (
-curl --retry 3 -k https://s3.amazonaws.com/ossci-windows/sccache.exe --output %TMP_DIR_WIN%\bin\sccache.exe
-curl --retry 3 -k https://s3.amazonaws.com/ossci-windows/sccache-cl.exe --output %TMP_DIR_WIN%\bin\sccache-cl.exe
+curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/sccache.exe --output %TMP_DIR_WIN%\bin\sccache.exe
+curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/sccache-cl.exe --output %TMP_DIR_WIN%\bin\sccache-cl.exe
) else (
aws s3 cp s3://ossci-windows/sccache.exe %TMP_DIR_WIN%\bin\sccache.exe
aws s3 cp s3://ossci-windows/sccache-cl.exe %TMP_DIR_WIN%\bin\sccache-cl.exe


@@ -14,13 +14,6 @@ call %INSTALLER_DIR%\activate_miniconda3.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
-:: extra conda dependencies for testing purposes
-if NOT "%BUILD_ENVIRONMENT%"=="" (
-call conda install -y -q mkl protobuf numba scipy=1.6.2 typing_extensions dataclasses
-if errorlevel 1 exit /b
-if not errorlevel 0 exit /b
-)
pushd .
if "%VC_VERSION%" == "" (
call "C:\Program Files (x86)\Microsoft Visual Studio\%VC_YEAR%\%VC_PRODUCT%\VC\Auxiliary\Build\vcvarsall.bat" x64
@@ -32,14 +25,6 @@ if not errorlevel 0 exit /b
@echo on
popd
-:: The version is fixed to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
-:: Pin unittest-xml-reporting to freeze printing test summary logic, related: https://github.com/pytorch/pytorch/issues/69014
-pip install "ninja==1.10.0.post1" future "hypothesis==5.35.1" "expecttest==0.1.3" "librosa>=0.6.2" "scipy==1.6.3" psutil pillow "unittest-xml-reporting<=3.2.0,>=2.0.0" pytest pytest-xdist pytest-shard pytest-rerunfailures "xdoctest==1.0.2" "pygments==2.12.0" "opt-einsum>=3.3"
-if errorlevel 1 exit /b
-if not errorlevel 0 exit /b
set DISTUTILS_USE_SDK=1
if not "%USE_CUDA%"=="1" goto cuda_build_end


@@ -1,6 +1,6 @@
call %SCRIPT_HELPERS_DIR%\setup_pytorch_env.bat
-git submodule update --init --recursive --jobs 0 third_party/pybind11
+git submodule update --init --recursive third_party/pybind11
cd test\custom_backend
:: Build the custom backend library.


@@ -1,6 +1,6 @@
call %SCRIPT_HELPERS_DIR%\setup_pytorch_env.bat
-git submodule update --init --recursive --jobs 0 third_party/pybind11
+git submodule update --init --recursive third_party/pybind11
cd test\custom_operator
:: Build the custom operator library.

Some files were not shown because too many files have changed in this diff.